Presented by Sveinung Gundersen
Sveinung Gundersen1, Matúš Kalaš2,3, Osman Abul4, Arnoldo Frigessi5,6, Eivind Hovig1,7,8, and Geir Kjetil Sandve8
1 Department of Tumor Biology, The Norwegian Radium Hospital, Oslo University Hospital, Montebello, 0310 Oslo, Norway.
2 Computational Biology Unit, Uni Computing, Thormøhlensgate 55, 5008 Bergen, Norway.
3 Department of Informatics, University of Bergen, Thormøhlensgate 55, 5008 Bergen, Norway.
4 TOBB University of Economics and Technology, Ankara, Turkey.
5 Statistics For Innovation, Norwegian Computing Center, 0314 Oslo, Norway.
6 Department of Biostatistics, Institute of Basic Medical Sciences, University of Oslo, Blindern, 0317 Oslo, Norway.
7 Institute for Medical Informatics, The Norwegian Radium Hospital, Oslo University Hospital, Montebello, 0310 Oslo, Norway.
8 Department of Informatics, University of Oslo, Blindern, 0316 Oslo, Norway.
A host of alternative formats for representing whole-genome datasets, such as WIG, BED, GFF, and BedGraph, are currently in use, complicating analysis and tool development. The need for different formats is driven partly by the need for extra columns for specific content, but also by differences in the underlying models of the data. We have delineated fifteen different "track types", representing different intrinsic data models, ranging from simple types such as "points" and "segments" to more complex types.
GTrack 1.0 (gtrack.no) is a recently defined tabular format that can handle data of all fifteen track types. It supports customizable specification of columns, customizable value types, and graph-type data with weights, improving on the built-in "interval" data format in Galaxy. GTrack can represent the same information as standard formats, and it also supports extensions and subtype specifications without the need to rewrite parsers. In addition, GTrack can be used for 3D-type datasets, such as Hi-C data, for which no standard formats exist.
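The idea that columns can be added without rewriting parsers can be illustrated with a minimal sketch in which the file itself declares its columns. The `#columns:` header line, the `parse_table` function, and the example data below are hypothetical and are not the GTrack syntax; the actual header syntax is defined in the specification at gtrack.no.

```python
import io


def parse_table(text):
    """Parse a tab-separated table whose column names are declared in the file.

    The '#columns:' declaration is a hypothetical stand-in for a
    column specification line; it is NOT the real GTrack syntax.
    """
    columns = None
    rows = []
    for line in io.StringIO(text):
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith("#columns:"):
            # The declared column names drive all later parsing.
            columns = line[len("#columns:"):].strip().split("\t")
            continue
        if columns is None:
            raise ValueError("column declaration must precede data lines")
        fields = line.split("\t")
        if len(fields) != len(columns):
            raise ValueError("expected %d fields, got %d" % (len(columns), len(fields)))
        rows.append(dict(zip(columns, fields)))
    return columns, rows


# Made-up example data: segments with one extra 'value' column.
EXAMPLE = (
    "#columns: seqid\tstart\tend\tvalue\n"
    "chr1\t100\t200\t0.5\n"
    "chr1\t300\t450\t1.25\n"
)

columns, rows = parse_table(EXAMPLE)
```

Because the parser reads the column names from the file, a dataset with an additional column would only need a longer declaration line, not a new parser.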
GTrack is fully supported by The Genomic HyperBrowser (hyperbrowser.uio.no). The parsers and underlying binary storage scheme of the HyperBrowser system have now been extracted into a separate Python library. The library uses a vectorized storage scheme based on NumPy objects, which allows C-type analysis performance from within Python. The library also supports most other common data formats, including conversions between them. We believe GTrack and the associated binary library are ideal for use within Galaxy tools as a backbone for high-speed analysis.
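As a rough illustration of what vectorized storage based on NumPy can look like, the sketch below keeps segment starts and ends in separate arrays and answers two questions without a Python-level loop. It is not the API of the extracted library: the function names, the end-exclusive coordinate convention, and the example coordinates are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical track of three segments on one chromosome, stored column-wise.
# Assumptions: segments are sorted, non-overlapping, and end-exclusive.
starts = np.array([100, 300, 700], dtype=np.int64)
ends = np.array([200, 450, 900], dtype=np.int64)

# Hypothetical track of points on the same chromosome.
points = np.array([50, 150, 199, 200, 310, 800, 950], dtype=np.int64)


def segment_lengths(starts, ends):
    """Length of every segment, computed in one array operation."""
    return ends - starts


def points_inside(points, starts, ends):
    """Boolean mask telling which points fall inside any segment.

    For each point, binary search finds the last segment that starts at
    or before it; the point is inside if it lies before that segment's end.
    """
    idx = np.searchsorted(starts, points, side="right") - 1
    valid = idx >= 0  # points before the first segment have no candidate
    inside = np.zeros(points.shape, dtype=bool)
    inside[valid] = points[valid] < ends[idx[valid]]
    return inside


lengths = segment_lengths(starts, ends)
mask = points_inside(points, starts, ends)
```

Both operations run in compiled NumPy code over whole arrays, which is the general mechanism by which array-based storage can approach C-type performance from Python.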