Ubuntu Manpage: bgzip - Block compression/decompression utility

NAME

       bgzip - Block compression/decompression utility

       tabix - Generic indexer for TAB-delimited genome position files

SYNOPSIS

       bgzip [-cdhB] [-b virtualOffset] [-s size] [file]

       tabix [-0lf] [-p gff|bed|sam|vcf] [-s seqCol] [-b begCol] [-e endCol] [-S lineSkip] [-c metaChar]
       in.tab.bgz [region1 [region2 [...]]]

DESCRIPTION

       Tabix indexes a TAB-delimited genome position file in.tab.bgz and creates an index file in.tab.bgz.tbi
       when no region is given on the command line. The input data file must be sorted by position and
       compressed by bgzip, which has a gzip(1)-like interface. After indexing, tabix can quickly retrieve
       data lines overlapping regions specified in the format "chr:beginPos-endPos". Fast data retrieval also
       works over a network if a URI is given as the file name; in this case the index file is downloaded
       if it is not present locally.
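
       The region format can be illustrated with a small shell sketch. The helper parse_region below is
       hypothetical and is not part of tabix. It assumes that the sequence name contains no colon and
       that commas are only thousands separators, as in the example query later in this page.

```shell
#!/bin/sh
# parse_region is a hypothetical helper, not part of tabix.
# It splits "chr:beginPos-endPos" into three TAB-separated fields.
parse_region() {
    region=$(printf '%s' "$1" | tr -d ',')   # drop thousands separators
    chr=${region%%:*}                        # text before the first colon
    range=${region#*:}                       # text after the first colon
    beg=${range%%-*}                         # text before the dash
    end=${range#*-}                          # text after the dash
    printf '%s\t%s\t%s\n' "$chr" "$beg" "$end"
}

parsed=$(parse_region "chr1:10,000,000-20,000,000")
printf '%s\n' "$parsed"
```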

OPTIONS OF TABIX

       -p STR    Input format for indexing. Valid values are: gff, bed, sam, vcf and psltab. This option  should
                 not  be  applied  together with any of -s, -b, -e, -c and -0; it is not used for data retrieval
                 because this setting is stored in the index file. [gff]

       -s INT    Column of sequence name. Options -s, -b, -e, -S, -c and -0 are all stored in the index file
                 and thus are not used in data retrieval. [1]

       -b INT    Column of start chromosomal position. [4]

       -e INT    Column of end chromosomal position. The end column can be the same as the start column. [5]

       -S INT    Skip first INT lines in the data file. [0]

       -c CHAR   Skip lines starting with character CHAR. [#]

       -0        Specify that the position in the data file is 0-based (e.g. UCSC files) rather than 1-based.

       -h        Print the header/meta lines.

       -B        The second argument is a BED file. When this option is in use, the input file need not be
                 sorted or indexed. The entire input will be read sequentially. Nonetheless, with this
                 option, the format of the input must be specified correctly on the command line.

       -f        Force overwriting of the index file if it is present.

       -l        List the sequence names stored in the index file.
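
       As a sketch of how the column options combine, the script below indexes a position file that does
       not match any -p preset. The file name sites.tab, its three-column layout and its values are made
       up for illustration. The bgzip and tabix steps run only if both tools are found on the PATH.

```shell
#!/bin/sh
# Hypothetical three-column file: sequence name, 1-based position, value.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'chrom\tpos\tvalue\n'  > sites.tab
printf 'chr1\t150\t0.7\n'    >> sites.tab
printf 'chr1\t90\t0.2\n'     >> sites.tab
printf 'chr2\t30\t0.9\n'     >> sites.tab

# Sort the data lines by name and position, keeping the one-line header on top.
(head -n 1 sites.tab; tail -n +2 sites.tab | sort -k1,1 -k2,2n) > sites.sorted.tab

if command -v bgzip >/dev/null 2>&1 && command -v tabix >/dev/null 2>&1; then
    bgzip -c sites.sorted.tab > sites.tab.bgz
    # Name in column 1; start and end both in column 2; skip the header line.
    tabix -s 1 -b 2 -e 2 -S 1 sites.tab.bgz
    # Retrieve the data lines overlapping chr1:100-200.
    tabix sites.tab.bgz chr1:100-200
fi
```

       Because -s, -b, -e and -S are stored in the index, the retrieval command does not repeat them.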

EXAMPLE

       (grep ^"#" in.gff; grep -v ^"#" in.gff | sort -k1,1 -k4,4n) | bgzip > sorted.gff.gz;

       tabix -p gff sorted.gff.gz;

       tabix sorted.gff.gz chr1:10,000,000-20,000,000;
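
       The same steps can be tried on a tiny input. The following sketch writes a made-up, unsorted GFF
       file to a scratch directory, sorts it while keeping the header line first, and then compresses,
       indexes and queries it only if bgzip and tabix are installed. The records are invented for
       illustration.

```shell
#!/bin/sh
# Build a tiny, unsorted GFF file in a scratch directory (made-up records).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '##gff-version 3\n' > in.gff
printf 'chr2\tsrc\tgene\t500\t900\t.\t+\t.\tID=g3\n'   >> in.gff
printf 'chr1\tsrc\tgene\t2000\t2500\t.\t+\t.\tID=g2\n' >> in.gff
printf 'chr1\tsrc\tgene\t100\t400\t.\t-\t.\tID=g1\n'   >> in.gff

# Keep header lines first, then sort records by sequence name and start.
tab=$(printf '\t')
(grep '^#' in.gff; grep -v '^#' in.gff | sort -t "$tab" -k1,1 -k4,4n) > sorted.gff

# Compress, index and query only when the tools are available.
if command -v bgzip >/dev/null 2>&1 && command -v tabix >/dev/null 2>&1; then
    bgzip -c sorted.gff > sorted.gff.gz
    tabix -p gff sorted.gff.gz
    tabix sorted.gff.gz chr1:1-1000
fi
```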

NOTES

       Overlap queries are straightforward to implement with the standard B-tree index (with or without
       binning) found in all SQL databases, or with the R-tree index in PostgreSQL and Oracle. There are
       still several reasons to use tabix. Firstly, tabix works directly with many widely used TAB-delimited
       formats such as GFF/GTF and BED, so there is no need to design a database schema or a specialized
       binary format, and data do not need to be duplicated in different formats. Secondly, tabix works on
       compressed data files while most SQL databases do not; the GenCode annotation GTF can be compressed
       down to 4% of its original size. Thirdly, tabix is fast: the same indexing algorithm is known to work
       efficiently for an alignment with a few billion short reads, a scale that SQL databases probably
       cannot handle easily. Last but not least, tabix supports remote data retrieval. One can put the data
       file and the index on an FTP or HTTP server, and other users or even web services can then retrieve
       a slice without downloading the entire file.

AUTHOR

       Tabix  was  written by Heng Li. The BGZF library was originally implemented by Bob Handsaker and modified
       by Heng Li for remote file access and in-memory caching.
