Ubuntu Manpage: bamsort - sort BAM files by coordinate or query name

Provided by: biobambam2_2.0.183+ds-1_amd64

NAME

       bamsort - sort BAM files by coordinate or query name

SYNOPSIS

       bamsort [options]

DESCRIPTION

       bamsort reads a BAM, SAM or CRAM file and sorts it by one of the following criteria:
       coordinate (lexicographical by reference sequence id and position on reference sequence),
       query name (optionally using the HI aux tag to order alignments featuring the same query
       name), a hash value computed for the query name, or an aux tag value. It writes the sorted
       file in BAM, SAM or CRAM format.

       Lexicographical order denotes that pairs (a,b) and (c,d) will be ordered such that (a,b) <
       (c,d) if either a < c or a = c and b < d. For coordinates this means that  the  alignments
       are  first grouped by reference sequence id (i.e. all alignments for one chromosome appear
       in one block) and within the block for each reference sequence the alignments are  ordered
       by the start position on this sequence.
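
       As a minimal sketch of the lexicographical order described above (not bamsort's own
       code), alignments can be modelled as (reference id, start position) pairs. The sample
       values are made up for illustration.

```python
# Hypothetical alignments reduced to (reference id, start position) pairs.
alignments = [(1, 500), (0, 900), (1, 20), (0, 100)]

# Python tuples compare lexicographically: first by reference id,
# then by start position when the reference ids are equal.
by_coordinate = sorted(alignments)

print(by_coordinate)
```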

       The order by query name decomposes each read name into components consisting only of
       digits and components containing no digits. A read name A15_30_C50 will for instance be
       split into the components A, 15, _, 30, _C and 50. Read names are compared
       lexicographically along this decomposition, where number components are compared as
       numbers. As an example we have A15<B12 because A<B, and A9<A12 because A=A and 9<12 (where
       9 and 12 are considered as numbers and not as the sequences of their digits).
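
       The decomposition and comparison can be sketched as follows. This is an illustration
       only, not the bamsort implementation. The function names decompose and name_key are
       hypothetical, and the rule for a number component meeting a text component at the same
       position is an assumption, since the text above does not specify that case.

```python
import re

def decompose(name):
    """Split a read name into alternating text and number components."""
    parts = re.findall(r"[0-9]+|[^0-9]+", name)
    return [int(p) if "0" <= p[0] <= "9" else p for p in parts]

def name_key(name):
    """Sort key that compares component by component, numbers as integers.

    Assumption: a number component sorts before a text component when the
    two meet at the same position.
    """
    return [(0, p, "") if isinstance(p, int) else (1, 0, p)
            for p in decompose(name)]

print(decompose("A15_30_C50"))
print(sorted(["A12", "B12", "A9", "A15"], key=name_key))
```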

       The order by hash value computes a hash value (effectively a random number) for each read
       name and orders the alignments by this number in increasing order. Alignments assigned the
       same hash value are ordered by query name.
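
       A rough sketch of this ordering is shown below. It uses zlib.crc32 purely as a stand-in
       hash function and plain string comparison as the tie-breaker, so the resulting order will
       not match the one bamsort produces. The read names are made up.

```python
import zlib

def hash_key(name):
    # Stand-in hash only: crc32 is used here because it ships with Python.
    # The name itself breaks ties between equal hash values.
    return (zlib.crc32(name.encode("utf-8")), name)

names = ["read3", "read1", "read2", "read1"]
shuffled = sorted(names, key=hash_key)

print(shuffled)
```

       Because the key depends only on the read name, alignments sharing a name stay adjacent
       in the output.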

       The order by aux tag compares alignments by the value of  a  given  aux  field  storing  a
       string value. This string comparison follows the same order used for comparing query names
       stated above. Alignments with the same aux value are sorted by coordinate order.

       If the memory buffer given is not sufficiently large to process the input file, then the
       program writes intermediate results to a temporary file. This file can be large and,
       depending on the compression of the input file, larger than the input itself.
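
       The general technique of spilling sorted runs to temporary files and merging them
       afterwards can be sketched as below. This is a generic external merge sort over integers,
       given under the assumption that it resembles what a bounded sorting buffer implies. It is
       not taken from the bamsort sources, and external_sort and buffer_size are hypothetical
       names.

```python
import heapq
import os
import tempfile

def external_sort(values, buffer_size):
    """Sort integers while buffering at most buffer_size items before spilling."""
    run_paths = []
    buffer = []

    def spill():
        # Sort the full buffer and write it as one run to a temporary file.
        buffer.sort()
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as handle:
            handle.writelines(f"{v}\n" for v in buffer)
        run_paths.append(path)
        buffer.clear()

    for value in values:
        buffer.append(value)
        if len(buffer) >= buffer_size:
            spill()
    if not run_paths:
        # Everything fit into the buffer, so no temporary file is needed.
        return sorted(buffer)
    if buffer:
        spill()

    handles = [open(path) for path in run_paths]
    try:
        # heapq.merge reads the runs lazily; the result is collected into a
        # list here only to keep the sketch short.
        runs = [(int(line) for line in h) for h in handles]
        return list(heapq.merge(*runs))
    finally:
        for h in handles:
            h.close()
        for path in run_paths:
            os.remove(path)

print(external_sort([5, 3, 9, 1, 7, 2, 8], buffer_size=3))
```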

       The following key=value pairs can be given:

       SO=<coordinate|queryname|hash|tag|queryname_HI|queryname_lexicographic>:  set   the   sort
       order. Valid values are

       coordinate
              sort alignments by coordinate

       queryname
              sort alignments by query name

       hash   sort  alignments  by  (Murmur3) hash of query name. This effectively puts them in a
              random order.

       tag    sort alignments by string aux field. The tag of the aux field needs to be provided
              using the sorttag key.

       queryname_HI
              sort  alignments  by query name. Alignments with identical query name are sorted by
              the value of their HI aux field.

       queryname_lexicographic
              sort alignments by query name using a purely lexicographic  comparison  instead  of
              the more sophisticated version described above.

       level=<-1|0|1|9|11>: set compression level of the output BAM file. Valid values are

       -1:    zlib/gzip default compression level

       0:     uncompressed

       1:     zlib/gzip level 1 (fast) compression

       9:     zlib/gzip level 9 (best) compression

       If libmaus has been compiled with support for igzip (see
       https://software.intel.com/en-us/articles/igzip-a-high-performance-deflate-compressor-with-optimizations-for-genomic-data)
       then an additional valid value is

       11:    igzip compression

       verbose=<1>: Valid values are

       1:     print progress report on standard error

       0:     do not print progress report

       blockmb=<1024>:  set  size of the internal memory sorting buffer in megabytes. The default
       buffer size is one gigabyte.

       tmpfile=<filename>: set the prefix for temporary file names

       disablevalidation=<0|1>: sets whether input validation is performed. Valid values are

       0:     validation is enabled (default)

       1:     validation is disabled

       md5=<0|1>: md5 checksum creation for output  file.  This  option  can  only  be  given  if
       outputformat=bam. Then valid values are

       0:     do not compute checksum. This is the default.

       1:     compute  checksum.  If  the md5filename key is set, then the checksum is written to
              the given file. If md5filename is unset, then no checksum will be computed.

       md5filename=<filename>: file name for md5 checksum if md5=1.

       index=<0|1>: compute BAM index  for  output  file.  This  option  can  only  be  given  if
       outputformat=bam. Then valid values are

       0:     do not compute BAM index. This is the default.

       1:     compute  BAM  index. If the indexfilename key is set, then the BAM index is written
              to the given file. If indexfilename is unset, then no BAM index will be computed.

       indexfilename=<filename>: file name for output BAM index if index=1.

       inputformat=<bam>: input file format.  All versions of bamsort come with support  for  the
       BAM  input  format.  If  the program in addition is linked to the io_lib package, then the
       following options are valid:

       bam:   BAM (see http://samtools.sourceforge.net/SAM1.pdf)

       sam:   SAM (see http://samtools.sourceforge.net/SAM1.pdf)

       cram:  CRAM (see http://www.ebi.ac.uk/ena/about/cram_toolkit)

       outputformat=<bam>: output file format.  All versions of bamsort come with support for the
       BAM  output  format.  If the program in addition is linked to the io_lib package, then the
       following options are valid:

       bam:   BAM (see http://samtools.sourceforge.net/SAM1.pdf)

       sam:   SAM (see http://samtools.sourceforge.net/SAM1.pdf)

       cram:  CRAM  (see  http://www.ebi.ac.uk/ena/about/cram_toolkit).  This   format   is   not
              advisable for data sorted by query name.

       I=<[stdin]>: input filename, standard input if unset.

       O=<[stdout]>: output filename, standard output if unset.

       inputthreads=<[1]>: input helper threads, only valid for inputformat=bam.

       sortthreads=<[1]>: number of threads used for sorting.

       outputthreads=<[1]>: output helper threads, only valid for outputformat=bam.

       reference=<[]>:  reference FastA file for inputformat=cram and outputformat=cram. An index
       file (.fai) is required.

       range=<>: input range to be processed. This option is only valid if the input is a
       coordinate sorted and indexed BAM file.

       fixmates=<0|1>:  fix  mate information as bamfixmateinformation would do. Input is assumed
       to be collated by query name (no changes will be applied to mates which are  not  adjacent
       in the input stream). By default this option is disabled.

       calmdnm=<0|1>:  calculate the MD and NM fields as a side effect. By default the fields are
       not calculated. Calculation is only performed if sorting is performed  by  coordinate.  If
       calmdnm=1,  then  the  parameter calmdnmreference is required.  The supported file formats
       can be found in the manual page for bammdnm.

       calmdnmreference=<[]>: name of reference sequence file if calmdnm=1.

       calmdnmrecompindetonly=<0|1>: compute MD/NM fields in the presence  of  indeterminate  (N)
       bases  only. This option is only relevant if calmdnm=1. By default the fields are computed
       for all mapped alignments if calmdnm=1.

       calmdnmwarnchange=<0|1>: warn if a computed MD/NM field differs from a previously existing
       field. By default no warnings are produced.

       adddupmarksupport=<0|1>:  add  information required for streaming duplicate marking in the
       aux fields MS and MC. Input is assumed to be  collated  by  query  name.  This  option  is
       ignored unless fixmates=1. By default it is disabled.

       markduplicates=<[0]>:  mark  duplicate  read pairs and reads. This option can only be used
       when a name collated file (all reads for a name are consecutive in the  input)  is  sorted
       into coordinate order. In addition the input is required not to contain orphan reads (pair
       ends such that the other  end  of  the  pair  is  not  contained  in  the  file).  Setting
       markduplicates=1  implies  adddupmarksupport=1. The temporarily added auxiliary fields are
       removed during output generation. The markduplicates option is disabled by default.

       rmdup=<[0]>: remove the duplicates marked by the markduplicates option. As  this  requires
       markduplicates=1, the requirements stated for markduplicates also apply for rmdup.

       tag=<tag>: name of auxiliary field storing tag information for duplicate marking in string
       form. Read fragments or pairs with different tags will not be considered as duplicates,
       even if they would be according to their mapping coordinates. For pairs the tag field
       values of the first and second mate are concatenated to obtain the tag of the pair.

       nucltag=<tag>: this option works like the tag option but is restricted to sequences of
       nucleotides (A, C, G or T) as tags. The length of each tag sequence is not allowed to
       exceed 15 bases. All tags are required to have the same length. Each non-nucleotide symbol
       is mapped to A. In contrast to the tag option, nucltag uses less memory for processing and
       can be expected to be faster.

       M=<stderr>: name of the metrics  file  for  duplicate  marking  (metrics  are  written  to
       standard error if not set)

       streaming=<0|1>: do not open input file(s) multiple times if set to 1. When given multiple
       input files bamsort concatenates the files on the fly and computes a merged header  before
       starting  the  data  processing.  Computing the header of the output file requires opening
       each input file. If each input file can only be opened once (as it may take the form of  a
       pipe  or  socket  connection), then bamsort will keep all the files open at the same time.
       Otherwise the files will be opened only  as  needed  to  keep  the  number  of  open  file
       descriptors lower.

       sorttag=: tag of aux field used for comparison when SO=tag.

       hash=<crc32prod>:  hash  used  for  producing  bamseqchksum  type  header fields in sorted
       output.
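
       As an illustration of the key=value calling convention, the sketch below assembles an
       argument list from options documented above. The helper name bamsort_args, the file names
       and the temporary file prefix are hypothetical, and the chosen values are only examples.

```python
def bamsort_args(**options):
    """Build an argument list for bamsort from key=value options."""
    return ["bamsort"] + [f"{key}={value}" for key, value in options.items()]

args = bamsort_args(
    I="input.bam",                   # hypothetical input file
    O="sorted.bam",                  # hypothetical output file
    SO="coordinate",
    blockmb=2048,
    tmpfile="/tmp/bamsort_tmp",      # hypothetical temporary file prefix
    index=1,
    indexfilename="sorted.bam.bai",  # hypothetical index file name
)

print(" ".join(args))
```

       The resulting list could be handed to subprocess.run(args, check=True) on a system where
       bamsort is installed.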

AUTHOR

       Written by German Tischler.

REPORTING BUGS

       Report bugs to <germant@miltenyibiotec.de>

COPYRIGHT

       Copyright © 2009-2016 German Tischler,  ©  2011-2014  Genome  Research  Limited.   License
       GPLv3+: GNU GPL version 3 <http://gnu.org/licenses/gpl.html>
       This  is free software: you are free to change and redistribute it.  There is NO WARRANTY,
       to the extent permitted by law.