NAME

       bamdownsamplerandom - downsample a SAM, BAM or CRAM file

SYNOPSIS

       bamdownsamplerandom [options]

DESCRIPTION

       bamdownsamplerandom  reads  a SAM, BAM or CRAM file from standard input, randomly discards
       reads and writes the remaining reads to standard output in BAM format. For a pair of reads
       either  both  ends  are  discarded or both ends are kept. The order of reads in the output
       file may be different from the order in the input if the reads in the input file  are  not
       collated by their read name.
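
       The pair-level behaviour described above can be illustrated with a minimal Python
       sketch. It is not the program's implementation: the function name downsample_by_name
       and the (name, record) tuple layout are assumptions made for illustration. One random
       draw is made per read name, so both ends of a pair share the same decision. Unlike the
       real program, which collates reads and may therefore reorder them, this sketch keeps
       all decisions in memory and preserves input order.

```python
import random

def downsample_by_name(reads, p, seed=None):
    """Keep each read name with probability p.

    reads: iterable of (name, record) tuples (hypothetical layout).
    All records sharing a name receive the same keep/discard decision.
    """
    rng = random.Random(seed)   # a fixed seed makes the selection reproducible
    decisions = {}              # read name -> True (keep) / False (discard)
    kept = []
    for name, record in reads:
        if name not in decisions:
            decisions[name] = rng.random() < p
        if decisions[name]:
            kept.append((name, record))
    return kept
```

       Calling the function twice with the same seed selects the same subset, which mirrors
       the role of the seed key.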

       The following key=value pairs can be given:

       p=<1>:  probability  for  a  pair of reads or a single end read to be kept. By default all
       reads are kept.

       seed=<>: seed used for the random number generator. By default the current time  is  used,
       i.e.  each  run of the program will select a different subset of reads from an input file.
       If the behaviour of the program needs to be reproducible a fixed number can be used as the
       random seed.

       I=<stdin>: input file name (data is read from standard input if this option is not given)
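
       As an illustration of how the p, seed and I keys combine, the following Python sketch
       builds a command line for a reproducible run and, if the program is installed, writes
       its standard output to a file. The helper names build_command and run_downsample and
       any file names passed to them are hypothetical; only keys documented on this page are
       used.

```python
import shutil
import subprocess

def build_command(p, seed, input_path):
    """Return the argument list for a reproducible downsampling run."""
    return ["bamdownsamplerandom", f"p={p}", f"seed={seed}", f"I={input_path}"]

def run_downsample(p, seed, input_path, output_path):
    """Run the program if it is on PATH, writing its standard output to output_path."""
    if shutil.which("bamdownsamplerandom") is None:
        raise FileNotFoundError("bamdownsamplerandom not found on PATH")
    with open(output_path, "wb") as out:
        subprocess.run(build_command(p, seed, input_path), stdout=out, check=True)
```

       With a fixed seed, repeated runs on the same input are expected to keep the same reads.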

       inputformat=<bam>: input file format. All versions of bamdownsamplerandom come with
       support for the BAM input format. If the program in addition is linked to the io_lib
       package, then the following options are valid:

       bam:   BAM (see http://samtools.sourceforge.net/SAM1.pdf)

       sam:   SAM (see http://samtools.sourceforge.net/SAM1.pdf)

       cram:  CRAM (see http://www.ebi.ac.uk/ena/about/cram_toolkit)

       level=<-1|0|1|9|11>: set compression level of the output BAM file. Valid values are

       -1:    zlib/gzip default compression level

       0:     uncompressed

       1:     zlib/gzip level 1 (fast) compression

       9:     zlib/gzip level 9 (best) compression

       If libmaus has been compiled with support for igzip (see
       https://software.intel.com/en-us/articles/igzip-a-high-performance-deflate-compressor-with-optimizations-for-genomic-data)
       then an additional valid value is

       11:    igzip compression

       exclude=<SECONDARY,SUPPLEMENTARY>: Do not include reads in the output that have any of the
       given flags set. The flags are given separated by commas. Valid flags are:

       PAIRED:
              read was paired in sequencing

       PROPER_PAIR:
              read has been mapped as part of a proper pair

       UNMAP: read was not mapped

       MUNMAP:
              mate of read was not mapped

       REVERSE:
              read was mapped to the reverse strand

       MREVERSE:
              mate of read was mapped to the reverse strand

       READ1: read was first read of a pair during sequencing

       READ2: read was second read of a pair during sequencing

       SECONDARY:
              alignment is secondary, i.e. an alternative mapping to the primary alignment in the
              same file

       QCFAIL:
              read is marked as having failed quality control

       DUP:   read   is   marked   as  a  duplicate  of  another  read  in  the  same  file  (see
              bammarkduplicates)

       SUPPLEMENTARY:
              read is marked as supplementary alignment
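
       The flag names above correspond to bits of the FLAG field defined in the SAM
       specification. The following Python sketch shows how a comma separated exclude value
       could be turned into a bit mask and tested against a read's flag. The function names
       are hypothetical and the code is an illustration, not the program's own logic.

```python
# Bit values of the FLAG field as defined in the SAM specification.
SAM_FLAGS = {
    "PAIRED": 0x1,
    "PROPER_PAIR": 0x2,
    "UNMAP": 0x4,
    "MUNMAP": 0x8,
    "REVERSE": 0x10,
    "MREVERSE": 0x20,
    "READ1": 0x40,
    "READ2": 0x80,
    "SECONDARY": 0x100,
    "QCFAIL": 0x200,
    "DUP": 0x400,
    "SUPPLEMENTARY": 0x800,
}

def exclude_mask(spec):
    """Combine a comma separated list of flag names into a single bit mask."""
    mask = 0
    for name in spec.split(","):
        name = name.strip()
        if name:                      # ignore empty entries such as in ""
            mask |= SAM_FLAGS[name]
    return mask

def is_excluded(flag, mask):
    """A read is excluded if any of the masked bits is set in its flag."""
    return (flag & mask) != 0
```

       For the default value SECONDARY,SUPPLEMENTARY the mask is 0x900, so a read with flag
       99 would pass while a read with flag 355 (99 plus the SECONDARY bit) would be dropped.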

       disablevalidation=<0>: Valid values are

       0:     run input file validation on alignments (this is the default)

       1:     do not check the validity of the input file (this may help for  some  broken  input
              files,  but it is a security risk as it can lead to the execution of arbitrary code
              through a forged input file).

       colhlog=<18>: base two logarithm of the size of the hash table used for collation (the
       default value is 18 and should work reasonably well for most input files. Please see the
       biobambam paper at arxiv.org/abs/1306.0836 for details).

       colsbs=<128M>: size of hash table overflow list in bytes (the default is 128MB and should
       work reasonably well for most input files. Please see the biobambam paper at
       arxiv.org/abs/1306.0836 for details).

       T=<bamdownsamplerandom_hostname_pid_time>: file name of temporary file used for collation

       ranges=<>: coordinate ranges selected from input. This option is only available for  input
       files  in BAM format which have a corresponding index (.bai file) and if input is via file
       (i.e. the I argument is set).  Valid ranges consist either of

       whole reference sequence:
              a whole reference sequence (e.g. "chr1")

       half open interval on reference sequence:
              an interval on a reference sequence half open on the right (e.g. "chr1:50000" which
              means alignments overlapping chr1 from position 50000 to the end of chr1)

       interval on reference sequence:
              an interval on a reference sequence (e.g. "chr1:50000-60000" which means alignments
              overlapping positions 50000 to 60000 on chr1)

       Multiple  ranges  are  separated  by  space  characters   (e.g.   ranges="chr1:10000-20000
       chr1:30000-40000").
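
       The three range forms can be summarised by a small Python parser. This is a sketch
       under stated assumptions: the function names are hypothetical, missing coordinates are
       represented as None, and how the program itself treats reference names that contain a
       colon is not documented here.

```python
def parse_range(text):
    """Parse 'ref', 'ref:start' or 'ref:start-end' into (ref, start, end).

    Missing coordinates are returned as None.
    """
    ref, sep, coords = text.rpartition(":")
    if not sep:
        return (text, None, None)          # whole reference sequence
    start, dash, end = coords.partition("-")
    if not start.isdecimal() or (dash and not end.isdecimal()):
        # The part after the last colon is not a coordinate; treat it as part of the name.
        return (text, None, None)
    return (ref, int(start), int(end) if dash else None)

def parse_ranges(value):
    """Split a space separated ranges value and parse each entry."""
    return [parse_range(r) for r in value.split()]
```

       For example, parse_ranges("chr1:10000-20000 chr1:30000-40000") yields two
       (reference, start, end) tuples on chr1.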

       reference=<>: file name of the reference for CRAM input files. If this key is unset, then
       the CRAM file header will be scanned for obtaining a reference file name.

       tmpfile=<filename>: prefix for temporary files. By default the temporary files are created
       in the current directory.

       outputformat=<bam>: output file format. All versions of bamdownsamplerandom come with
       support for the BAM output format. If the program in addition is linked to the io_lib
       package, then the following options are valid:

       bam:   BAM (see http://samtools.sourceforge.net/SAM1.pdf)

       sam:   SAM (see http://samtools.sourceforge.net/SAM1.pdf)

       cram:  CRAM   (see   http://www.ebi.ac.uk/ena/about/cram_toolkit).   This  format  is  not
              advisable for data sorted by query name.

       O=<[stdout]>: output filename, standard output if unset.

       outputthreads=<[1]>: output helper threads, only valid for outputformat=bam.

       md5=<0|1>: md5 checksum creation for output  file.  This  option  can  only  be  given  if
       outputformat=bam. Then valid values are

       0:     do not compute checksum. This is the default.

       1:     compute  checksum.  If  the md5filename key is set, then the checksum is written to
              the given file. If md5filename is unset, then no checksum will be computed.

       md5filename: file name for md5 checksum if md5=1.

       index=<0|1>: compute BAM index  for  output  file.  This  option  can  only  be  given  if
       outputformat=bam. Then valid values are

       0:     do not compute BAM index. This is the default.

       1:     compute  BAM  index. If the indexfilename key is set, then the BAM index is written
              to the given file. If indexfilename is unset, then no BAM index will be computed.

       indexfilename: file name for output BAM index if index=1.

       hash=<0|1>: use hash of query name instead of a random number for selection. This makes
       the output depend on how random the hashes produced for the query names are, but it has
       the advantage of not requiring collation to keep pairs together. In contrast to
       selection by random number, the order of retained reads does not change for hash=1.
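
       The idea behind hash based selection can be sketched in Python as follows. The hash
       function actually used by the program is not stated on this page; SHA-256 and the
       function names below are assumptions chosen for illustration. Because both ends of a
       pair carry the same query name, they hash to the same value and receive the same
       decision without any collation, and the reads can be filtered in a single pass that
       preserves their order.

```python
import hashlib

def keep_by_hash(query_name, p):
    """Keep a read if the hash of its query name falls in the lowest fraction p of the range."""
    digest = hashlib.sha256(query_name.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")   # integer in [0, 2**64)
    return value < p * 2**64

def downsample_by_hash(reads, p):
    """Filter (name, record) tuples in one pass; input order is preserved."""
    return [(name, record) for name, record in reads if keep_by_hash(name, p)]
```

       The selected subset depends only on the query names and p, so no seed is involved in
       this sketch.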

AUTHOR

       Written by German Tischler.

REPORTING BUGS

       Report bugs to <tischler@mpi-cbg.de>

COPYRIGHT

       Copyright  ©  2009-2014  German  Tischler,  ©  2011-2014 Genome Research Limited.  License
       GPLv3+: GNU GPL version 3 <http://gnu.org/licenses/gpl.html>
       This is free software: you are free to change and redistribute it.  There is NO  WARRANTY,
       to the extent permitted by law.