Ubuntu Manpage: bamdownsamplerandom - downsample a SAM, BAM or CRAM file

NAME

       bamdownsamplerandom - downsample a SAM, BAM or CRAM file

SYNOPSIS

       bamdownsamplerandom [options]

DESCRIPTION

       bamdownsamplerandom reads a SAM, BAM or CRAM file from standard input, randomly discards reads and writes
       the remaining reads to standard output in BAM format. For a pair of reads either both ends are  discarded
       or both ends are kept. The order of reads in the output file may be different from the order in the input
       if the reads in the input file are not collated by their read name.

       The following key=value pairs can be given:

       p=<1>: probability for a pair of reads or a single end read to be kept. By default all reads are kept.

       seed=<>: seed used for the random number generator. By default the current time is used, i.e. each run of
       the  program  will select a different subset of reads from an input file. If the behaviour of the program
       needs to be reproducible a fixed number can be used as the random seed.
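
       The interplay of p, seed and the "both ends or neither" rule can be illustrated with a small Python
       sketch. It is not the program's implementation: the function name downsample_pairs and the
       (name, record) tuple layout are assumptions made for illustration only. Reads are grouped by name,
       one random draw is made per name, and a fixed seed makes the draw repeatable.

```python
import random


def downsample_pairs(reads, p, seed=None):
    """Keep each read name with probability p (hypothetical helper).

    reads: iterable of (name, record) tuples; both ends of a pair share a name.
    seed:  fixed integer for a reproducible selection, None for a varying one.
    """
    rng = random.Random(seed)

    # Group records by read name so that one decision covers all ends.
    groups = {}
    for name, record in reads:
        groups.setdefault(name, []).append(record)

    kept = []
    for name, records in groups.items():
        # One draw per name: either every end is kept or none is.
        if rng.random() < p:
            kept.extend((name, record) for record in records)
    return kept
```

       Because the sketch groups by name before deciding, ends that were not adjacent in the input
       come out next to each other, which mirrors the change of read order described above for input
       that is not collated by read name.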

       I=<stdin>: input file name (data is read from standard input if this option is not given)

       inputformat=<bam>: input file format. All versions of bamdownsamplerandom come with support for the BAM
       input format. If the program in addition is linked to the io_lib package, then the following options are
       valid:

       bam:   BAM (see http://samtools.sourceforge.net/SAM1.pdf)

       sam:   SAM (see http://samtools.sourceforge.net/SAM1.pdf)

       cram:  CRAM (see http://www.ebi.ac.uk/ena/about/cram_toolkit)

       level=<-1|0|1|9|11>: set compression level of the output BAM file. Valid values are

       -1:    zlib/gzip default compression level

       0:     uncompressed

       1:     zlib/gzip level 1 (fast) compression

       9:     zlib/gzip level 9 (best) compression

       If libmaus has been compiled with support for igzip (see
       https://software.intel.com/en-us/articles/igzip-a-high-performance-deflate-compressor-with-optimizations-for-genomic-data)
       then an additional valid value is

       11:    igzip compression

       exclude=<SECONDARY,SUPPLEMENTARY>:  Do  not  include reads in the output that have any of the given flags
       set. The flags are given separated by commas. Valid flags are:

       PAIRED:
              read was paired in sequencing

       PROPER_PAIR:
              read has been mapped as part of a proper pair

       UNMAP: read was not mapped

       MUNMAP:
              mate of read was not mapped

       REVERSE:
              read was mapped to the reverse strand

       MREVERSE:
              mate of read was mapped to the reverse strand

       READ1: read was first read of a pair during sequencing

       READ2: read was second read of a pair during sequencing

       SECONDARY:
              alignment is secondary, i.e. an alternative mapping to the primary alignment in the same file

       QCFAIL:
              read is marked as having failed quality control

       DUP:   read is marked as a duplicate of another read in the same file (see bammarkduplicates)

       SUPPLEMENTARY:
              read is marked as supplementary alignment
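
       The flag names above correspond to the bits of the FLAG field defined in the SAM specification.
       The following Python sketch shows how a comma separated exclude list can be turned into a bit
       mask and tested against a read. The names SAM_FLAGS, exclude_mask and is_excluded are
       hypothetical and are not part of bamdownsamplerandom.

```python
# FLAG bit values as defined in the SAM specification.
SAM_FLAGS = {
    "PAIRED": 0x1,
    "PROPER_PAIR": 0x2,
    "UNMAP": 0x4,
    "MUNMAP": 0x8,
    "REVERSE": 0x10,
    "MREVERSE": 0x20,
    "READ1": 0x40,
    "READ2": 0x80,
    "SECONDARY": 0x100,
    "QCFAIL": 0x200,
    "DUP": 0x400,
    "SUPPLEMENTARY": 0x800,
}


def exclude_mask(spec):
    """Combine a comma separated list of flag names into one bit mask."""
    mask = 0
    for name in spec.split(","):
        mask |= SAM_FLAGS[name.strip()]
    return mask


def is_excluded(flag, mask):
    """A read is dropped if any of the masked bits is set in its FLAG."""
    return (flag & mask) != 0
```

       With the default value, exclude_mask("SECONDARY,SUPPLEMENTARY") yields 0x900, so a read with
       FLAG 99 passes while a read with FLAG 2147 (99 plus the supplementary bit) is dropped.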

       disablevalidation=<0>: Valid values are

       0:     run input file validation on alignments (this is the default)

       1:     do not check the validity of the input file (this may help for some broken input files, but it  is
              a security risk as it can lead to the execution of arbitrary code through a forged input file).

       colhlog=<18>: base two logarithm of the size of the hash table used for collation (the default value is
       18 and should work reasonably well for most input files. Please see the biobambam paper at
       arxiv.org/abs/1306.0836 for details).

       colsbs=<128M>: size of hash table overflow list in bytes (the default is 128MB and should work reasonably
       well for most input files. Please see the biobambam paper at arxiv.org/abs/1306.0836 for details).

       T=<bamdownsamplerandom_hostname_pid_time>: file name of temporary file used for collation

       ranges=<>: coordinate ranges selected from input. This option is only available for input  files  in  BAM
       format  which  have  a  corresponding  index (.bai file) and if input is via file (i.e. the I argument is
       set).  Valid ranges consist either of

       whole reference sequence:
              a whole reference sequence (e.g. "chr1")

       half open interval on reference sequence:
              an interval on a reference sequence  half  open  on  the  right  (e.g.  "chr1:50000"  which  means
              alignments overlapping chr1 from position 50000 to the end of chr1)

       interval on reference sequence:
              an  interval  on  a reference sequence (e.g. "chr1:50000-60000" which means alignments overlapping
              positions 50000 to 60000 on chr1)

       Multiple ranges are separated by space characters (e.g. ranges="chr1:10000-20000 chr1:30000-40000").
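
       The three range forms can be told apart purely by their syntax. The Python sketch below parses
       a ranges value into (reference, start, end) tuples, using None for a missing bound. The function
       parse_ranges is hypothetical, it does not interpret the coordinates in any way, and splitting at
       the last colon is an assumption made here to tolerate reference names that contain a colon.

```python
def parse_ranges(spec):
    """Parse a space separated ranges string (hypothetical helper).

    Returns a list of (reference, start, end) tuples:
      "chr1"             -> ("chr1", None, None)   whole reference sequence
      "chr1:50000"       -> ("chr1", 50000, None)  open on the right
      "chr1:50000-60000" -> ("chr1", 50000, 60000) closed interval
    """
    result = []
    for item in spec.split():
        if ":" not in item:
            result.append((item, None, None))
            continue
        # Split at the last colon (assumption, see the text above).
        ref, _, interval = item.rpartition(":")
        if "-" in interval:
            start, _, end = interval.partition("-")
            result.append((ref, int(start), int(end)))
        else:
            result.append((ref, int(interval), None))
    return result
```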

       reference=<>: file name of the reference for CRAM input files. If this key is unset, then the CRAM file
       header is scanned for a reference file name.

       tmpfile=<filename>: prefix for temporary files. By default the temporary files are created in the current
       directory.

       outputformat=<bam>: output file format. All versions of bamdownsamplerandom come with support for the BAM
       output format. If the program in addition is linked to the io_lib package, then the following options are
       valid:

       bam:   BAM (see http://samtools.sourceforge.net/SAM1.pdf)

       sam:   SAM (see http://samtools.sourceforge.net/SAM1.pdf)

       cram:  CRAM  (see  http://www.ebi.ac.uk/ena/about/cram_toolkit).  This  format  is not advisable for data
              sorted by query name.

       O=<[stdout]>: output filename, standard output if unset.
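
       Since every option is a key=value pair, a call to the program is easy to assemble from a
       script. The Python sketch below builds the argument list from the keys documented on this page.
       The helpers build_command and run_downsample and the file names used with them are hypothetical,
       and run_downsample assumes that bamdownsamplerandom is installed and on the PATH.

```python
import subprocess


def build_command(**options):
    """Turn keyword arguments into key=value arguments for the program."""
    return ["bamdownsamplerandom"] + [
        f"{key}={value}" for key, value in options.items()
    ]


def run_downsample(input_path, output_path, p, seed):
    """Run the program on a file (assumes the binary is on the PATH)."""
    cmd = build_command(I=input_path, O=output_path, p=p, seed=seed)
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(cmd, check=True)
```

       For example, build_command(I="in.bam", O="out.bam", p=0.1, seed=42) produces the argument
       list for keeping about ten percent of the pairs with a reproducible selection.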

       outputthreads=<[1]>: output helper threads, only valid for outputformat=bam.

       md5=<0|1>: md5 checksum creation for output file. This option can only be given if outputformat=bam. Then
       valid values are

       0:     do not compute checksum. This is the default.

       1:     compute  checksum.  If the md5filename key is set, then the checksum is written to the given file.
              If md5filename is unset, then no checksum will be computed.

       md5filename: file name for the md5 checksum if md5=1.

       index=<0|1>: compute BAM index for output file. This option can only be given if  outputformat=bam.  Then
       valid values are

       0:     do not compute BAM index. This is the default.

       1:     compute  BAM  index.  If  the indexfilename key is set, then the BAM index is written to the given
              file. If indexfilename is unset, then no BAM index will be computed.

       indexfilename: file name for the output BAM index if index=1.

       hash=<0|1>: use hash of query name instead of a random number for selection. This makes the output depend
       on how random the hashes produced for the query names are, but it has the advantage of not requiring
       collation to keep pairs together. In contrast to the default random selection, the order of retained
       reads does not change for hash=1.
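
       The idea behind hash based selection can be sketched in Python as follows. The hash function used
       by bamdownsamplerandom is not documented here, so MD5 from the Python standard library serves as
       a stand-in, and the function keep_by_hash is hypothetical. Because the decision depends only on
       the query name, both ends of a pair get the same answer wherever they appear in the input, so the
       reads can be filtered in a single pass without collation.

```python
import hashlib


def keep_by_hash(name, p):
    """Decide from the query name alone whether a read is kept.

    MD5 is a stand-in hash (assumption); the first 8 bytes of the digest
    are read as an integer in the range [0, 2**64).
    """
    digest = hashlib.md5(name.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")
    # Keep the read if its hash falls into the lowest fraction p of the range.
    return value < p * 2**64


def downsample_by_hash(reads, p):
    """Filter (name, record) tuples in input order, no grouping needed."""
    return [(name, record) for name, record in reads if keep_by_hash(name, p)]
```

       The retained fraction is only close to p if the hash values are spread evenly over the query
       names, which is the dependence on hash quality mentioned above.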

AUTHOR

       Written by German Tischler.

REPORTING BUGS

       Report bugs to <tischler@mpi-cbg.de>

COPYRIGHT

       Copyright © 2009-2014 German Tischler, © 2011-2014 Genome Research  Limited.   License  GPLv3+:  GNU  GPL
       version 3 <http://gnu.org/licenses/gpl.html>
       This  is  free software: you are free to change and redistribute it.  There is NO WARRANTY, to the extent
       permitted by law.