Ubuntu Manpage: fastqtobam - convert FastQ to unmapped BAM

NAME

       fastqtobam - convert FastQ to unmapped BAM

SYNOPSIS

       fastqtobam [options]

DESCRIPTION

fastqtobam reads one or two FastQ files and converts them to a BAM file in which each read
is marked as unmapped. If no input file name is given, then a single FastQ file is read
from standard input. If one file name is given, then a single FastQ file is read from the
given file. In both cases, unless the name scheme is set to pairedfiles, the read names in
the file are parsed to determine whether the contained reads are paired. If two file names
are given, then the program expects two synchronous FastQ files, i.e. the first read in
the first file is the mate of the first read in the second file, and so on. Input file
names can be given either via the I key or after the key=value pairs on the command line.
The program accepts the read name formats described below under the key namescheme.
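
As a rough illustration of the input conventions described above, the following Python
sketch splits an argument list into key=value pairs and trailing file names and reports
which input mode results. The helper names split_args and input_mode, and the rule that
trailing names are appended to those given via the I key, are assumptions made for this
example. It is not a description of the actual fastqtobam argument parser.

```python
def split_args(argv):
    """Split arguments into key=value options and input file names."""
    options = {}
    key_inputs = []       # file names given via the I key
    trailing_inputs = []  # file names given after the key=value pairs
    for arg in argv:
        # Assumption: once a bare file name is seen, all later arguments are file names.
        if not trailing_inputs and "=" in arg:
            key, value = arg.split("=", 1)
            if key == "I":
                # The I key may be given twice, so collect its values in order.
                key_inputs.append(value)
            else:
                options[key] = value
        else:
            trailing_inputs.append(arg)
    return options, key_inputs + trailing_inputs


def input_mode(files):
    """Describe how input would be read for the given list of file names."""
    if len(files) == 0:
        return "single FastQ from standard input"
    if len(files) == 1:
        return "single FastQ file"
    if len(files) == 2:
        return "two synchronous FastQ files"
    raise ValueError("at most two input files are supported")


options, files = split_args(["gz=1", "I=reads_1.fq.gz", "I=reads_2.fq.gz"])
print(options, files, input_mode(files))
```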

The following key=value pairs can be given:

verbose=<[0|1]>: print progress report. By default progress is not reported.

I=<filename>: input file name (data is read from standard input if this option is not
given). This key can be given twice.

level=<-1|0|1|9|11>: set compression level of the output BAM file. Valid values are

-1: zlib/gzip default compression level

0: uncompressed

1: zlib/gzip level 1 (fast) compression

9: zlib/gzip level 9 (best) compression

If libmaus has been compiled with support for igzip (see
https://software.intel.com/en-us/articles/igzip-a-high-performance-deflate-compressor-with-optimizations-for-genomic-data)
then an additional valid value is

11: igzip compression
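
To give a feel for the zlib levels listed above, the following Python sketch compresses
the same buffer at levels -1, 0, 1 and 9 using the standard zlib module and prints the
resulting sizes. The sample payload is made up for this example, and the sketch works on a
plain zlib stream rather than on BAM output, so the numbers only illustrate the trade-off.
Level 11 (igzip) is not available through this module and is left out.

```python
import zlib

# Hypothetical sample payload; repetitive sequence data compresses well.
data = b"ACGTACGTTTGA" * 1000

sizes = {}
for level in (-1, 0, 1, 9):
    compressed = zlib.compress(data, level)
    # Every level is lossless: decompression must restore the input exactly.
    assert zlib.decompress(compressed) == data
    sizes[level] = len(compressed)

for level, size in sizes.items():
    print(f"level {level:>2}: {size} bytes (input: {len(data)} bytes)")
```

Level 0 stores the data without compression, so its output is slightly larger than the
input because of the stream framing.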

md5=<0|1>: md5 checksum creation for output file. Valid values are

0: do not compute checksum. This is the default.

1: compute checksum. If the md5filename key is set, then the checksum is written to
the given file. If md5filename is unset, then no checksum will be computed.

md5filename=<filename>: file name for the md5 checksum if md5=1.

gz=<[0|1]>: input is gzip compressed FastQ. By default input is assumed to be uncompressed
FastQ.

threads=<1>: number of additional BAM encoding helper threads.

PGID=<>: read group identifier for reads. By default no read group identifier is set. The
fields CN, DS, DT, FO, KS, LB, PG, PI, PL, PU and SM of the corresponding @RG header line
can be set by using the keys RGCN, RGDS, etc. respectively.

qualityoffset=<33>: FastQ quality offset. This value is subtracted from the ASCII code of
each quality character to obtain the quality score value.

qualitymax=<41>: maximum valid quality value, 41 by default. Higher values may indicate a
wrong setting of the qualityoffset parameter. BAM allows quality values up to the value of
94.
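
The interplay of qualityoffset and qualitymax can be sketched in Python as follows. The
function name decode_qualities and the use of a ValueError for out-of-range values are
assumptions for this example. The program's own error handling is not shown here.

```python
def decode_qualities(quality_string, qualityoffset=33, qualitymax=41):
    """Convert a FastQ quality line into integer quality scores."""
    scores = []
    for ch in quality_string:
        # Subtract the offset from the ASCII code to get the quality score.
        score = ord(ch) - qualityoffset
        if score < 0 or score > qualitymax:
            # A score outside the range often points to a wrong qualityoffset.
            raise ValueError(f"quality {score} from {ch!r} is outside 0..{qualitymax}")
        scores.append(score)
    return scores


print(decode_qualities("!5I"))                   # offset 33
print(decode_qualities("@Th", qualityoffset=64)) # offset 64
```

Decoding offset 64 data with the default offset of 33 would yield scores far above 41,
which is the kind of mismatch the qualitymax check is meant to expose.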

qualityhist=<0>: compute a quality histogram and print it on the standard error channel
after processing has finished successfully. Lines of the quality histogram are prefixed
with [H] and contain tab separated values. The histogram lists quality scores from high to
low values and has four columns (after the [H] marker). The first column is the ASCII
representation of the quality with offset 33, i.e. the symbol ! denotes quality 0. The
second column gives the absolute frequency of the value. The third column gives the
relative frequency of the value, i.e. the fraction of all quality values that are equal to
this value. The fourth column gives the cumulative relative frequency, i.e. the sum of the
relative frequencies for the current line and all lines with higher quality values.
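
The four histogram columns can be reproduced with a short Python sketch. The function
name quality_histogram, the six-digit number formatting, the tab between the [H] marker
and the first column, and the omission of scores that never occur are all assumptions made
for this example. The exact output format of the program may differ.

```python
from collections import Counter


def quality_histogram(scores):
    """Return histogram lines for a list of integer quality scores."""
    counts = Counter(scores)
    total = sum(counts.values())
    lines = []
    cumulative = 0
    # Enumerate quality values from high to low.
    for score in sorted(counts, reverse=True):
        cumulative += counts[score]
        columns = [
            "[H]",
            chr(score + 33),                   # ASCII symbol with offset 33
            str(counts[score]),                # absolute frequency
            f"{counts[score] / total:.6f}",    # relative frequency
            f"{cumulative / total:.6f}",       # cumulative over this and higher values
        ]
        lines.append("\t".join(columns))
    return lines


for line in quality_histogram([40, 40, 30, 0]):
    print(line)
```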

checkquality=<1>: check whether quality values are in range and terminate if an invalid
value is encountered.

namescheme=<generic>: read name scheme. This determines how read names are parsed. There
are four possible options:

generic:
The first sequence of non-whitespace characters is extracted from the @ line of the
FastQ record and the rest of the @ line is discarded. If the retained name ends in
/1 or /2, then the read is part of a read pair, otherwise it is the single read for
the template. For a pair the part of the name before the /1 or /2 is considered the
template name. For a single read the whole name is considered the name of the template.

c18s: The name is expected to consist of two sequences of non-whitespace characters
where the first contains seven colon separated fields and the second four colon
separated fields. The first of the two is considered to be the name of the
template. It is assumed that this read is the only read for the template.

c18pe: As for c18s, the name is expected to consist of two sequences of non-whitespace
characters where the first contains seven colon separated fields and the second
four colon separated fields. The first of the two is considered to be the name of
the template. The read is assumed to be part of a read pair. The first field in the
second non-whitespace sequence of the @ line designates whether it is the first or
second read of the pair, depending on whether the field stores the number 1 or 2
respectively.

pairedfiles:
The input fragments are assumed to be paired. If there is a single input file then
the pairs are expected to be consecutive in the file. If there are two input files
then the read names in the two are expected to be synchronous. All characters in read
names beginning from the first whitespace character are discarded. If the two (so
reduced) read names in question end in /1 and /2 respectively, then those suffixes
are clipped off as well. The remaining read names are checked for equality. If
they are not equal, then the program will reject the input and terminate.
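
The four name schemes can be sketched in Python as follows. The function names, the
(template name, mate) return convention and the use of ValueError are assumptions made for
this example, and the sample header is made up. The sketch follows the descriptions above
and does not claim to match the program's parser in every edge case.

```python
def first_token(header):
    """Drop the leading @ and everything from the first whitespace character on."""
    return header[1:].split()[0]


def parse_generic(header):
    """generic: return (template_name, mate), mate is None for a single read."""
    name = first_token(header)
    if name.endswith("/1"):
        return name[:-2], 1
    if name.endswith("/2"):
        return name[:-2], 2
    return name, None


def parse_c18(header, paired):
    """c18s (paired=False) and c18pe (paired=True)."""
    parts = header[1:].split()
    if len(parts) != 2:
        raise ValueError("expected two whitespace separated parts")
    first, second = parts[0].split(":"), parts[1].split(":")
    if len(first) != 7 or len(second) != 4:
        raise ValueError("expected seven and four colon separated fields")
    if not paired:
        return parts[0], None
    if second[0] not in ("1", "2"):
        raise ValueError("first field of second part must be 1 or 2")
    return parts[0], int(second[0])


def check_pair(header_a, header_b):
    """pairedfiles: return the common template name or reject the pair."""
    a, b = first_token(header_a), first_token(header_b)
    # Clip the suffixes only if the names end in /1 and /2 respectively.
    if a.endswith("/1") and b.endswith("/2"):
        a, b = a[:-2], b[:-2]
    if a != b:
        raise ValueError(f"read names differ: {a!r} vs {b!r}")
    return a


# Made-up sample header in the seven plus four field layout.
sample = "@EAS139:136:FC706VJ:2:2104:15343:197393 2:N:18:ATCACG"
print(parse_generic("@read7/1 length=100"))
print(parse_c18(sample, paired=True))
print(check_pair("@frag3/1 a", "@frag3/2 b"))
```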

AUTHOR

       Written by German Tischler.

REPORTING BUGS

       Report bugs to <tischler@mpi-cbg.de>

COPYRIGHT

       Copyright  ©  2009-2014  German  Tischler,  ©  2011-2014 Genome Research Limited.  License
       GPLv3+: GNU GPL version 3 <http://gnu.org/licenses/gpl.html>
       This is free software: you are free to change and redistribute it.  There is NO  WARRANTY,
       to the extent permitted by law.