NAME

       samtools-consensus - produces a consensus FASTA/FASTQ/PILEUP

SYNOPSIS

       samtools consensus [-saAMq] [-r region] [-f format] [-l line-len] [-d min-depth] [-C cutoff]
       [-c call-fract] [-H het-fract] in.bam

DESCRIPTION

Generate consensus from a SAM, BAM or CRAM file based on the contents of the alignment records. The
consensus is written either as FASTA, FASTQ, or a pileup oriented format. This is selected using the -f
FORMAT option.

The default output for FASTA and FASTQ formats includes one base per non-gap consensus position. Hence
insertions with respect to the aligned reference will be included and deletions removed. This behaviour
can be controlled with the --show-ins and --show-del options. This could be used to compute a new
reference from sequence assemblies to realign against.

The pileup-style format strictly adheres to one row per consensus location, differing from the one row
per reference base used in the related "samtools mpileup" command. This means the base quality values
for inserted columns are reported. The base quality value of a gap (either within an insertion or
otherwise) is determined as the average of the surrounding non-gap bases. The columns shown are the
reference name, position, nth base at that position (zero if not an insertion), consensus call, consensus
confidence, sequences and quality values. Note that even when a reference is supplied, the consensus base
is always reported (if non-zero depth) in the 5th column.

Two consensus calling algorithms are offered. The default computes a heterozygous consensus in a
Bayesian manner, derived from the "Gap5" consensus algorithm. Quality values are also adjusted to take
into account other nearby low quality values; this adjustment can be disabled using the --no-adj-qual
option.

This method also utilises the mapping qualities, unless the --no-use-MQ option is used. Mapping
qualities are also auto-scaled to take into account the local reference variation by processing the MD:Z
tag, unless --no-adj-MQ is used. Mapping qualities can be capped between a minimum (--low-MQ) and
maximum (--high-MQ), although the defaults are liberal and trust the data to be true. Finally an overall
scale on the resulting mapping quality can be supplied (--scale-MQ, defaulting to 1.0). This has the
effect of favouring more calls with a higher false positive rate (values greater than 1.0) or being more
cautious with higher false negative rates and lower false positive (values less than 1.0).
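
The capping and scaling steps described above can be illustrated with a short Python sketch. This is not the samtools implementation: the function name adjust_mapq is hypothetical, the order (cap first, then scale) is an assumption read from the wording "the resulting mapping quality", and the MD:Z-based adjustment is left out entirely.

```python
def adjust_mapq(mapq, low_mq=0, high_mq=60, scale_mq=1.0):
    """Clamp a mapping quality to [low_mq, high_mq], then apply the scale.

    Defaults mirror the documented --low-MQ, --high-MQ and --scale-MQ
    defaults. The cap-then-scale order is an assumption of this sketch.
    """
    capped = max(low_mq, min(high_mq, mapq))
    return capped * scale_mq
```

For example, with --low-MQ 5 and --scale-MQ 1.5 a read with mapping quality 3 would be treated as 7.5 under these assumptions; the read is not filtered out.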

The second method is a simple frequency counting algorithm, summing either +1 for each base type or +qual
if the --use-qual option is specified. This is enabled with the --mode simple option.

The summed share of a specific base type is then compared against the total possible and if this is above
the --call-fract fraction parameter then the most likely base type is called, or "N" otherwise (or absent
if it is a gap). The --ambig option permits generation of ambiguity codes instead of "N", provided the
minimum fraction of the second most common base type to the most common is above the --het-fract
fraction.
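
The frequency-counting rule above can be sketched in Python for a single consensus column. This is an illustration only, not the samtools code: the function name simple_call is hypothetical, gap and minimum-depth handling are omitted, ties are broken by first occurrence, and skipping the --call-fract test for heterozygous calls is an assumption of the sketch.

```python
# Standard IUPAC two-base ambiguity codes.
IUPAC = {
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
}

def simple_call(bases, het_fract, call_fract=0.75, quals=None, ambig=False):
    """Call one column by frequency counting (illustrative sketch).

    bases: observed base characters in the column.
    quals: optional base qualities; when given, each base adds +qual
           instead of +1 (the --use-qual behaviour).
    """
    if not bases:
        return "N"
    weights = quals if quals is not None else [1] * len(bases)
    score = {}
    for base, weight in zip(bases, weights):
        score[base] = score.get(base, 0) + weight
    total = sum(score.values())
    if total == 0:
        return "N"
    # Rank base types by summed score, highest first.
    ranked = sorted(score.items(), key=lambda kv: kv[1], reverse=True)
    top_base, top = ranked[0]
    if ambig and len(ranked) > 1:
        second_base, second = ranked[1]
        # Heterozygous if the runner-up is at least het_fract of the top.
        if second / top >= het_fract:
            return IUPAC.get(frozenset((top_base, second_base)), "N")
    return top_base if top / total >= call_fract else "N"
```

With the default call fraction of 0.75, a column of "AAC" yields "N" (A holds only two thirds of the total), while the same column with ambiguity codes enabled and a het fraction of 0.5 yields "M".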

OPTIONS

General options that apply to both algorithms:

-r REG, --region REG
Limit the query to region REG. This requires an index.

-f FMT, --format FMT
Produce format FMT, with "fastq", "fasta" and "pileup" as permitted options.

-l N, --line-len N
Sets the maximum line length of line-wrapped fasta and fastq formats to N.

-o FILE, --output FILE
Output consensus to FILE instead of stdout.

-m STR, --mode STR
Select the consensus algorithm. Valid modes are "simple" frequency counting and the "bayesian"
(Gap5) methods, with Bayesian being the default. (Note case does not matter, so "Bayesian" is
accepted too.) There are a variety of bayesian methods. Straight "bayesian" is the best set
suitable for the other parameters selected. The choice of internal parameters may change
depending on the "--P-indel" score. This method distinguishes between substitution and indel
error rates. The old Samtools consensus in version 1.16 did not distinguish types of errors,
but for compatibility the "bayesian_116" mode may be selected to replicate this.

-a
Outputs all bases, from start to end of reference, even when the aligned data does not extend
to the ends. This is most useful for construction of a full length reference sequence.

-a -a, -aa
Output absolutely all positions, including references with no data aligned against them.

--rf, --incl-flags STR|INT
Only include reads with at least one of the specified FLAG bits set. Defaults to zero, which
filters no reads.

--ff, --excl-flags STR|INT
Exclude reads with any of the specified FLAG bits set. Defaults to "UNMAP,SECONDARY,QCFAIL,DUP".
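
The two flag filters combine as in the following Python sketch. The function name keep_read is hypothetical; the bit values are the standard SAM FLAG constants, and the logic follows the option descriptions above rather than the samtools source.

```python
# Standard SAM FLAG bits named in the default --excl-flags value.
UNMAP, SECONDARY, QCFAIL, DUP = 0x4, 0x100, 0x200, 0x400

def keep_read(flag, incl_flags=0, excl_flags=UNMAP | SECONDARY | QCFAIL | DUP):
    """Return True if a read with this FLAG passes both filters."""
    # --incl-flags: a value of zero disables the filter; otherwise at
    # least one of the listed bits must be set on the read.
    if incl_flags and not (flag & incl_flags):
        return False
    # --excl-flags: reject the read if any listed bit is set.
    return not (flag & excl_flags)
```

A properly paired first-in-pair read (FLAG 99) passes the defaults, whereas a duplicate (FLAG bit 0x400) does not.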

--min-MQ INT
Filters out reads with a mapping quality below INT. This defaults to zero.

--min-BQ INT
Filters out bases with a base quality below INT. This defaults to zero.

--show-del yes/no
Whether to show deletions as "*" (yes) or to omit from the output (no). Defaults to no.

--show-ins yes/no
Whether to show insertions in the consensus. Defaults to yes.

--mark-ins
Insertions, when shown, are normally recorded in the consensus with plain 7-bit ASCII (ACGT, or
acgt if heterozygous). However this makes it impossible to identify the mapping between
consensus coordinates and the original reference coordinates. If fasta output is selected then
the option adds an underscore before every inserted base, plus a corresponding character in the
quality for fastq format. When used in conjunction with -a --show-del yes, this permits an
easy derivation of the consensus to reference coordinate mapping.
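
As a sketch of how that coordinate mapping could be derived, the Python function below walks a consensus string produced with -a --show-del yes --mark-ins. It assumes the FASTA line wrapping has already been removed, that each inserted base is immediately preceded by one underscore, and that the sequence starts at reference position 1; the function name map_marked_consensus is hypothetical.

```python
def map_marked_consensus(seq, ref_start=1):
    """Pair each consensus character with its reference position.

    Returns a list of (character, ref_pos) tuples. ref_pos is None for
    inserted bases, which have no reference coordinate. Deletions ("*")
    and "N" still consume one reference position each.
    """
    pairs = []
    ref_pos = ref_start
    i = 0
    while i < len(seq):
        if seq[i] == "_":
            # Marker: the next character is an inserted base.
            if i + 1 < len(seq):
                pairs.append((seq[i + 1], None))
            i += 2
        else:
            # Ordinary column: one character per reference position.
            pairs.append((seq[i], ref_pos))
            ref_pos += 1
            i += 1
    return pairs
```

For the made-up string "AC_G_TT*A", the bases G and T after the underscores are reported without a reference position, and the final A maps to reference position 5.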

-A, --ambig
Enables IUPAC ambiguity codes in the consensus output. Without this the output will be limited
to A, C, G, T, N and *.

-d D, --min-depth D
The minimum depth required to make a call. Defaults to 1. Failing this depth check will
produce consensus "N", or absent if it is an insertion. Note this check is performed after
filtering by flags and mapping/base quality.

-T ref.fa, --reference ref.fa
For base positions with zero coverage, use the supplied reference instead of "N". Note this
does not replace minimum depth or minimum quality filters, as the base is known but considered
low quality so the ambiguity is retained.

--ref-qual INT
When --reference is given this specifies the quality value to use for reference-derived bases.
This defaults to zero.

The following options apply only to the simple consensus mode:

-q, --use-qual
For the simple consensus algorithm, this enables use of base quality values. Instead of
summing 1 per base called, it sums the base quality. These sums are also used when evaluating
the --call-fract and --het-fract parameters. Quality values are always used for the "Gap5"
consensus method, where this option has no effect. Note that currently quality values only
affect SNPs and not inserted sequences, which are still scored with a fixed +1 per base type
occurrence.

-H H, --het-fract H
For consensus columns containing multiple base types, if the second most frequent type is at
least H fraction of the most common type then a heterozygous base type will be reported in the
consensus. Otherwise the most common base is used, provided it meets the --call-fract
parameter (otherwise "N"). The fractions computed may be modified by the use of quality values
if the -q option is enabled. Note that although IUPAC has ambiguity codes for A,C,G,T vs any
other A,C,G,T, it does not have codes for A,C,G,T vs gap (such as in a heterozygous deletion).
Given the lack of any official code, we use lower-case letters to symbolise a half-present base
type.

-c C, --call-fract C
Only used for the simple consensus algorithm. Require at least C fraction of bases agreeing
with the most likely consensus call to emit that base type. This defaults to 0.75. Failing
this check will output "N".

-@ NTHREADS
Specify the number of additional threads to use for computing the consensus. Note if no index
is present threads will only be used for parallel decompression meaning asking for more than 2
threads is unlikely to speed up processing. With an index the consensus is computed for
multiple regions simultaneously, offering near linear speed ups.

-Z BASE_COUNT
When using multiple threads this specifies the number of bases per threading job. The default
is 500,000 bp for fasta/fastq output and 100,000 for pileup output. Larger blocks may yield
improved threading performance at a cost of more memory.

The following options apply only to the Bayesian consensus mode (the default):

-C C, --cutoff C
Only used for the Gap5 consensus mode, which produces a Phred style score for the final
consensus quality. If this is below C then the consensus is called as "N".
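
The cutoff test can be illustrated with the standard Phred definition, Q = -10 log10(P), where P is the error probability. The helper names below are hypothetical and the sketch makes no claim about how samtools derives the consensus quality itself.

```python
import math

def phred(p_error):
    """Convert an error probability to a Phred-scaled quality."""
    return -10.0 * math.log10(p_error)

def apply_cutoff(base, qual, cutoff):
    """Return the called base, or "N" if its quality is below the cutoff."""
    return base if qual >= cutoff else "N"
```

Under this definition an error probability of 1 in 1000 corresponds to a quality of 30, so a cutoff of 30 keeps only calls at least that confident.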

--use-MQ, --no-use-MQ
Enable or disable the use of mapping qualities. Defaults to on.

--adj-MQ, --no-adj-MQ
If mapping qualities are used, this controls whether they are scaled by the local number of
mismatches to the reference. The reference is unknown by this tool, so this data is obtained
from the MD:Z auxiliary tag (or ignored if not present). Defaults to on.

--NM-halo INT
Specifies the distance either side of the base call being considered for computing the number
of local mismatches.

--low-MQ MIN, --high-MQ MAX
Specifies a minimum and maximum value of the mapping quality. These are not filters and
instead simply put upper and lower caps on the values. The defaults are 0 and 60.

--scale-MQ FLOAT
This is a general multiplicative mapping quality scaling factor. The effect is to globally
raise or lower the quality values used in the consensus algorithm. Defaults to 1.0, which
leaves the values unchanged.

--P-het FLOAT
Controls the likelihood of any position being a heterozygous site. This is used in the priors
for the Bayesian calculations, and makes little difference on deep data. Defaults to 1e-3.
Smaller numbers make the algorithm more likely to call a pure base type. Note the algorithm
will always compute the probability of the base being homozygous vs heterozygous, irrespective
of whether the output is reported as ambiguous (it will be "N" if deemed to be heterozygous
without --ambig mode enabled).

--P-indel FLOAT
Controls the likelihood of small indels. This is used in the priors for the Bayesian
calculations, and makes little difference on deep data. Defaults to 2e-4.

--het-scale FLOAT
This is a multiplicative correction applied per base quality before adding to the heterozygous
hypotheses. Reducing it means fewer heterozygous calls are made. This often leads to a
significant reduction in false positive het calls, at the cost of some increase in false
negatives (mislabelling real heterozygous sites as homozygous). It is usually beneficial to
reduce this on instruments where a significant proportion of bases may be aligned in the wrong
column due to insertions and deletions leading to alignment errors and reference bias. It can
be considered as a het sensitivity tuning parameter. Defaults to 1.0 (no change).

-p, --homopoly-fix
Some technologies that call runs of the same base type together always put the lowest quality
calls at one end. This can cause problems when reverse complementing and comparing alignments
with indels. This option averages the qualities at both ends to avoid orientation biases.
Recommended for old 454 or PacBio HiFi data sets.

--homopoly-score FLOAT
The -p option also reduces confidence values within homopolymers due to an additional
likelihood of sequence specific errors. The quality values are multiplied by FLOAT. This
defaults to 0.5, but is not used if -p was not specified. Adjusting this score also
automatically enables -p.

-t, --qual-calibration FILE
Loads a quality calibration table from FILE. The format of this is a series of lines with the
following fields, each starting with the literal text "QUAL":

QUAL value substitution undercall overcall

Lines starting with a "#" are ignored. Each line maps a recorded quality value to the Phred
equivalent score for substitution, undercall and overcall errors. Quality values are expected
to be sorted in increasing numerical order, but may skip values. This allows the consensus
algorithm to know the most likely cause of an error, and whether the instrument is more likely
to have indel errors (more common in some long read technologies) or substitution errors (more
common in clocked short-read instruments).

Some pre-defined calibration tables are built in. These are specified with a fake filename
starting with a colon. See the -X option for more details.

Note due to the additional heuristics applied by the consensus algorithm, these recalibration
tables are not a true reflection of the instrument error rates and are a work in progress.
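
A calibration file in the layout described above could be read with a parser along these lines. This is a sketch under stated assumptions: fields are whitespace separated, the quality value is an integer, the three scores are parsed as floats, and blank lines are skipped. The function name parse_qual_calibration is hypothetical and the code is not taken from samtools.

```python
def parse_qual_calibration(text):
    """Parse "QUAL value substitution undercall overcall" lines.

    Returns a dict mapping the recorded quality value to a tuple of
    (substitution, undercall, overcall) Phred-equivalent scores.
    """
    table = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines (an assumption) and "#" comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if fields[0] != "QUAL" or len(fields) != 5:
            raise ValueError(f"unexpected calibration line: {line!r}")
        qual = int(fields[1])
        subst, under, over = (float(x) for x in fields[2:])
        table[qual] = (subst, under, over)
    return table
```

Because the file may skip quality values, a caller would still need to decide how to treat qualities that are not listed; that policy is not covered here.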

-X, --config STR
Specifies predefined sets of configuration parameters. Acceptable values for STR are defined
below, along with the list of parameters they are equivalent to.

hiseq --qual-calibration :hiseq

hifi --qual-calibration :hifi --homopoly-fix 0.3 --low-MQ 5 --scale-MQ 1.5 --het-scale 0.37

r10.4_sup --qual-calibration :r10.4_sup --homopoly-fix 0.3 --low-MQ 5 --scale-MQ 1.5 --het-scale 0.37

r10.4_dup --qual-calibration :r10.4_dup --homopoly-fix 0.3 --low-MQ 5 --scale-MQ 1.5 --het-scale 0.37

ultima --qual-calibration :ultima --homopoly-fix 0.3 --low-MQ 10 --scale-MQ 2 --het-scale 0.37
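
For scripting, the preset table above can be written as a lookup from preset name to the equivalent option list. The values are copied from this page as listed; the names PRESETS and expand_preset are hypothetical, and the sketch does not check how samtools itself orders or overrides these options.

```python
def _preset(name, low_mq, scale_mq):
    """Build the option list shared by the long-read style presets."""
    return ["--qual-calibration", f":{name}", "--homopoly-fix", "0.3",
            "--low-MQ", low_mq, "--scale-MQ", scale_mq, "--het-scale", "0.37"]

# Preset expansions as listed under -X in this manual page.
PRESETS = {
    "hiseq": ["--qual-calibration", ":hiseq"],
    "hifi": _preset("hifi", "5", "1.5"),
    "r10.4_sup": _preset("r10.4_sup", "5", "1.5"),
    "r10.4_dup": _preset("r10.4_dup", "5", "1.5"),
    "ultima": _preset("ultima", "10", "2"),
}

def expand_preset(name):
    """Return a copy of the option list equivalent to -X NAME."""
    if name not in PRESETS:
        raise ValueError(f"unknown preset: {name!r}")
    return list(PRESETS[name])
```

Returning a copy lets callers append further options without altering the shared table.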

EXAMPLES

       -      Create a modified FASTA reference that has a 1:1 coordinate correspondence with the original
              reference used in alignment.

                samtools consensus -a --show-ins no --show-del yes in.bam -o ref.fa

       -      Create a FASTQ file for the contigs with aligned data, including insertions.

                samtools consensus -f fastq in.bam -o cons.fq

AUTHOR