Provided by: samtools_1.22.1-1_amd64 bug

NAME

       samtools-consensus - produces a consensus FASTA/FASTQ/PILEUP

SYNOPSIS

       samtools  consensus  [-saAMq]  [-r region] [-f format] [-l line-len] [-d min-depth] [-C cutoff] [-c call-
       fract] [-H het-fract] in.bam

DESCRIPTION

       Generate consensus from a SAM, BAM or CRAM file based on the contents  of  the  alignment  records.   The
       consensus  is written either as FASTA, FASTQ, or a pileup oriented format.  This is selected using the -f
       FORMAT option.

       The default output for FASTA and FASTQ formats include one base per non-gap consensus.  Hence  insertions
       with  respect  to  the  aligned  reference will be included and deletions removed.  This behaviour can be
       controlled with the --show-ins and --show-del options.  This could be used to  compute  a  new  reference
       from sequences assemblies to realign against.

       The  pileup-style  format  strictly adheres to one row per consensus location, differing from the one row
       per reference based used in the related "samtools mpileup" command.  This means the base  quality  values
       for  inserted  columns  are  reported.   The  base  quality  value of gaps (either within an insertion or
       otherwise) are determined as the average of the surrounding non-gap bases.  The  columns  shown  are  the
       reference name, position, nth base at that position (zero if not an insertion), consensus call, consensus
       confidence,  sequences and quality values.  Note even when a reference is supplied, the consensus base is
       always report (if non-zero depth) in the 5th column.

       Two consensus calling algorithms are offered.   The  default  computes  a  heterozygous  consensus  in  a
       Bayesian  manner,  derived  from the "Gap5" consensus algorithm.  Quality values are also tweaked to take
       into account other nearby low quality values.  This can also be disabled, using the --no-adj-qual option.

       This method also utilises the  mapping  qualities,  unless  the  --no-use-MQ  option  is  used.   Mapping
       qualities  are also auto-scaled to take into account the local reference variation by processing the MD:Z
       tag, unless --no-adj-MQ is used.  Mapping qualities can  be  capped  between  a  minimum  (--low-MQ)  and
       maximum (--high-MQ), although the defaults are liberal and trust the data to be true.  Finally an overall
       scale  on  the  resulting  mapping quality can be supplied (--scale-MQ, defaulting to 1.0).  This has the
       effect of favouring more calls with a higher false positive rate (values greater than 1.0) or being  more
       cautious with higher false negative rates and lower false positive (values less than 1.0).

       The second method is a simple frequency counting algorithm, summing either +1 for each base type or +qual
       if the --use-qual option is specified.  This is enabled with the --mode simple option.

       The summed share of a specific base type is then compared against the total possible and if this is above
       the --call-fract fraction parameter then the most likely base type is called, or "N" otherwise (or absent
       if  it  is a gap).  The --ambig option permits generation of ambiguity codes instead of "N", provided the
       minimum fraction of the second most common base  type  to  the  most  common  is  above  the  --het-fract
       fraction.

OPTIONS

       General options that apply to both algorithms:

       -r REG, --region REG
                 Limit the query to region REG.  This requires an index.

       -f FMT, --format FMT
                 Produce format FMT, with "fastq", "fasta" and "pileup" as permitted options.

       -l N, --line-len N
                 Sets the maximum line length of line-wrapped fasta and fastq formats to N.

       -o FILE, --output FILE
                 Output consensus to FILE instead of stdout.

       -m STR, --mode STR
                 Select the consensus algorithm.  Valid modes are "simple" frequency counting and the "bayesian"
                 (Gap5)  methods, with Bayesian being the default.  (Note case does not matter, so "Bayesian" is
                 accepted too.)  There are a variety of bayesian methods.  Straight "bayesian" is the  best  set
                 suitable  for  the  other  parameters  selected.   The choice of internal parameters may change
                 depending on the "--P-indel" score.  This method distinguishes between substitution  and  indel
                 error  rates.   The old Samtools consensus in version 1.16 did not distinguish types of errors,
                 but for compatibility the "bayesian_116" mode may be selected to replicate this.

       -a        Outputs all bases, from start to end of reference, even when the aligned data does  not  extend
                 to the ends.  This is most useful for construction of a full length reference sequence.

       -a -a, -aa
                 Output absolutely all positions, including references with no data aligned against them.

       --rf, --incl-flags STR|INT
                 Only include reads with at least one FLAG bit set.  Defaults to zero, which filters no reads.

       --ff, --excl-flags STR|INT
                 Exclude reads with any FLAG bit set.  Defaults to "UNMAP,SECONDARY,QCFAIL,DUP".

       --min-MQ INT
                 Filters out reads with a mapping quality below INT.  This defaults to zero.

       --min-BQ INT
                 Filters out bases with a base quality below INT.  This defaults to zero.

       --show-del yes/no
                 Whether to show deletions as "*" (yes) or to omit from the output (no).  Defaults to no.

       --show-ins yes/no
                 Whether to show insertions in the consensus.  Defaults to yes.

       --mark-ins
                 Insertions, when shown, are normally recorded in the consensus with plain 7-bit ASCII (ACGT, or
                 acgt  if  heterozygous).   However  this  makes  it  impossible to identify the mapping between
                 consensus coordinates and the original reference coordinates.  If fasta output is selected then
                 the option adds an underscore before every inserted base, plus a corresponding character in the
                 quality for fastq format.  When used in conjunction with -a --show-del  yes,  this  permits  an
                 easy derivation of the consensus to reference coordinate mapping.

       -A, --ambig
                 Enables IUPAC ambiguity codes in the consensus output.  Without this the output will be limited
                 to A, C, G, T, N and *.

       -d D, --min-depth D
                 The  minimum  depth  required  to  make  a call.  Defaults to 1.  Failing this depth check will
                 produce consensus "N", or absent if it is an insertion.  Note this  check  is  performed  after
                 filtering by flags and mapping/base quality.

       -T ref.fa, --reference ref.fa
                 For  base  positions  with zero coverage, use the supplied reference instead of "N".  Note this
                 does not replace minimum depth or minimum quality filters as the base is known but  considiered
                 low quality so the ambiguity is retained.

       --ref-qual INT
                 When  --reference is given this specifies the quality value to use for reference-derived bases.
                 This defaults to zero.

       The following options apply only to the simple consensus mode:

       -q, --use-qual
                 For the simple consensus algorithm, this enables  use  of  base  quality  values.   Instead  of
                 summing  1  per base called, it sums the base quality instead.  These sums are also used in the
                 --call-fract and --het-fract parameters too.  Quality values are always  used  for  the  "Gap5"
                 consensus  method  and  this  option has no effect.  Note currently  quality values only affect
                 SNPs and not inserted sequences, which  still  get  scores  with  a  fixed  +1  per  base  type
                 occurrence.

       -H H, --het-fract H
                 For  consensus  columns  containing multiple base types, if the second most frequent type is at
                 least H fraction of the most common type then a heterozygous base type will be reported in  the
                 consensus.   Otherwise  the  most  common  base  is  used,  provided  it meets the --call-fract
                 parameter (otherwise "N").  The fractions computed may be modified by the use of quality values
                 if the -q option is enabled.  Note although IUPAC has ambiguity codes for A,C,G,T vs any  other
                 A,C,G,T  it does not have codes for A,C,G,T vs gap (such as in a heterozygous deletion).  Given
                 the lack of any official code, we use lower-case letter to symbolise a half-present base type.

       -c C, --call-fract C
                 Only used for the simple consensus algorithm.  Require at least C fraction  of  bases  agreeing
                 with  the  most  likely consensus call to emit that base type.  This defaults to 0.75.  Failing
                 this check will output "N".

       -@ NTHREADS
                 Specify the number of additional threads to use for computing the consensus.  Note if no  index
                 is  present threads will only be used for parallel decompression meaning asking for more than 2
                 threads is unlikely to speed up processing.  With  an  index  the  consensus  is  computed  for
                 multiple regions simultaneously, offering near linear speed ups.

       -Z BASE_COUNT
                 When  using multiple threads this specifies the number of bases per threading job.  The default
                 is 500,000 bp for fasta/fastq output and 100,000 for pileup output.  Larger  blocks  may  yield
                 improved threading performance at a cost of more memory.

       The following options apply only to Bayesian consensus mode enabled
       (default on).

       -C C, --cutoff C
                 Only  used  for  the  Gap5  consensus  mode,  which  produces a Phred style score for the final
                 consensus quality.  If this is below C then the consensus is called as "N".

       --use-MQ, --no-use-MQ
                 Enable or disable the use of mapping qualities.  Defaults to on.

       --adj-MQ, --no-adj-MQ
                 If mapping qualities are used, this controls whether they are scaled by  the  local  number  of
                 mismatches  to  the reference.  The reference is unknown by this tool, so this data is obtained
                 from the MD:Z auxiliary tag (or ignored if not present).  Defaults to on.

       --NM-halo INT
                 Specifies the distance either side of the base call being considered for computing  the  number
                 of local mismatches.

       --low-MQ MIN, --high-MQ MAX
                 Specifies  a  minimum  and  maximum  value  of  the mapping quality.  These are not filters and
                 instead simply put upper and lower caps on the values.  The defaults are 0 and 60.

       --scale-MQ FLOAT
                 This is a general multiplicative  mapping quality scaling factor.  The effect  is  to  globally
                 raise  or  lower  the  quality  values used in the consensus algorithm.  Defaults to 1.0, which
                 leaves the values unchanged.

       --P-het FLOAT
                 Controls the likelihood of any position being a heterozygous site.  This is used in the  priors
                 for  the  Bayesian  calculations,  and  has  little difference on deep data.  Defaults to 1e-3.
                 Smaller numbers makes the algorithm more likely to call a pure base type.  Note  the  algorithm
                 will  always compute the probability of the base being homozygous vs heterozygous, irrespective
                 of whether the output is reported as ambiguous (it will be "N" if  deemed  to  be  heterozygous
                 without --ambig mode enabled).

       --P-indel FLOAT
                 Controls  the  likelihood  of  small  indels.   This  is  used  in  the priors for the Bayesian
                 calculations, and has little difference on deep data.  Defaults to 2e-4.

       --het-scale FLOAT
                 This is a multiplicative correction applied per base quality before adding to the  heterozygous
                 hypotheses.   Reducing  it  means  fewer  heterozygous  calls  are  made.   This oftens leads a
                 significant reduction in false positive  het  calls,  for  some  increase  in  false  negatives
                 (mislabelling  real heterozygous sites as homozygous).  It is usually beneficial to reduce this
                 on instruments where a significant proportion of bases may be aligned in the wrong  column  due
                 to  insertions  and  deletions  leading  to  alignment  errors  and  reference bias.  It can be
                 considered as a het sensitivity tuning parameter.  Defaults to 1.0 (nop).

       -p, --homopoly-fix
                 Some technologies that call runs of the same base type together always put the  lowest  quality
                 calls  at one end.  This can cause problems when reverse complementing and comparing alignments
                 with indels.  This option averages the qualities at both  ends  to  avoid  orientation  biases.
                 Recommended for old 454 or PacBio HiFi data sets.

       --homopoly-score FLOAT
                 The  -p  option  also  reduces  confidence  values  within  homopolymers  due  to an additional
                 likelihood of sequence specific errors.  The quality values  are  multiplied  by  FLOAT.   This
                 defaults  to  0.5,  but  is  not  used  if  -p  was  not  specified.  Adjusting this score also
                 automatically enables -p.

       -t, --qual-calibration FILE
                 Loads a quality calibration table from FILE.  The format of this is a series of lines with  the
                 following fields, each starting with the literal text "QUAL":

                     QUAL value substitution undercall overcall

                 Lines  starting  with  a "#" are ignored.  Each line maps a recorded quality value to the Phred
                 equivalent score for substitution, undercall and overcall errors.  Quality values are  expected
                 to  be  sorted  in  increasing numerical order, but may skip values.  This allows the consensus
                 algorithm to know the most likely cause of an error, and whether the instrument is more  likely
                 to  have indel errors (more common in some long read technologies) or substitution errors (more
                 common in clocked short-read instruments).

                 Some pre-defined calibration tables are built in.  These are specified  with  a  fake  filename
                 starting with a colon.  See the -X option for more details.

                 Note  due  to the additional heuristics applied by the consensus algorithm, these recalibration
                 tables are not a true reflection of the instrument error rates and are a work in progress.

       -X, --config STR
                 Specifies predefined sets of configuration parameters.  Acceptable values for STR  are  defined
                 below, along with the list of parameters they are equivalent to.

                 hiseq     --qual-calibration :hiseq

                 hifi      --qual-calibration  :hifi  --homopoly-fix  0.3  --low-MQ 5 --scale-MQ 1.5 --het-scale
                           0.37

                 r10.4_sup --qual-calibration :r10.4_sup --homopoly-fix 0.3 --low-MQ  5  --scale-MQ  1.5  --het-
                           scale 0.37

                 r10.4_dup --qual-calibration  :r10.4_dup  --homopoly-fix  0.3  --low-MQ 5 --scale-MQ 1.5 --het-
                           scale 0.37

                 ultima    --qual-calibration :ultima --homopoly-fix 0.3 --low-MQ 10  --scale-MQ  2  --het-scale
                           0.37

EXAMPLES

       -      Create  a  modified  FASTA  reference  that  has a 1:1 coordinate correspondence with the original
              reference used in alignment.

                samtools consensus -a --show-ins no --show-del yes in.bam -o ref.fa

       -      Create a FASTQ file for the contigs with aligned data, including insertions.

                samtools consensus -f fastq in.bam -o cons.fq

AUTHOR

       Written by James Bonfield from the Sanger Institute.

SEE ALSO

       samtools(1), samtools-mpileup(1),

       Samtools website: <http://www.htslib.org/>

samtools-1.22.1                                   14 July 2025                             samtools-consensus(1)