Provided by: samtools_1.20-3_amd64 bug

NAME

       samtools-consensus - produces a consensus FASTA/FASTQ/PILEUP

SYNOPSIS

       samtools  consensus  [-saAMq]  [-r  region]  [-f  format] [-l line-len] [-d min-depth] [-C
       cutoff] [-c call-fract] [-H het-fract] in.bam

DESCRIPTION

       Generate consensus from a SAM, BAM or CRAM file based on the  contents  of  the  alignment
       records.   The  consensus  is written either as FASTA, FASTQ, or a pileup oriented format.
       This is selected using the -f FORMAT option.

       The default output for FASTA and FASTQ formats include one  base  per  non-gap  consensus.
       Hence  insertions  with  respect  to  the aligned reference will be included and deletions
       removed.  This behaviour can be controlled with the  --show-ins  and  --show-del  options.
       This  could  be  used  to  compute  a  new  reference from sequences assemblies to realign
       against.

       The pileup-style format strictly adheres to one row per consensus location, differing from
       the  one  row  per  reference  based used in the related "samtools mpileup" command.  This
       means the base quality values for inserted columns are reported.  The base  quality  value
       of  gaps  (either  within  an insertion or otherwise) are determined as the average of the
       surrounding non-gap bases.  The columns shown are the reference name, position,  nth  base
       at  that  position  (zero  if  not  an  insertion),  consensus call, consensus confidence,
       sequences and quality values.

       Two consensus calling  algorithms  are  offered.   The  default  computes  a  heterozygous
       consensus  in  a  Bayesian  manner,  derived from the "Gap5" consensus algorithm.  Quality
       values are also tweaked to take into account other nearby low quality  values.   This  can
       also be disabled, using the --no-adj-qual option.

       This  method  also  utilises the mapping qualities, unless the --no-use-MQ option is used.
       Mapping qualities are also auto-scaled to take into account the local reference  variation
       by  processing  the MD:Z tag, unless --no-adj-MQ is used.  Mapping qualities can be capped
       between a minimum (--low-MQ) and maximum (--high-MQ), although the  defaults  are  liberal
       and  trust the data to be true.  Finally an overall scale on the resulting mapping quality
       can be supplied (--scale-MQ, defaulting to 1.0).  This has the effect  of  favouring  more
       calls  with  a higher false positive rate (values greater than 1.0) or being more cautious
       with higher false negative rates and lower false positive (values less than 1.0).

       The second method is a simple frequency counting algorithm, summing  either  +1  for  each
       base type or +qual if the --use-qual option is specified.  This is enabled with the --mode
       simple option.

       The summed share of a specific base type is then compared against the total  possible  and
       if  this  is  above  the --call-fract fraction parameter then the most likely base type is
       called, or "N" otherwise (or  absent  if  it  is  a  gap).   The  --ambig  option  permits
       generation  of ambiguity codes instead of "N", provided the minimum fraction of the second
       most common base type to the most common is above the --het-fract fraction.

OPTIONS

       General options that apply to both algorithms:

       -r REG, --region REG
                 Limit the query to region REG.  This requires an index.

       -f FMT, --format FMT
                 Produce format FMT, with "fastq", "fasta" and "pileup" as permitted options.

       -l N, --line-len N
                 Sets the maximum line length of line-wrapped fasta and fastq formats to N.

       -o FILE, --output FILE
                 Output consensus to FILE instead of stdout.

       -m STR, --mode STR
                 Select the consensus algorithm.  Valid modes are "simple" frequency counting and
                 the "bayesian" (Gap5) methods, with Bayesian being the default.  (Note case does
                 not matter, so "Bayesian" is accepted too.)  There are  a  variety  of  bayesian
                 methods.   Straight "bayesian" is the best set suitable for the other parameters
                 selected.  The choice of internal parameters may change depending on  the  "--P-
                 indel"  score.   This  method distinguishes between substitution and indel error
                 rates.  The old Samtools consensus in version 1.16 did not distinguish types  of
                 errors,  but  for  compatibility  the  "bayesian_116"  mode  may  be selected to
                 replicate this.

       -a        Outputs all bases, from start to end of reference, even when  the  aligned  data
                 does  not  extend  to  the ends.  This is most useful for construction of a full
                 length reference sequence.

       -a -a, -aa
                 Output absolutely all positions,  including  references  with  no  data  aligned
                 against them.

       --rf, --incl-flags STR|INT
                 Only  include  reads  with  at  least one FLAG bit set.  Defaults to zero, which
                 filters no reads.

       --ff, --excl-flags STR|INT
                 Exclude reads with any FLAG bit set.  Defaults to "UNMAP,SECONDARY,QCFAIL,DUP".

       --min-MQ INT
                 Filters out reads with a mapping quality below INT.  This defaults to zero.

       --min-BQ INT
                 Filters out bases with a base quality below INT.  This defaults to zero.

       --show-del yes/no
                 Whether to show deletions as  "*"  (yes)  or  to  omit  from  the  output  (no).
                 Defaults to no.

       --show-ins yes/no
                 Whether to show insertions in the consensus.  Defaults to yes.

       --mark-ins
                 Insertions,  when shown, are normally recorded in the consensus with plain 7-bit
                 ASCII (ACGT, or acgt if heterozygous).  However  this  makes  it  impossible  to
                 identify  the  mapping  between consensus coordinates and the original reference
                 coordinates.  If fasta output is selected then the  option  adds  an  underscore
                 before  every  inserted  base, plus a corresponding character in the quality for
                 fastq format.  When used in conjunction with -a --show-del yes, this permits  an
                 easy derivation of the consensus to reference coordinate mapping.

       -A, --ambig
                 Enables  IUPAC ambiguity codes in the consensus output.  Without this the output
                 will be limited to A, C, G, T, N and *.

       -d D, --min-depth D
                 The minimum depth required to make a call.  Defaults to 1.  Failing  this  depth
                 check  will  produce  consensus "N", or absent if it is an insertion.  Note this
                 check is performed after filtering by flags and mapping/base quality.

       The following options apply only to the simple consensus mode:

       -q, --use-qual
                 For the simple consensus algorithm, this enables use  of  base  quality  values.
                 Instead  of  summing 1 per base called, it sums the base quality instead.  These
                 sums are also used in the --call-fract and --het-fract parameters too.   Quality
                 values  are  always  used for the "Gap5" consensus method and this option has no
                 affect.  Note currently  quality  values  only  affect  SNPs  and  not  inserted
                 sequences, which still get scores with a fixed +1 per base type occurrence.

       -H H, --het-fract H
                 For  consensus  columns  containing  multiple  base  types,  if  the second most
                 frequent type is at least H fraction of the most common type then a heterozygous
                 base  type will be reported in the consensus.  Otherwise the most common base is
                 used, provided  it  meets  the  --call-fract  parameter  (otherwise  "N").   The
                 fractions computed may be modified by the use of quality values if the -q option
                 is enabled.  Note although IUPAC has ambiguity codes for A,C,G,T  vs  any  other
                 A,C,G,T  it  does  not  have codes for A,C,G,T vs gap (such as in a heterozygous
                 deletion).  Given the lack of any official code, we  use  lower-case  letter  to
                 symbolise a half-present base type.

       -c C, --call-fract C
                 Only  used  for  the simple consensus algorithm.  Require at least C fraction of
                 bases agreeing with the most likely consensus call to emit that base type.  This
                 defaults to 0.75.  Failing this check will output "N".

       The following options apply only to Bayesian consensus mode enabled
       (default on).

       -C C, --cutoff C
                 Only  used  for  the Gap5 consensus mode, which produces a Phred style score for
                 the final consensus quality.  If this is below C then the consensus is called as
                 "N".

       --use-MQ, --no-use-MQ
                 Enable or disable the use of mapping qualities.  Defaults to on.

       --adj-MQ, --no-adj-MQ
                 If  mapping  qualities  are  used,  this controls whether they are scaled by the
                 local number of mismatches to the reference.  The reference is unknown  by  this
                 tool,  so  this  data is obtained from the MD:Z auxiliary tag (or ignored if not
                 present).  Defaults to on.

       --NM-halo INT
                 Specifies the distance either  side  of  the  base  call  being  considered  for
                 computing the number of local mismatches.

       --low-MQ MIN, --high-MQ MAX
                 Specifies  a  minimum  and  maximum value of the mapping quality.  These are not
                 filters and instead simply put upper and lower caps on the values.  The defaults
                 are 0 and 60.

       --scale-MQ FLOAT
                 This is a general multiplicative  mapping quality scaling factor.  The effect is
                 to globally raise or lower the quality values used in the  consensus  algorithm.
                 Defaults to 1.0, which leaves the values unchanged.

       --P-het FLOAT
                 Controls the likelihood of any position being a heterozygous site.  This is used
                 in the priors for the Bayesian calculations, and has little difference  on  deep
                 data.   Defaults  to  1e-3.   Smaller numbers makes the algorithm more likely to
                 call a pure base type.  Note the algorithm will always compute  the  probability
                 of the base being homozygous vs heterozygous, irrespective of whether the output
                 is reported as ambiguous (it will be "N" if deemed to  be  heterozygous  without
                 --ambig mode enabled).

       --P-indel FLOAT
                 Controls  the  likelihood  of  small indels.  This is used in the priors for the
                 Bayesian calculations, and has little difference  on  deep  data.   Defaults  to
                 2e-4.

       --het-scale FLOAT
                 This  is  a  multiplicative correction applied per base quality before adding to
                 the heterozygous hypotheses.  Reducing it means  fewer  heterozygous  calls  are
                 made.   This  oftens  leads a significant reduction in false positive het calls,
                 for some increase in false negatives (mislabelling real  heterozygous  sites  as
                 homozygous).   It  is  usually  beneficial to reduce this on instruments where a
                 significant proportion of bases may be  aligned  in  the  wrong  column  due  to
                 insertions and deletions leading to alignment errors and reference bias.  It can
                 be considered as a het sensitivity tuning parameter.  Defaults to 1.0 (nop).

       -p, --homopoly-fix
                 Some technologies that call runs of the same base type together always  put  the
                 lowest  quality  calls  at  one  end.   This  can  cause  problems  when reverse
                 complementing and comparing alignments with indels.  This  option  averages  the
                 qualities  at both ends to avoid orientation biases.  Recommended for old 454 or
                 PacBio HiFi data sets.

       --homopoly-score FLOAT
                 The -p option also reduces confidence  values  within  homopolymers  due  to  an
                 additional  likelihood  of  sequence  specific  errors.   The quality values are
                 multiplied by FLOAT.  This defaults to 0.5, but  is  not  used  if  -p  was  not
                 specified.  Adjusting this score also automatically enables -p.

       -t, --qual-calibration FILE
                 Loads  a quality calibration table from FILE.  The format of this is a series of
                 lines with the following fields, each starting with the literal text "QUAL":

                     QUAL value substitution undercall overcall

                 Lines starting with a "#" are ignored.  Each line maps a recorded quality  value
                 to  the  Phred equivalent score for substitution, undercall and overcall errors.
                 Quality values are expected to be sorted in increasing numerical order, but  may
                 skip  values.  This allows the consensus algorithm to know the most likely cause
                 of an error, and whether the instrument is more  likely  to  have  indel  errors
                 (more common in some long read technologies) or substitution errors (more common
                 in clocked short-read instruments).

                 Some pre-defined calibration tables are built in.  These are  specified  with  a
                 fake filename starting with a colon.  See the -X option for more details.

                 Note  due to the additional heuristics applied by the consensus algorithm, these
                 recalibration tables are not a true reflection of the instrument error rates and
                 are a work in progress.

       -X, --config STR
                 Specifies  predefined  sets  of configuration parameters.  Acceptable values for
                 STR are defined below, along with the list of parameters they are equivalent to.

                 hiseq     --qual-calibration :hiseq

                 hifi      --qual-calibration :hifi --homopoly-fix 0.3 --low-MQ 5 --scale-MQ  1.5
                           --het-scale 0.37

                 r10.4_sup --qual-calibration :r10.4_sup --homopoly-fix 0.3 --low-MQ 5 --scale-MQ
                           1.5 --het-scale 0.37

                 r10.4_dup --qual-calibration :r10.4_dup --homopoly-fix 0.3 --low-MQ 5 --scale-MQ
                           1.5 --het-scale 0.37

                 ultima    --qual-calibration :ultima --homopoly-fix 0.3 --low-MQ 10 --scale-MQ 2
                           --het-scale 0.37

EXAMPLES

       -      Create a modified FASTA reference that has a 1:1 coordinate correspondence with the
              original reference used in alignment.

                samtools consensus -a --show-ins no in.bam -o ref.fa

       -      Create a FASTQ file for the contigs with aligned data, including insertions.

                samtools consensus -f fastq in.bam -o cons.fq

AUTHOR

       Written by James Bonfield from the Sanger Institute.

SEE ALSO

       samtools(1), samtools-mpileup(1),

       Samtools website: <http://www.htslib.org/>