Provided by: samtools_1.16.1-1_amd64 bug

NAME

       samtools-consensus - produces produce a consensus FASTA/FASTQ/PILEUP

SYNOPSIS

       samtools  consensus  [-saAMq]  [-r  region]  [-f  format] [-l line-len] [-d min-depth] [-C
       cutoff] [-c call-fract] [-H het-fract] in.bam

DESCRIPTION

       Generate consensus from a SAM, BAM or CRAM file based on the  contents  of  the  alignment
       records.   The  consensus  is written either as FASTA, FASTQ, or a pileup oriented format.
       This is selected using the -f FORMAT option.

       The default output for FASTA and FASTQ formats include one  base  per  non-gap  consensus.
       Hence  insertions  with  respect  to  the aligned reference will be included and deletions
       removed.  This behaviour can be controlled with the  --show-ins  and  --show-del  options.
       This  could  be  used  to  compute  a  new  reference from sequences assemblies to realign
       against.

       The pileup-style format strictly adheres to one row per consensus location, differing from
       the  one  row  per  reference  based used in the related "samtools mpileup" command.  This
       means the base quality values for inserted columns are reported.  The base  quality  value
       of  gaps  (either  within  an insertion or otherwise) are determined as the average of the
       surrounding non-gap bases.  The columns shown are the reference name, position,  nth  base
       at  that  position  (zero  if  not  an  insertion),  consensus call, consensus confidence,
       sequences and quality values.

       Two consensus calling  algorithms  are  offered.   The  default  computes  a  heterozygous
       consensus  in  a  Bayesian  manner,  derived from the "Gap5" consensus algorithm.  Quality
       values are also tweaked to take into account other nearby low quality  values.   This  can
       also be disabled, using the --no-adj-qual option.

       This  method  also  utilises the mapping qualities, unless the --no-use-MQ option is used.
       Mapping qualities are also auto-scaled to take into account the local reference  variation
       by  processing  the MD:Z tag, unless --no-adj-MQ is used.  Mapping qualities can be capped
       between a minimum (--low-MQ) and maximum (--high-MQ), although the  defaults  are  liberal
       and  trust the data to be true.  Finally an overall scale on the resulting mapping quality
       can be supplied (--scale-MQ, defaulting to 1.0).  This has the effect  of  favouring  more
       calls  with  a higher false positive rate (values greater than 1.0) or being more cautious
       with higher false negative rates and lower false positive (values less than 1.0).

       The second method is a simple frequency counting algorithm, summing  either  +1  for  each
       base type or +qual if the --use-qual option is specified.  This is enabled with the --mode
       simple option.

       The summed share of a specific base type is then compared against the total  possible  and
       if  this  is  above  the --call-fract fraction parameter then the most likely base type is
       called, or "N" otherwise (or  absent  if  it  is  a  gap).   The  --ambig  option  permits
       generation  of ambiguity codes instead of "N", provided the minimum fraction of the second
       most common base type to the most common is above the --het-fract fraction.

OPTIONS

       General options that apply to both algorithms:

       -r REG, --region REG
                 Limit the query to region REG.  This requires an index.

       -f FMT, --format FMT
                 Produce format FMT, with "fastq", "fasta" and "pileup" as permitted options.

       -l N, --line-len N
                 Sets the maximum line length of line-wrapped fasta and fastq formats to N.

       -o FILE, --output FILE
                 Output consensus to FILE instead of stdout.

       -m STR, --mode STR
                 Select the consensus algorithm.  Valid modes are "simple" frequency counting and
                 the  "bayesian" (Gap5) method, with Bayesian being the default.  (Note case does
                 not matter, so "Bayesian" is accepted too.)

       -a        Outputs all bases, from start to end of reference, even when  the  aligned  data
                 does  not  extend  to  the ends.  This is most useful for construction of a full
                 length reference sequence.

       --rf, --incl-flags STR|INT
                 Only include reads with at least one FLAG bit  set.   Defaults  to  zero,  which
                 filters no reads.

       --ff, --excl-flags STR|INT
                 Exclude reads with any FLAG bit set.  Defaults to "UNMAP,SECONDARY,QCFAIL,DUP".

       --min-MQ INT
                 Filters out reads with a mapping quality below INT.  This defaults to zero.

       --show-del yes/no
                 Whether  to  show  deletions  as  "*"  (no)  or  to  omit from the output (yes).
                 Defaults to no.

       --show-ins yes/no
                 Whether to show insertions in the consensus.  Defaults to yes.

       -A, --ambig
                 Enables IUPAC ambiguity codes in the consensus output.  Without this the  output
                 will be limited to A, C, G, T, N and *.

       The following options apply only to the simple consensus mode:

       -q, --use-qual
                 For  the  simple  consensus  algorithm, this enables use of base quality values.
                 Instead of summing 1 per base called, it sums the base quality  instead.   These
                 sums  are also used in the --call-fract and --het-fract parameters too.  Quality
                 values are always used for the "Gap5" consensus method and this  option  has  no
                 affect.   Note  currently   quality  values  only  affect  SNPs and not inserted
                 sequences, which still get scores with a fixed +1 per base type occurrence.

       -d D, --min-depth D
                 The minimum depth required to make a call.  Defaults to 1.  Failing  this  depth
                 check will produce consensus "N", or absent if it is an insertion.

       -H H, --het-fract H
                 For  consensus  columns  containing  multiple  base  types,  if  the second most
                 frequent type is at least H fraction of the most common type then a heterozygous
                 base  type will be reported in the consensus.  Otherwise the most common base is
                 used, provided  it  meets  the  --call-fract  parameter  (otherwise  "N").   The
                 fractions computed may be modified by the use of quality values if the -q option
                 is enabled.  Note although IUPAC has ambiguity codes for A,C,G,T  vs  any  other
                 A,C,G,T  it  does  not  have codes for A,C,G,T vs gap (such as in a heterozygous
                 deletion).  Given the lack of any official code, we  use  lower-case  letter  to
                 symbolise a half-present base type.

       -c C, --call-fract C
                 Only  used  for  the simple consensus algorithm.  Require at least C fraction of
                 bases agreeing with the most likely consensus call to omit that base type.  This
                 defaults to 0.75.  Failing this check will output "N".

       The following options apply only to Bayesian consensus mode enabled
       with the -5 option.

       -5        Enable Bayesian consensus algorithm.

       -C C, --cutoff C
                 Only  used  for  the Gap5 consensus mode, which produces a Phred style score for
                 the final consensus quality.  If this is below C then the consensus is called as
                 "N".

       --use-MQ, --no-use-MQ
                 Enable or disable the use of mapping qualities.  Defaults to on.

       --adj-MQ, --no-adj-MQ
                 If  mapping  qualities  are  used,  this controls whether they are scaled by the
                 local number of mismatches to the reference.  The reference is unknown  by  this
                 tool,  so  this  data is obtained from the MD:Z auxiliary tag (or ignored if not
                 present).  Defaults to on.

       --NM-halo INT
                 Specifies the distance either  side  of  the  base  call  being  considered  for
                 computing the number of local mismatches.

       --low-MQ MIN, --high-MQ MAX
                 Specifies  a  minimum  and  maximum value of the mapping quality.  These are not
                 filters and instead simply put upper and lower caps on the values.  The defaults
                 are 0 and 60.

       --scale-MQ FLOAT
                 This is a general multiplicative  mapping quality scaling factor.  The effect is
                 to globally raise or lower the quality values used in the  consensus  algorithm.
                 Defaults to 1.0, which leaves the values unchanged.

       --P-het FLOAT
                 Controls  the likelihood of any position being a heterozygous site.  Defaults to
                 1e-4.  Smaller numbers makes the algorithm more likely to call a pure base type.
                 Note  the  algorithm  will  always  compute  the  probability  of the base being
                 homozygous vs heterozygous, irrespective of whether the output  is  reported  as
                 ambiguous  (it  will  be  "N"  if deemed to be heterozygous without --ambig mode
                 enabled).

EXAMPLES

       -      Create a modified FASTA reference that has a 1:1 coordinate correspondence with the
              original reference used in alignment.

                samtools consensus -a --show-ins no in.bam -o ref.fa

       -      Create a FASTQ file for the contigs with aligned data, including insertions.

                samtools consensus -f fastq in.bam -o cons.fq

AUTHOR

       Written by James Bonfield from the Sanger Institute.

SEE ALSO

       samtools(1), samtools-mpileup(1),

       Samtools website: <http://www.htslib.org/>