Provided by: samtools_1.20-3_amd64

NAME

       samtools-stats - produces comprehensive statistics from an alignment file

SYNOPSIS

       samtools stats [options] in.sam|in.bam|in.cram [region...]

DESCRIPTION

       samtools stats collects statistics from SAM, BAM and CRAM files and outputs them in a text
       format.  The output can be visualized graphically using plot-bamstats.

       A summary of output sections is listed below, followed by more detailed descriptions.

       CHK    Checksum
       SN     Summary numbers
       FFQ    First fragment qualities
       LFQ    Last fragment qualities
       GCF    GC content of first fragments
       GCL    GC content of last fragments
       GCC    ACGT content per cycle
       GCT    ACGT content per cycle, read oriented
       FBC    ACGT content per cycle for first fragments only
       FTC    ACGT raw counters for first fragments
       LBC    ACGT content per cycle for last fragments only
       LTC    ACGT raw counters for last fragments
       BCC    ACGT content per cycle for BC barcode
       CRC    ACGT content per cycle for CR barcode
       OXC    ACGT content per cycle for OX barcode
       RXC    ACGT content per cycle for RX barcode
       MPC    Mismatch distribution per cycle
       QTQ    Quality distribution for BC barcode
       CYQ    Quality distribution for CR barcode
       BZQ    Quality distribution for OX barcode
       QXQ    Quality distribution for RX barcode
       IS     Insert sizes
       RL     Read lengths
       FRL    Read lengths for first fragments only
       LRL    Read lengths for last fragments only
       MAPQ   Mapping qualities
       ID     Indel size distribution
       IC     Indels per cycle
       COV    Coverage (depth) distribution
       GCD    GC-depth

       The "cycle" terminology used here originates from the  Illumina  instruments,  but  it  is
       interpreted  more  generally  as  the  Nth  base reported in the original read orientation
       (starting from 1).

       Not all sections will be reported as some depend on the data being coordinate sorted while
       others are only present when specific barcode tags are in use.

       Some  of  the  statistics  are collected for “first” or “last” fragments.  Records are put
       into these categories using the PAIRED (0x1), READ1 (0x40) and READ2 (0x80) flag bits,  as
       follows:

       •   Unpaired reads (i.e. PAIRED is not set) are all “first” fragments.  For these records,
           the READ1 and READ2 flags are ignored.

       •   Reads where PAIRED and READ1 are set, and READ2 is not set are “first” fragments.

       •   Reads where PAIRED and READ2 are set, and READ1 is not set are “last” fragments.

       •   Reads where PAIRED is set and either both READ1 and READ2 are set or  neither  is  set
           are not counted in either category.

       Information  on  the  meaning  of  the  flags  is  given in the SAM specification document
       <https://samtools.github.io/hts-specs/SAMv1.pdf>.
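
       The classification rules above can be sketched as a small function.  This is only an
       illustration of the rules as written here, not code taken from samtools; the function
       name fragment_category is hypothetical.

```python
# Flag bits as named in the SAM specification.
PAIRED, READ1, READ2 = 0x1, 0x40, 0x80


def fragment_category(flag):
    """Return 'first', 'last' or None for a SAM FLAG value.

    Hypothetical helper that mirrors the four rules listed in the text.
    """
    if not flag & PAIRED:
        # Unpaired reads are always "first"; READ1/READ2 are ignored.
        return "first"
    read1 = bool(flag & READ1)
    read2 = bool(flag & READ2)
    if read1 and not read2:
        return "first"
    if read2 and not read1:
        return "last"
    # PAIRED with both or neither of READ1/READ2: counted in neither category.
    return None


for flag in (0x0, 0x40, 0x41, 0x81, 0xC1, 0x1):
    print(hex(flag), fragment_category(flag))
```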

       The CHK row contains distinct CRC32 checksums of read names, sequences and quality values.
       The  checksums are computed per alignment record and summed, meaning the checksum does not
       change if the input file has the sort-order changed.
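
       The sort-order independence follows from addition being commutative.  The sketch below
       demonstrates the idea with Python's zlib.crc32; the 32-bit wrap-around and the exact
       bytes being hashed are assumptions made for illustration and may differ from what
       samtools does internally.

```python
import zlib


def summed_crc32(values):
    """Sum per-record CRC32 values, wrapping at 32 bits (an assumption)."""
    total = 0
    for value in values:
        total = (total + zlib.crc32(value.encode("ascii"))) & 0xFFFFFFFF
    return total


# Hypothetical read names; any reordering gives the same sum.
names = ["read1", "read2", "read3"]
shuffled = ["read3", "read1", "read2"]
print(summed_crc32(names) == summed_crc32(shuffled))
```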

       The SN section contains a series of counts, percentages, and averages, in a similar  style
       to samtools flagstat, but more comprehensive.

              raw  total sequences - total number of reads in a file, excluding supplementary and
              secondary reads.  Same number reported by samtools view -c -F 0x900.

              filtered sequences - number of discarded reads when using -f or -F option.

              sequences - number of processed reads.

              is sorted - flag indicating whether the file is coordinate sorted (1) or not (0).

              1st fragments - number of first fragment reads (flags 0x01 not set; or  flags  0x01
              and 0x40 set, 0x80 not set).

              last  fragments  - number of last fragment reads (flags 0x01 and 0x80 set, 0x40 not
              set).

              reads mapped - number of reads, paired or single, that are mapped (flag 0x4 or  0x8
              not set).

              reads  mapped and paired - number of mapped paired reads (flag 0x1 is set and flags
              0x4 and 0x8 are not set).

              reads unmapped - number of unmapped reads (flag 0x4 is set).

              reads properly paired - number of mapped paired reads with flag 0x2 set.

              paired - number of paired reads, mapped or unmapped, that are neither secondary nor
              supplementary (flag 0x1 is set and flags 0x100 (256) and 0x800 (2048) are not set).

              reads duplicated - number of duplicate reads (flag 0x400 (1024) is set).

              reads MQ0 - number of mapped reads with mapping quality 0.

              reads  QC failed - number of reads that failed the quality checks (flag 0x200 (512)
              is set).

              non-primary alignments - number of secondary reads (flag 0x100 (256) set).

              supplementary alignments - number of supplementary reads (flag 0x800 (2048) set).

              total length - number of processed bases from reads that are neither secondary  nor
              supplementary (flags 0x100 (256) and 0x800 (2048) are not set).

              total  first  fragment  length  -  number  of  processed bases that belong to first
              fragments.

              total last fragment length  -  number  of  processed  bases  that  belong  to  last
              fragments.

              bases mapped - number of processed bases that belong to reads mapped.

              bases mapped (cigar) - number of mapped bases filtered by the CIGAR string
              corresponding to the read they belong to. Only alignment matches (M), insertions
              (I), sequence matches (=) and sequence mismatches (X) are counted.

              bases trimmed - number of bases trimmed by bwa, that belong to non-secondary and
              non-supplementary reads. Enabled by -q option.

              bases duplicated - number of bases that belong to reads duplicated.

              mismatches - number of mismatched bases, as reported by the NM tag associated  with
              a read, if present.

              error rate - ratio between mismatches and bases mapped (cigar).

              average length - ratio between total length and sequences.

              average  first  fragment length - ratio between total first fragment length and 1st
              fragments.

              average last fragment length - ratio between total last fragment  length  and  last
              fragments.

              maximum length - length of the longest read (includes hard-clipped bases).

              maximum first fragment length - length of the longest first fragment read (includes
              hard-clipped bases).

              maximum last fragment length - length of the longest last fragment  read  (includes
              hard-clipped bases).

              average quality - ratio between the sum of base qualities and total length.

              insert  size  average  - the average absolute template length for paired and mapped
              reads.

              insert size standard deviation - standard deviation for the average template length
              distribution.

              inward  oriented  pairs  -  number of paired reads with flag 0x40 (64) set and flag
              0x10 (16) not set or with flag 0x80 (128) set and flag 0x10 (16) set.

              outward oriented pairs - number of paired reads with flag 0x40 (64)  set  and  flag
              0x10 (16) set or with flag 0x80 (128) set and flag 0x10 (16) not set.

              pairs with other orientation - number of paired reads that don't fall in any of the
              above two categories.

              pairs on different chromosomes  -  number  of  pairs  where  one  read  is  on  one
              chromosome and the pair read is on a different chromosome.

              percentage  of  properly  paired reads - percentage of reads properly paired out of
              sequences.

              bases inside the target - number of bases  inside  the  target  region(s)  (when  a
              target file is specified with -t option).

              percentage of target genome with coverage > VAL - percentage of target bases with a
              coverage larger than VAL. By default, VAL is 0, but a custom value can be  supplied
              by the user with -g option.
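
       Several SN values are defined as ratios of other SN values, so they can be recomputed
       from the output.  The sketch below assumes that each SN line has the tab-separated
       layout "SN<TAB>key:<TAB>value", optionally followed by a comment field; the helper name
       parse_sn and the sample numbers are made up for illustration.

```python
def parse_sn(text):
    """Collect SN rows into a dict keyed by the field name (colon stripped)."""
    sn = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3 or fields[0] != "SN":
            continue  # skip comments and other sections
        key = fields[1].rstrip(":")
        try:
            sn[key] = int(fields[2])
        except ValueError:
            sn[key] = float(fields[2])
    return sn


# Made-up values, not real samtools output.
sample = "\n".join([
    "# illustrative input",
    "SN\tsequences:\t4",
    "SN\ttotal length:\t400",
    "SN\tbases mapped (cigar):\t380",
    "SN\tmismatches:\t19\t# a trailing comment field is ignored",
    "RL\t100\t4",
])

sn = parse_sn(sample)
# "error rate" is described as mismatches / bases mapped (cigar).
error_rate = sn["mismatches"] / sn["bases mapped (cigar)"]
# "average length" is described as total length / sequences.
average_length = sn["total length"] / sn["sequences"]
print(error_rate, average_length)
```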

       The FFQ and LFQ sections report the quality distribution per first/last fragment and per
       cycle number.  They have one row per cycle (reported as the first column after the FFQ/LFQ
       key) with remaining columns being the observed integer counts per quality value, starting
       at quality 0 in the left-most column and ending at the largest observed quality.  Thus
       each row forms its own quality distribution and any cycle specific quality artefacts can
       be observed.
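
       Because each row is a histogram over quality values starting at quality 0, a per-cycle
       mean quality can be derived from it.  The following is a minimal sketch assuming
       tab-separated fields; the helper name mean_quality and the sample row are hypothetical.

```python
def mean_quality(row):
    """Return (cycle, mean quality) for one FFQ or LFQ row."""
    fields = row.split("\t")
    cycle = int(fields[1])
    # The i-th count after the cycle column belongs to quality value i.
    counts = [int(x) for x in fields[2:]]
    total = sum(counts)
    if total == 0:
        return cycle, 0.0
    weighted = sum(qual * n for qual, n in enumerate(counts))
    return cycle, weighted / total


# Made-up row: 10 bases at quality 2 and 30 bases at quality 3 in cycle 1.
row = "FFQ\t1\t0\t0\t10\t30"
cycle, mean = mean_quality(row)
print(cycle, mean)
```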

       GCF and GCL report the total GC content of each fragment, separated into  first  and  last
       fragments.  The columns show the GC percentile (between 0 and 100) and an integer count of
       fragments at that percentile.

       GCC, FBC and LBC report the nucleotide content per cycle either combined  (GCC)  or  split
       into  first  (FBC)  and last (LBC) fragments.  The columns are cycle number (integer), and
       percentage counts for A, C, G, T, N  and  other  (typically  containing  ambiguity  codes)
       normalised against the total counts of A, C, G and T only (excluding N and other).
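
       The normalisation means that the A, C, G and T percentages sum to 100, while the N and
       "other" percentages are expressed relative to that same ACGT total.  A small sketch with
       made-up counts (the function name base_percentages is hypothetical):

```python
def base_percentages(counts):
    """Express every count as a percentage of the A+C+G+T total only."""
    acgt_total = sum(counts.get(base, 0) for base in "ACGT")
    if acgt_total == 0:
        return {}
    return {base: 100.0 * n / acgt_total for base, n in counts.items()}


# Made-up counts for a single cycle.
counts = {"A": 30, "C": 20, "G": 20, "T": 30, "N": 10, "other": 0}
pct = base_percentages(counts)
print(pct)
```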

       GCT  offers  a similar report to GCC, but whereas GCC counts nucleotides as they appear in
       the SAM output (in reference orientation), GCT takes into  account  whether  a  nucleotide
       belongs to a reverse complemented read and counts it in the original read orientation.  If
       there are no reverse complemented reads in a  file,  the  GCC  and  GCT  reports  will  be
       identical.
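
       The difference between the two views can be illustrated by reverse complementing reads
       that have the REVERSE flag (0x10) set before counting.  This is a sketch of the idea
       only; in particular, indexing cycles by position in the stored sequence for the GCC-like
       count is an assumption, and the function names are hypothetical.

```python
from collections import Counter

REVERSE = 0x10
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def original_orientation(seq, flag):
    """Undo the reverse complement applied to reverse-strand records."""
    if flag & REVERSE:
        return seq.translate(COMPLEMENT)[::-1]
    return seq


def per_cycle_counts(reads, read_oriented):
    """Count bases per 1-based position, optionally in read orientation."""
    cycles = {}
    for seq, flag in reads:
        if read_oriented:
            seq = original_orientation(seq, flag)
        for cycle, base in enumerate(seq, start=1):
            cycles.setdefault(cycle, Counter())[base] += 1
    return cycles


# Two made-up records with the same stored sequence, one on each strand.
reads = [("AACG", 0x0), ("AACG", 0x10)]
gcc_like = per_cycle_counts(reads, read_oriented=False)
gct_like = per_cycle_counts(reads, read_oriented=True)
print(gcc_like[1], gct_like[1])
```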

       FTC  and  LTC  report  the  total  numbers  of  nucleotides  for first and last fragments,
       respectively. The columns are the raw counters for A, C, G, T and N bases.

       MPC reports the number of mismatches per cycle and per quality value.  The MPC  statistics
       are  only  included when a reference is specified via the -r option.  There is one row per
       cycle number.  Each row includes the cycle number, the number of N bases (not  counted  in
       the  per-qual  columns),  followed  by  one column per quality value (starting at zero and
       incrementing by one each time) listing the number of non-N mismatches with  that  quality.
       A  mismatch  is  defined  as  an  ACGT  sequence  base mismatching an ACGT reference base.
       Ambiguity codes are ignored (except for sequence N as mentioned above,  which  is  counted
       even when the reference is also N).
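
       The mismatch definition above can be sketched for a single cycle as follows.  The input
       is a hypothetical list of (read base, reference base, quality) tuples already extracted
       from the alignments, with upper-case bases assumed; this is not how samtools itself
       walks the data.

```python
ACGT = set("ACGT")


def classify_base(read_base, ref_base):
    """Return 'N', 'mismatch' or 'other' following the textual definition."""
    if read_base == "N":
        return "N"  # counted separately, even if the reference is also N
    if read_base in ACGT and ref_base in ACGT and read_base != ref_base:
        return "mismatch"
    return "other"  # matches and ambiguity codes are not counted


def mpc_row(observations, max_qual):
    """Build [N count, mismatches at qual 0, qual 1, ...] for one cycle."""
    n_bases = 0
    per_qual = [0] * (max_qual + 1)
    for read_base, ref_base, qual in observations:
        kind = classify_base(read_base, ref_base)
        if kind == "N":
            n_bases += 1
        elif kind == "mismatch":
            per_qual[qual] += 1
    return [n_bases] + per_qual


# Made-up observations for one cycle.
observations = [
    ("A", "A", 3),  # match
    ("C", "T", 2),  # mismatch at quality 2
    ("N", "N", 0),  # sequence N
    ("R", "A", 1),  # ambiguity code, ignored
    ("G", "C", 2),  # mismatch at quality 2
]
row = mpc_row(observations, max_qual=3)
print(row)
```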

       BCC,  CRC,  OXC  and RXC are the barcode equivalent of GCC, showing nucleotide content for
       the barcode tags BC, CR, OX and RX respectively.  Their quality value distributions are
       in  the QTQ, CYQ, BZQ and QXQ sections, corresponding to the BC/QT, CR/CY, OX/BZ and RX/QX
       SAM format sequence/quality tags.  These  quality  value  distributions  follow  the  same
       format  used in the FFQ and LFQ sections. All these section names are followed by a number
       (1 or 2), indicating that the stats figures below them correspond to the first  or  second
       barcode  (in  the  case of dual indexing). Thus, these sections will appear as BCC1, CRC1,
       OXC1 and RXC1, accompanied by their quality correspondents QTQ1, CYQ1, BZQ1 and QXQ1. If a
       separator is present in the barcode sequence (usually a hyphen), indicating dual indexing,
       then sections ending in "2" will also be reported to show the second tag statistics  (e.g.
       both BCC1 and BCC2 are present).
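
       Splitting a dual-index barcode on its separator might look like the sketch below.  The
       hyphen is taken from the text as the usual separator; other separators are not handled,
       and the function name split_barcode is hypothetical.

```python
def split_barcode(barcode, separator="-"):
    """Return (first, second); second is None when no separator is present."""
    first, found, second = barcode.partition(separator)
    return (first, second) if found else (first, None)


# Made-up barcode values.
for tag_value in ("ACGTAC-TTGACA", "ACGTAC"):
    first, second = split_barcode(tag_value)
    sections = ["BCC1"] + (["BCC2"] if second is not None else [])
    print(tag_value, first, second, sections)
```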

       IS  reports insert size distributions with one row per size, reported in the first column,
       with subsequent columns for the frequency of total pairs, inward oriented pairs, outward
       oriented pairs and other orientation pairs.  The -i option specifies the maximum insert
       size
       reported.
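
       As an illustration of the IS layout, the rows can be read into named tuples and searched
       for the most frequent insert size.  The sketch assumes tab-separated fields with integer
       counts; the names InsertRow and parse_is_rows and the sample rows are made up.

```python
from collections import namedtuple

# Column order as described in the text.
InsertRow = namedtuple("InsertRow", "size total inward outward other")


def parse_is_rows(text):
    """Return one InsertRow per IS line, ignoring every other line."""
    rows = []
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "IS" and len(fields) >= 6:
            rows.append(InsertRow(*(int(x) for x in fields[1:6])))
    return rows


# Made-up rows, not real samtools output.
sample = (
    "IS\t0\t0\t0\t0\t0\n"
    "IS\t150\t12\t10\t1\t1\n"
    "IS\t151\t20\t18\t2\t0\n"
)
rows = parse_is_rows(sample)
most_common = max(rows, key=lambda r: r.total)
print(most_common.size, most_common.total)
```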

       RL reports the distribution for all read lengths, with one row per observed length (up  to
       the maximum specified by the -l option).  Columns are read length and frequency.  FRL and
       LRL contain the same information separated into first and last fragments.

       MAPQ reports the  mapping  qualities  for  the  mapped  reads,  ignoring  the  duplicates,
       supplementary, secondary and failing quality reads.

       ID  reports  the  distribution of indel sizes, with one row per observed size. The columns
       are size, frequency of insertions at that size and frequency of deletions at that size.

       IC reports the frequency of indels occurring per cycle, broken down by  both  insertion  /
       deletion  and by first / last read.  Note for multi-base indels this only counts the first
       base location.  Columns are cycle, number of insertions  in  first  fragments,  number  of
       insertions  in  last  fragments,  number  of  deletions  in first fragments, and number of
       deletions in last fragments.

       COV reports a distribution of the alignment depth per covered reference site.  For example
       an  average  depth  of 50 would ideally result in a normal distribution centred on 50, but
       the presence of repeats or copy-number variation may reveal multiple peaks at  approximate
       multiples  of  50.   The  first column is an inclusive coverage range in the form of [min-
       max].  The next columns are a repeat of the maximum portion of the depth range (now  as  a
       single integer) and the frequency that depth range was observed.  The minimum, maximum and
       range step size are controlled by the -c option.  Depths above and below the  minimum  and
       maximum are reported with ranges [<min] and [max<].
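
       The mapping from a depth to its range label can be sketched as below, using the default
       -c values of 1,1000,1.  This follows the description in the text only; the exact bin
       boundaries produced by samtools for STEP values larger than 1 are an assumption, and the
       function name coverage_label is hypothetical.

```python
def coverage_label(depth, cov_min=1, cov_max=1000, step=1):
    """Return the bracketed range label for a depth (illustrative only)."""
    if depth < cov_min:
        return f"[<{cov_min}]"
    if depth > cov_max:
        return f"[{cov_max}<]"
    # Lower edge of the bin containing this depth; upper edge is inclusive.
    low = cov_min + ((depth - cov_min) // step) * step
    high = min(low + step - 1, cov_max)
    return f"[{low}-{high}]"


for depth in (0, 1, 50, 1000, 1500):
    print(depth, coverage_label(depth))
print(7, coverage_label(7, cov_min=1, cov_max=100, step=5))
```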

       GCD  reports  the  GC  content of the reference data aligned against per alignment record,
       with one row per observed GC percentage reported as the first column and  sorted  on  this
       column.   The  second column is a total sequence percentile, as a running total (ending at
       100%).  The first and second columns may be used to produce a simple  distribution  of  GC
       content.  Subsequent columns list the coverage depth at 10th, 25th, 50th, 75th and 90th GC
       percentiles for this specific GC percentage, revealing any  GC  bias  in  mapping.   These
       columns are averaged depths, so are floating point with no maximum value.
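
       Since the second column is a running total, a simple GC distribution can be recovered by
       differencing consecutive rows.  The sketch below works on (GC percentage, running total)
       pairs that are assumed to be already parsed and sorted; the values are made up.

```python
def gc_distribution(rows):
    """Turn (gc, running total %) pairs into (gc, % of sequences) pairs."""
    distribution = []
    previous = 0.0
    for gc, cumulative in rows:
        distribution.append((gc, cumulative - previous))
        previous = cumulative
    return distribution


# Made-up first and second GCD columns.
rows = [(30.0, 10.0), (40.0, 60.0), (50.0, 100.0)]
dist = gc_distribution(rows)
print(dist)
```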

OPTIONS

       -c, --coverage MIN,MAX,STEP
               Set  coverage  distribution  to  the  specified range (MIN, MAX, STEP all given as
               integers) [1,1000,1]

       -d, --remove-dups
               Exclude from statistics reads marked as duplicates

       -f, --required-flag STR|INT
               Required flag, 0 for unset. See also `samtools flags` [0]

       -F, --filtering-flag STR|INT
               Filtering flag, 0 for unset. See also `samtools flags` [0]

       --GC-depth FLOAT
               the size of GC-depth bins (decreasing bin size increases memory requirement) [2e4]

       -h, --help
               This help message

       -i, --insert-size INT
               Maximum insert size [8000]

       -I, --id STR
               Include only listed read group or sample name []

       -l, --read-length INT
               Include in the statistics only reads with the given read length [-1]

       -m, --most-inserts FLOAT
               Report only the main part of inserts [0.99]

       -P, --split-prefix STR
               A path or string prefix to prepend to filenames output when  creating  categorised
               statistics files with -S/--split.  [input filename]

       -q, --trim-quality INT
               The BWA trimming parameter [0]

       -r, --ref-seq FILE
               Reference  sequence  (required for GC-depth and mismatches-per-cycle calculation).
               []

       -S, --split TAG
               In addition to the complete statistics, also output categorised  statistics  based
               on the tagged field TAG (e.g., use --split RG to split into read groups).

               Categorised  statistics are written to files named <prefix>_<value>.bamstat, where
               prefix is as given by --split-prefix (or the input filename by default) and  value
               has  been  encountered  as  the  specified  tagged  field's  value  in one or more
               alignment records.

       -t, --target-regions FILE
               Do  stats  in  these  regions  only.  Tab-delimited  file  chr,from,to,   1-based,
               inclusive.  []

       -x, --sparse
               Suppress outputting IS rows where there are no inserts (i.e. no pairs were
               observed with that insert size).

       -p, --remove-overlaps
               Remove overlaps of paired-end reads from coverage and base count computations.

       -g, --cov-threshold INT
               Only  bases  with  coverage  above  this  value  will  be  included  in the target
               percentage computation [0]

       -X      If this option is set, it allows the user to specify customized index file
               location(s) if the data folder does not contain any index file.  Example usage:
               samtools stats [options] -X /data_folder/data.bam /index_folder/data.bai chrM:1-10

       -@, --threads INT
               Number of input/output compression threads to use in addition to main thread [0].
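
       The effect of combining the -f and -F options can be sketched as a bitwise test on each
       record's FLAG value.  This reflects the usual samtools convention (all required bits
       set, no filtering bit set) and is an illustration rather than the program's actual code;
       the function name passes_flag_filters is hypothetical.

```python
def passes_flag_filters(flag, required=0, filtering=0):
    """True if every bit in `required` is set and no bit in `filtering` is set."""
    return (flag & required) == required and (flag & filtering) == 0


# 0x900 = secondary (0x100) + supplementary (0x800), as in "-F 0x900".
for flag in (0x63, 0x163, 0x863, 0x4):
    print(hex(flag), passes_flag_filters(flag, required=0x1, filtering=0x900))
```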

AUTHOR

       Written by Petr Danecek with major modifications by Nicholas Clarke, Martin Pollard, Josh
       Randall, and Valeriu Ohan, all from the Sanger Institute.

SEE ALSO

       samtools(1), samtools-flagstat(1), samtools-idxstats(1)

       Samtools website: <http://www.htslib.org/>