Provided by: samtools_1.22.1-1_amd64 bug

NAME

       samtools-stats - produces comprehensive statistics from alignment file

SYNOPSIS

       samtools stats [options] in.sam|in.bam|in.cram [region...]

DESCRIPTION

       samtools  stats  collects  statistics  from  BAM  files  and outputs in a text format.  The output can be
       visualized graphically using plot-bamstats.

       A summary of output sections is listed below, followed by more detailed descriptions.

       CHK    Checksum
       SN     Summary numbers
       FFQ    First fragment qualities
       LFQ    Last fragment qualities
       GCF    GC content of first fragments
       GCL    GC content of last fragments
       GCC    ACGT content per cycle
       GCT    ACGT content per cycle, read oriented
       FBC    ACGT content per cycle for first fragments only
       FTC    ACGT raw counters for first fragments
       LBC    ACGT content per cycle for last fragments only
       LTC    ACGT raw counters for last fragments
       BCC    ACGT content per cycle for BC barcode
       CRC    ACGT content per cycle for CR barcode
       OXC    ACGT content per cycle for OX barcode
       RXC    ACGT content per cycle for RX barcode
       MPC    Mismatch distribution per cycle
       QTQ    Quality distribution for BC barcode
       CYQ    Quality distribution for CR barcode
       BZQ    Quality distribution for OX barcode
       QXQ    Quality distribution for RX barcode
       IS     Insert sizes
       RL     Read lengths
       FRL    Read lengths for first fragments only
       LRL    Read lengths for last fragments only
       MAPQ   Mapping qualities
       ID     Indel size distribution
       IC     Indels per cycle
       COV    Coverage (depth) distribution
       GCD    GC-depth

       The "cycle" terminology used here originates from the Illumina instruments, but it  is  interpreted  more
       generally as the Nth base reported in the original read orientation (starting from 1).

       Not  all  sections  will  be reported as some depend on the data being coordinate sorted while others are
       only present when specific barcode tags are in use.

       Some of the statistics are collected for “first”  or  “last”  fragments.   Records  are  put  into  these
       categories using the PAIRED (0x1), READ1 (0x40) and READ2 (0x80) flag bits, as follows:

       •   Unpaired  reads (i.e. PAIRED is not set) are all “first” fragments.  For these records, the READ1 and
           READ2 flags are ignored.

       •   Reads where PAIRED and READ1 are set, and READ2 is not set are “first” fragments.

       •   Reads where PAIRED and READ2 are set, and READ1 is not set are “last” fragments.

       •   Reads where PAIRED is set and either both READ1 and READ2 are set or neither is set are  not  counted
           in either category.

       Information   on   the   meaning   of   the   flags   is   given   in   the  SAM  specification  document
       <https://samtools.github.io/hts-specs/SAMv1.pdf>.

       The CHK row contains distinct CRC32 checksums of read names, sequences and quality values.  The checksums
       are computed per alignment record and summed, meaning the checksum does not change if the input file  has
       the sort-order changed.
       NOTE:  Checksum  calculation for quality values was modified and quality checksum value will be different
       from that generated using versions up to 1.21.

       The SN section contains a series of counts, percentages, and averages, in a  similar  style  to  samtools
       flagstat, but more comprehensive.

              raw  total  sequences  -  total  number  of reads in a file, excluding supplementary and secondary
              reads.  Same number reported by samtools view -c -F 0x900.

              filtered sequences - number of discarded reads when using -f or -F option.

              sequences - number of processed reads.

              is sorted - flag indicating whether the file is coordinate sorted (1) or not (0).

              1st fragments - number of first fragment reads (flags 0x01 not set; or flags 0x01  and  0x40  set,
              0x80 not set).

              last fragments - number of last fragment reads (flags 0x01 and 0x80 set, 0x40 not set).

              reads mapped - number of reads, paired or single, that are mapped (flag 0x4 or 0x8 not set).

              reads mapped and paired - number of mapped paired reads (flag 0x1 is set and flags 0x4 and 0x8 are
              not set).

              reads unmapped - number of unmapped reads (flag 0x4 is set).

              reads properly paired - number of mapped paired reads with flag 0x2 set.

              paired  - number of paired reads, mapped or unmapped, that are neither secondary nor supplementary
              (flag 0x1 is set and flags 0x100 (256) and 0x800 (2048) are not set).

              reads duplicated - number of duplicate reads (flag 0x400 (1024) is set).

              reads MQ0 - number of mapped reads with mapping quality 0.

              reads QC failed - number of reads that failed the quality checks (flag 0x200 (512) is set).

              non-primary alignments - number of secondary reads (flag 0x100 (256) set).

              supplementary alignments - number of supplementary reads (flag 0x800 (2048) set).

              total length - number of processed bases from reads that are neither secondary  nor  supplementary
              (flags 0x100 (256) and 0x800 (2048) are not set).

              total first fragment length - number of processed bases that belong to first fragments.

              total last fragment length - number of processed bases that belong to last fragments.

              bases mapped - number of processed bases that belong to reads mapped.

              bases  mapped  (cigar)  - number of mapped bases filtered by the CIGAR string corresponding to the
              read they belong to. Only alignment  matches(M),  inserts(I),  sequence  matches(=)  and  sequence
              mismatches(X) are counted.

              bases trimmed - number of bases trimmed by bwa, that belong to non secondary and non supplementary
              reads. Enabled by -q option.

              bases duplicated - number of bases that belong to reads duplicated.

              mismatches  -  number  of  mismatched  bases, as reported by the NM tag associated with a read, if
              present.

              error rate - ratio between mismatches and bases mapped (cigar).

              average length - ratio between total length and sequences.

              average first fragment length - ratio between total first fragment length and 1st fragments.

              average last fragment length - ratio between total last fragment length and last fragments.

              maximum length - length of the longest read (includes hard-clipped bases).

              maximum first fragment length - length of the longest first fragment read  (includes  hard-clipped
              bases).

              maximum  last  fragment  length  - length of the longest last fragment read (includes hard-clipped
              bases).

              average quality - ratio between the sum of base qualities and total length.

              insert size average - the average absolute template length for paired and mapped reads.

              insert size standard deviation - standard deviation for the average template length distribution.

              inward oriented pairs - number of paired reads with flag 0x40 (64) set and flag 0x10 (16) not  set
              or with flag 0x80 (128) set and flag 0x10 (16) set.

              outward  oriented pairs - number of paired reads with flag 0x40 (64) set and flag 0x10 (16) set or
              with flag 0x80 (128) set and flag 0x10 (16) not set.

              pairs with other orientation - number of paired reads that don't fall in  any  of  the  above  two
              categories.

              pairs  on different chromosomes - number of pairs where one read is on one chromosome and the pair
              read is on a different chromosome.

              percentage of properly paired reads - percentage of reads properly paired out of sequences.

              bases inside the target - number of bases inside the target  region(s)  (when  a  target  file  is
              specified with -t option).

              percentage  of  target  genome  with  coverage  > VAL - percentage of target bases with a coverage
              larger than VAL. By default, VAL is 0, but a custom value can be supplied  by  the  user  with  -g
              option.

       The  FFQ  and  LFQ sections report the quality distribution per first/last fragment and per cycle number.
       They have one row per cycle (reported as the first column after the FFQ/LFQ key) with  remaining  columns
       being  the  observed  integer  counts  per  quality value, starting at quality 0 in the left-most row and
       ending at the largest observed quality.  Thus each row forms its own quality distribution and  any  cycle
       specific quality artefacts can be observed.

       GCF  and  GCL report the total GC content of each fragment, separated into first and last fragments.  The
       columns show the GC percentile (between 0 and 100) and an integer count of fragments at that percentile.

       GCC, FBC and LBC report the nucleotide content per cycle either combined (GCC) or split into first  (FBC)
       and  last (LBC) fragments.  The columns are cycle number (integer), and percentage counts for A, C, G, T,
       N and other (typically containing ambiguity codes) normalised against the total counts of A, C, G  and  T
       only (excluding N and other).

       GCT  offers  a similar report to GCC, but whereas GCC counts nucleotides as they appear in the SAM output
       (in reference orientation), GCT takes into account whether a nucleotide belongs to a reverse complemented
       read and counts it in the original read orientation.  If there are no reverse  complemented  reads  in  a
       file, the GCC and GCT reports will be identical.

       FTC  and  LTC  report  the  total  numbers of nucleotides for first and last fragments, respectively. The
       columns are the raw counters for A, C, G, T and N bases.

       MPC reports the number of mismatches per cycle and per  quality  value.   The  MPC  statistics  are  only
       included  when  a reference is specified via the -r option.  There is one row per cycle number.  Each row
       includes the cycle number, the number of N bases (not counted in the per-qual columns), followed  by  one
       column per quality value (starting at zero and incrementing by one each time) listing the number of non-N
       mismatches  with  that  quality.   A  mismatch  is  defined  as an ACGT sequence base mismatching an ACGT
       reference base.  Ambiguity codes are ignored (except for sequence N as mentioned above, which is  counted
       even when the reference is also N).

       BCC,  CRC, OXC and RXC are the barcode equivalent of GCC, showing nucleotide content for the barcode tags
       BC, CR, OX and RX respectively.  Their quality values distributions are in the  QTQ,  CYQ,  BZQ  and  QXQ
       sections,  corresponding  to  the  BC/QT, CR/CY, OX/BZ and RX/QX SAM format sequence/quality tags.  These
       quality value distributions follow the same format used in the FFQ and LFQ sections.  All  these  section
       names  are  followed by a number (1 or 2), indicating that the stats figures below them correspond to the
       first or second barcode (in the case of dual indexing). Thus, these sections will appear as  BCC1,  CRC1,
       OXC1  and  RXC1, accompanied by their quality correspondents QTQ1, CYQ1, BZQ1 and QXQ1. If a separator is
       present in the barcode sequence (usually a hyphen), indicating dual indexing, then sections ending in "2"
       will also be reported to show the second tag statistics (e.g. both BCC1 and BCC2 are present).

       IS reports insert size distributions with one row per size, reported in the first column, with subsequent
       columns for the frequency of  total  pairs,  inward  oriented  pairs,  outward  orient  pairs  and  other
       orientation pairs.  The -i option specifies the maximum insert size reported.

       RL  reports  the  distribution  for all read lengths, with one row per observed length (up to the maximum
       specified by the -l option).  Columns are read length and frequency.   FRL  and  LRL  contains  the  same
       information separated into first and last fragments.

       MAPQ  reports  the  mapping  qualities  for  the  mapped  reads,  ignoring the duplicates, supplementary,
       secondary and failing quality reads.

       ID reports the distribution of indel sizes, with one  row  per  observed  size.  The  columns  are  size,
       frequency of insertions at that size and frequency of deletions at that size.

       IC  reports  the frequency of indels occurring per cycle, broken down by both insertion / deletion and by
       first / last read.  Note for multi-base indels this only counts the first  base  location.   Columns  are
       cycle,  number  of  insertions  in  first  fragments,  number  of insertions in last fragments, number of
       deletions in first fragments, and number of deletions in last fragments.

       COV reports a distribution of the alignment depth per covered reference site.   For  example  an  average
       depth  of  50 would ideally result in a normal distribution centred on 50, but the presence of repeats or
       copy-number variation may reveal multiple peaks at approximate multiples of 50.  The first column  is  an
       inclusive  coverage range in the form of [min-max].  The next columns are a repeat of the maximum portion
       of the depth range (now as a single integer) and the  frequency  that  depth  range  was  observed.   The
       minimum, maximum and range step size are controlled by the -c option.  Depths above and below the minimum
       and maximum are reported with ranges [<min] and [max<].

       GCD  reports  the GC content of the reference data aligned against per alignment record, with one row per
       observed GC percentage reported as the first column and sorted on this column.  The second  column  is  a
       total sequence percentile, as a running total (ending at 100%).  The first and second columns may be used
       to  produce  a  simple  distribution  of GC content.  Subsequent columns list the coverage depth at 10th,
       25th, 50th, 75th and 90th GC percentiles for this specific  GC  percentage,  revealing  any  GC  bias  in
       mapping.  These columns are averaged depths, so are floating point with no maximum value.

OPTIONS

       -c, --coverage MIN,MAX,STEP
               Set  coverage  distribution  to  the  specified  range  (MIN,  MAX,  STEP  all given as integers)
               [1,1000,1]

       -d, --remove-dups
               Exclude from statistics reads marked as duplicates

       -f, --required-flag STR|INT
               Required flag, 0 for unset. See also `samtools flags` [0]

       -F, --filtering-flag STR|INT
               Filtering flag, 0 for unset. See also `samtools flags` [0]

       --GC-depth FLOAT
               the size of GC-depth bins (decreasing bin size increases memory requirement) [2e4]

       -h, --help
               This help message

       -i, --insert-size INT
               Maximum insert size [8000]

       -I, --id STR
               Include only listed read group or sample name []

       -l, --read-length INT
               Include in the statistics only reads with the given read length [-1]

       -m, --most-inserts FLOAT
               Report only the main part of inserts [0.99]

       -P, --split-prefix STR
               A path or string prefix to prepend to filenames output when creating categorised statistics files
               with -S/--split.  [input filename]

       -q, --trim-quality INT
               The BWA trimming parameter [0]

       -r, --ref-seq FILE
               Reference sequence (required for GC-depth and mismatches-per-cycle calculation).  []

       -S, --split TAG
               In addition to the complete statistics, also output categorised statistics based  on  the  tagged
               field TAG (e.g., use --split RG to split into read groups).

               Categorised  statistics  are  written to files named <prefix>_<value>.bamstat, where prefix is as
               given by --split-prefix (or the input filename by default) and value has been encountered as  the
               specified tagged field's value in one or more alignment records.

       -t, --target-regions FILE
               Do stats in these regions only. Tab-delimited file chr,from,to, 1-based, inclusive.  []

       -x, --sparse
               Suppress outputting IS rows where there are no insertions.

       -p, --remove-overlaps
               Remove overlaps of paired-end reads from coverage and base count computations.

       -g, --cov-threshold INT
               Only  bases  with coverage above this value will be included in the target percentage computation
               [0]

       -X      If this option is set, it will allows user to specify customized index file  location(s)  if  the
               data  folder  does  not  contain  any  index  file.   Example  usage: samtools stats [options] -X
               /data_folder/data.bam /index_folder/data.bai chrM:1-10

       -@, --threads INT
               Number of input/output compression threads to use in addition to main thread [0].

AUTHOR

       Written by Petr Danacek with major modifications by Nicholas Clarke, Martin Pollard,  Josh  Randall,  and
       Valeriu Ohan, all from the Sanger Institute.

SEE ALSO

       samtools(1), samtools-flagstat(1), samtools-idxstats(1)

       Samtools website: <http://www.htslib.org/>

samtools-1.22.1                                   14 July 2025                                 samtools-stats(1)