Ubuntu Manpage: samtools-ampliconstats - produces statistics from amplicon sequencing alignment file

NAME

       samtools-ampliconstats - produces statistics from amplicon sequencing alignment file

SYNOPSIS

       samtools ampliconstats [options] primers.bed in.sam|in.bam|in.cram...

DESCRIPTION

       samtools  ampliconstats  collects  statistics  from  one or more input alignment files and
       produces tables in text format.  The output can  be  visualized  graphically  using  plot-
       ampliconstats.

       The alignment files should have previously been clipped of primer sequence, for example by
       "samtools ampliconclip" and the sites of these primers should be specified as a  bed  file
       in  the  arguments.   Each  amplicon must be present in the bed file with one or more LEFT
       primers (direction "+") followed by one or more RIGHT primers.  For example:

         MN908947.3  1875  1897  nCoV-2019_7_LEFT        60  +
         MN908947.3  1868  1890  nCoV-2019_7_LEFT_alt0   60  +
         MN908947.3  2247  2269  nCoV-2019_7_RIGHT       60  -
         MN908947.3  2242  2264  nCoV-2019_7_RIGHT_alt5  60  -
         MN908947.3  2181  2205  nCoV-2019_8_LEFT        60  +
         MN908947.3  2568  2592  nCoV-2019_8_RIGHT       60  -

       Ampliconstats will identify which  read  belongs  to  which  amplicon.   For  purposes  of
       computing  coverage  statistics  for  amplicons  with  multiple  primer  choices, only the
       innermost primer locations are used.

       A summary of output sections is listed below, followed by more detailed descriptions.

       SS          Amplicon and file counts.  Always comes first
       AMPLICON    Amplicon primer locations
       FSS         File specific: summary stats
       FRPERC      File specific: read percentage distribution between amplicons
       FDEPTH      File specific: average read depth per amplicon
       FVDEPTH     File specific: average read depth per amplicon, full length only
       FREADS      File specific: numbers of reads per amplicon
       FPCOV       File specific: percent coverage per amplicon
       FTCOORD     File specific: template start,end coordinate frequencies per amplicon
       FAMP        File specific: amplicon correct / double / treble length counts
       FDP_ALL     File specific: template depth per reference base, all templates
       FDP_VALID   File specific: template depth per reference base, valid templates only
       CSS         Combined  summary stats
       CRPERC      Combined: read percentage distribution between amplicons
       CDEPTH      Combined: average read depth per amplicon
       CVDEPTH     Combined: average read depth per amplicon, full length only
       CREADS      Combined: numbers of reads per amplicon
       CPCOV       Combined: percent coverage per amplicon
       CTCOORD     Combined: template coordinates per amplicon
       CAMP        Combined: amplicon correct / double / treble length counts
       CDP_ALL     Combined: template depth per reference base, all templates
       CDP_VALID   Combined: template depth per reference base, valid templates only

       File specific sections start with both the section key and the  filename  basename  (minus
       directory and .sam, .bam or .cram suffix).

       Note  that  the file specific sections are interleaved, ordered first by file and secondly
       by the file specific stats.  To collate them together, use "grep" to pull out all data  of
       a specific type.

       The  combined  sections  (C*) follow the same format as the file specific sections, with a
       different key.  For simplicity of parsing they also have a filename column which is filled
       out with "COMBINED".  These rows contain stats aggregated across all input files.

SS / AMPLICON

       This  section is once per file and includes summary information to be utilised for scaling
       of plots, for example the total number  of  amplicons  and  files  present,  tool  version
       number,  and  command  line  arguments.   The second column is the filename or "COMBINED".
       This is followed by the reference name  (unless  single-ref  mode  is  enabled),  and  the
       summary statistic name and value.

       The  AMPLICON  section is a reformatting of the input BED file.  Each line consists of the
       reference name (unless single-ref mode is enable), the amplicon number and  the  start-end
       coordinates of the left and right primers.  Where multiple primers are available these are
       comma separated, for example 10-30,15-40 in the left primer column indicates  two  primers
       have  been  multiplex  together  covering  genome  coordinates  10-30  inclusive and 14-40
       inclusively.

CSS SECTION

       This section consists of summary counts for the entire set of input files.   These may  be
       useful for automatic scaling of plots.

       Number of amplicons   Total number of amplicons listed in primer.bed
       Number of files       Total number of SAM, BAM or CRAM files
       End of summary        Always the last item.  Marker for end of CSS block.

FSS SECTION

       This  lists  summary statistics specific to an individual input file.  The values reported
       are:

       raw total sequences   Total number of sequences found in the file
       filtered sequences    Number of sequences filtered with -F option
       failed primer match   Number of sequences that did not correspond to
                             a known primer location
       matching sequences    Number of sequences allocated to an amplicon

FREADS / CREADS SECTION

       For each amplicon, this simply reports the count of reads that have been assigned  to  it.
       A  read  is  assigned  to  an  amplicon  if  the  start and/or end of the read is within a
       specified number of bases of the primer sites listed in the bed file.   This  distance  is
       controlled via the -m option.

FRPERC / CRPERC SECTION

       For  each amplicon, this lists what percentage of reads were assigned to this amplicon out
       of the total number of assigned reads.  This may be used  to  diagnose  how  uniform  this
       distribution is.

       Note this is a pure read count and has no relation to amplicon size.

FDEPTH / CDEPTH / FVDEPTH / CVDEPTH SECTION

       Using  the  reads  assigned  to  each  amplicon  and  their  start / end locations on that
       reference, computed using the POS and CIGAR fields, we compute the total number  of  bases
       aligned  to  this  amplicon  and corresponding the average depth.  The VDEPTH variants are
       filtered to only include templates with end-to-end coverage across  the  amplicon.   These
       can  be  considered  to  be  "valid"  or  "usable" templates and give an indication of the
       minimum depth for the amplicon rather than the average depth.

       To compute the depth the length of the amplicon is computed using  the  innermost  set  of
       primers, if multiple choices are listed in the bed file.

FPCOV / CPCOV SECTION

       Similar  to  the  FDEPTH  section,  this  is a binary status of covered or not covered per
       position in each amplicon.  This is then expressed as a  percentage  by  dividing  by  the
       amplicon  length,  which  is  computed  using  the  innermost set of primers covering this
       amplicon.

       The minimum depth necessary to constitute a position as  being  "covered"  is  specifiable
       using the -d option.

FTCOORD / CTCOORD / FAMP / CAMP SECTION

       It  is  possible  for  an  amplicon to be produced using incorrect primers, giving rise to
       extra-long amplicons (typically double or treble length).

       The FTCOORD field holds a distribution of observed template  coordinates  from  the  input
       data.   Each  row  consists  of  the  file  name, the amplicon number in question, and tab
       separated tuples of start, end, frequency and status (0 for OK, 1 for skipping amplicon, 2
       for  unknown  location).   Each template is only counted for one amplicon, so if the read-
       pairs span amplicons the count will show up in the left-most amplicon covered.

       Th COORD data may indicate which primers  are  being  utilised  if  there  are  alternates
       available for a given amplicon.

       For  COORD  lines amplicon number 0 holds the frequency data for data that reads that have
       not been assigned to any amplicon.  That is, they may lie within an amplicon, but they  do
       not  start or end at a known primer location.  It is not recorded for BED files containing
       multiple references.

       The FAMP / CAMP section is a simple count per amplicon of the number of  templates  coming
       from  this  amplicon.   Templates  are counted once per amplicon, but and like the FTCOORD
       field if a read-pair spans amplicons it is only counted in the left-most  amplicon.   Each
       line  consists  of the file name, amplicon number and 3 counts for the number of templates
       with both ends within this amplicon, the number of templates with  the  rightmost  end  in
       another  amplicon,  and  the  number  of  templates  where  the other end has failed to be
       assigned to an amplicon.

       Note FAMP / CAMP amplicon number 0 is the summation of data for all amplicons (1 onwards).

FDP_ALL / CDP_ALL / FDP_VALID / CDP_VALID section

       These are for depth plots per base rather than per amplicon.  They distinguish between all
       reads  in  all  templates,  and  only  reads  in templates considered to be "valid".  Such
       templates have both reads (if  paired)  matching  known  primer  locations  from  he  same
       amplicon and have full length coverage across the entire amplicon.

       This FDP_VALID can be considered to be the minimum template depth across the amplicon.

       The  difference  between  the VALID and ALL plots represents additional data that for some
       reason may not be suitable for producing a consensus.  For example an amplicon that  skips
       a primer, pairing 10_LEFT with 12_RIGHT, will have coverage for the first half of amplicon
       10 and the last half of amplicon 12.  Counting the number of reads or bases alone  in  the
       amplicon does not reveal the potential for non-uniformity of coverage.

       The  lines start with the type keyword, file / sample name, reference name (unless single-
       ref mode is enabled), followed by a variable number of tab separated tuples consisting  of
       depth,length.   The  length  field  is a basic form of run-length encoding where all depth
       values within a specified fraction of  each  other  (e.g.  >=  (1-fract)*midpoint  and  <=
       (1+fract)*midpoint)  are  combined into a single run.  This fraction is controlled via the
       -D option.

OPTIONS

-f, --required-flag INT|STR
Only output alignments with all bits set in INT present in the FLAG field. INT
can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) or in octal
by beginning with `0' (i.e. /^0[0-7]+/) [0], or in string form by specifying a
comma-separated list of keywords as listed by the "samtools flags" subcommand.

-F, --filter-flag INT|STR
Do not output alignments with any bits set in INT present in the FLAG field. INT
can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) or in octal
by beginning with `0' (i.e. /^0[0-7]+/) [0], or in string form by specifying a
comma-separated list of keywords as listed by the "samtools flags" subcommand.

-a, --max-amplicons INT
Specify the maximum number of amplicons permitted.

-b, --tcoord-bin INT
Bin the template start,end positions into multiples of NT prior to counting their
frequency and reporting in the FTCOORD / CTCOORD lines. This may be useful for
technologies with higher errors rates where the alignment ends will vary slightly.
Defaults to 1, which is equivalent to no binning.

-c, --tcoord-min-count INT
In the FTCOORD and CTCOORD lines, only record template start,end coordinate
combination if they occur at least INT times.

-d, --min-depth INT
Specifies the minimum base depth to consider a reference position to be covered,
for purposes of the FRPERC and CRPERC sections.

-D, --depth-bin FRACTION
Controls the merging of neighbouring similar depths for the FDP_ALL and FDP_VALID
plots. The default FRACTION is 0.01, meaning depths within +/- 1% of a mid point
will be aggregated together as a run of the same value. This merging is useful to
reduce the file size. Use -D 0 to record every depth.

-l, --max-amplicon-length INT
Specifies the maximum length of any individual amplicon.

-m, --pos-margin INT
Reads are compared against the primer start and end locations specified in the BED
file. An aligned sequence should start precisely at these locations, but
sequencing errors may cause the primer clipping to be a few bases out or for the
alignment to add a few extra bases of soft clip. This option specifies the margin
of error permitted when matching a read to an amplicon number.

-o FILE
Output stats to FILE. The default is to write to stdout.

-s, --use-sample-name
Instead of using the basename component of the input path names, use the SM field
from the first @RG header line.

-S, --single-ref
Force the output format to match the older single-reference style used in Samtools
1.12 and earlier. This removes the reference names from the SS, AMPLICON, DP_ALL
and DP_VALID sections. It cannot be enabled if the input BED file has more than
one reference present. Note that plot-ampliconstats can process both output
styles.

-t, --tlen-adjust INT
Adjust the TLEN field by +/- INT to compensate for primer clipping. This defaults
to zero, but if the primers have been clipped and the TLEN field has not been
updated using samtools fixmate then the template length will be wrong by the sum
of the forward and reverse primer lengths.

This adjustment does not have to be precise as the --pos-margin field permits some
leeway. Hence if required, it should be set to approximately double the average
primer length.

-@ INT Number of BAM/CRAM (de)compression threads to use in addition to main thread [0].

EXAMPLE

       To run ampliconstats on a directory full of CRAM files and then produce a  series  of  PNG
       images named "mydata*.png":

         samtools ampliconstats V3/nCoV-2019.bed /path/*.cram > astats
         plot-ampliconstats -size 1200,900 mydata astats

AUTHOR

       Written by James Bonfield from the Sanger Institute.