lunar (1) axe-demux.1.gz

Provided by: axe-demultiplexer_0.3.3+dfsg-3build1_amd64 bug

NAME

       axe - axe Documentation

       Axe  is  a  read  de-multiplexer,  useful  in  situations where sequence reads contain the
       indexes that uniquely distinguish samples. Axe uses a rapid and accurate  algorithm  based
       on hamming mismatch tries to competitively match the prefix of a sequencing read against a
       set of indexes. Axe supports combinatorial indexing schemes.

       Contents:

AXE TUTORIAL

       In this tutorial, we'll use Axe  to  demultiplex  some  paired-end,  combinatorially-index
       Genotyping-by-Sequencing  reads.  The  data  for this tutorial is available from figshare:
       https://figshare.com/articles/axe-tutorial_tar/6143720 .

       Axe should be run as the initial step of any analysis: don't use sequence  QC  tools  like
       AdapterRemoval or Trimmomatic before using axe, as indexes may be trimmed away, or pairing
       information removed.

   Step 0: Download the trial data
       This will download the trial data, and extract it on the fly:

          curl -LS https://ndownloader.figshare.com/files/11094782 | tar xv

   Step 1: prepare a key file
       The key file associates index sequences with sample names. A key file can be prepared in a
       spreadsheet  editor,  like  LibreOffice Calc, or Excel. The format is quite strict, and is
       described in detail in the online usage documentation.

       Let's now inspect the keyfile I have provided for the tutorial.

          head axe-keyfile.tsv

   Step 2: Demultiplex with Axe
       In this step, we will demultiplex our interleaved input  file  to  per-sample  interleaved
       output  files.  To  see a full range of Axe's options, please run axe-demux -h, or inspect
       the online usage documentation.

       First, let's inspect the input.

          zcat axe-tutorial.fastq.gz | head -n 8

       Then, we need to ensure that axe has  somewhere  to  put  the  demultiplexed  reads.   Axe
       outputs  one file (or more, depending on pairing) per sample. Axe does so by appending the
       sample name to some prefix (as given by the -I, -F, and/or -R options). If this prefix  is
       a  directory,  then  sample  fastq  files  will  be created in that sub-directory, but the
       directory must exist. Let's make an output directory:

          mkdir -p output

       Now, let's demultiplex the reads!

          axe-demux -i axe-tutorial.fastq.gz -I output/ \
             -c -b axe-keyfile.tsv -t demux-stats.tsv -z 1

       The command above demultiplexes reads from axe-tutorial.fastq.gz into separate files under
       output,  based  on  the  combinatorial  (-c) sample-to-index-sequence mapping described in
       axe-keyfile.tsv, and saves a file of statistics as  demux-stats.tsv.  Note  that  we  have
       enabled  compression of output files using the -z option, in case you don't have much disk
       space available. This will make Axe slightly slower.

AXE USAGE

       NOTE:
          For arcane reasons, the name of the axe binary changed to axe-demux with version 0.3.0.
          Apologies  for  the  inconvenience, this was required to make axe installable in Debian
          and its derivatives. Command-line usage did not change.

       Axe has several usage modes. The primary distinction is between the two alternate indexing
       schemes,  single  and  combinatorial indexing. Single index matching is used when only the
       first read contains index sequences.  Combinatorial indexing is used when both reads in  a
       read pair contain independent (typically different) index sequences.

       For concise reference, the command-line usage of axe-demux is reproduced below:

          USAGE:
          axe-demux [-mzc2pt] -b (-f [-r] | -i) (-F [-R] | -I)
          axe-demux -h
          axe-demux -v

          OPTIONS:
              -m, --mismatch  Maximum hamming distance mismatch. [int, default 1]
              -z, --ziplevel  Gzip compression level, or 0 for plain text [int, default 0]
              -c, --combinatorial  Use combinatorial barcode matching. [flag, default OFF]
              -p, --permissive     Don't error on barcode mismatch confict, matching only
                                   exactly for conficting barcodes. [flag, default OFF]
              -2, --trim-r2   Trim barcode from R2 read as well as R1. [flag, default OFF]
              -b, --barcodes  Barcode file. See --help for example. [file]
              -f, --fwd-in    Input forward read. [file]
              -F, --fwd-out   Output forward read prefix. [file]
              -r, --rev-in    Input reverse read. [file]
              -R, --rev-out   Output reverse read prefix. [file]
              -i, --ilfq-in   Input interleaved paired reads. [file]
              -I, --ilfq-out  Output interleaved paired reads prefix. [file]
              -t, --table-file     Output a summary table of demultiplexing statistics to file. [file]
              -h, --help      Print this usage plus additional help.
              -V, --version   Print version string.
              -v, --verbose   Be more verbose. Additive, -vv is more vebose than -v.
              -q, --quiet          Be very quiet.

   Inputs and Outputs
       Regardless  of  read mode, three input and output schemes are supported: single-end reads,
       paired reads (separate R1 and R2 files) and interleaved paired reads (one  file,  with  R1
       and  R2  as  consecutive  reads). If single end reads are inputted, they must be output as
       single end reads. If either paired or interleaved paired  reads  are  read,  they  can  be
       output  as  either  paired  reads  or  interleaved  paired  reads.  This  applies  to both
       successfully de-multiplexed reads and reads that could not be de-multiplexed.

       The -z flag can  be  used  to  specify  that  outputs  should  be  compressed  using  gzip
       compression.  The -z flag takes an integer argument between 0 (the default) and 9, where 0
       indicates plain text output (gzopen mode "wT"),  and  1-9  indicate  that  the  respective
       compression level should be used, where 1 is fastest and 9 is most compact.

       The  output  flags should be prefixes that are used to generate the output file name based
       on the index's (or index pair's) ID. The names are generated as: prefix + _ + index ID + _
       +  read number + .extension.  The output file for reads that could not be demultiplexed is
       prefix + _ + unknown + _ + read number + .extension.  The read number  is  omitted  unless
       the  paired read file scheme is used, and is "il" for interleaved output. The extension is
       "fastq"; ".gz" is appended to the extension if the -z flag is used.

       The corresponding CLI flags are:-f and -F: Single end or paired R1 file input and output respectively.

              • -r and -R: Paired R2 file input and output.

              • -i and -I: Interleaved paired input and output.

   The index file
       The index file is a tab-separated file with an optional header. It is  mandatory,  and  is
       always  supplied using the -b command line flag. The exact format is dependent on indexing
       mode, and is described further in the sections below. If a header is present,  the  header
       line  must  start with either Barcode or index, or it will be interpreted as a index line,
       leading to a parsing error. Any line  starting  with  ';'  or  '#'  is  ignored,  allowing
       comments to be added in line with indexes. Please ensure that the software used to produce
       the index uses ASCII encoding, and does not insert a Byte-order Mark (BoM)  as  many  text
       editors  can  silently  use  Unicode-based  encoding  schemes.  I  recommend  the  use  of
       LibreOffice Calc (part of a free and open source office suite) to generate  index  tables;
       Microsoft Excel can also be used.

   Mismatch level selection
       Independent  of  index  mode,  the -m flag is used to select the maximum allowable hamming
       distance between a read's prefix and a index to be considered as  a  match.  As  "mutated"
       indexes  must be unique, a hamming distance of one is the default as typically indexes are
       designed to differ by a hamming distance of at least two. Optionally, (using the -p flag),
       axe  will  allow selective mismatch levels, where, if clashes are observed, the index will
       only be matched exactly. This allows one to process datasets with indexes that don't  have
       a sufficiently high distance between them.

   Single index mode
       Single  index mode is the default mode of operation. Barcodes are matched against read one
       (hereafter the forward read), and the index is trimmed from only the forward read,  unless
       the  -2  command line flag is given, in which case a prefix the same length as the matched
       index is also trimmed from the second or reverse read. Note that sequence of  this  second
       read is not checked before trimming.

       In single index mode, the index file has two columns: Barcode and ID.

   Combinatorial index mode
       Combinatorial  index  mode is activated by giving the -c flag on the command line. Forward
       read indexes are matched against the forward read, and reverse read  indexes  are  matched
       against  the  reverse  read. The optimal indexes are selected independently, and the index
       pair is selected from these two indexes. The respective  indexes  are  trimmed  from  both
       reads; the -2 command line flag has no effect in combinatorial index mode.

       In  combinatorial index mode, the index file has three columns: Barcode1, Barcode2 and ID.
       Individual indexes can occur many times within the forward and reverse indexes, but  index
       pairs must be unique combinations.

   The Demultiplexing Statistics File
       The  -t  option  allows  the output of per-sample read counts to a tab-separated file. The
       file will have a header describing its format, and includes a line for reads  which  could
       not be demultiplexed.

AXE'S MATCHING ALGORITHM

       Axe  uses an algorithm based on longest-prefix-in-trie matching to match a variable length
       from the start of each read against a set of 'mutated' indexes.

   Hamming distance matching
       While for  most  applications  in  high-throughput  sequencing  hamming  distances  are  a
       frowned-upon  metric,  it  is  typical  for  HTS read indexes to be designed to tolerate a
       certain level of hamming mismatches. Given these sequences are short and  typically  occur
       at  the  5'  end  of  reads,  insertions  and deletions rarely need be considered, and the
       increased rate of assignment of reads with many errors is offset by the  risk  of  falsely
       assigning indexes to an incorrect sample. In any case, reads with more than 1-2 sequencing
       errors in their first several bases are likely to be poor  quality,  and  will  simply  be
       filtered out during downstream quality control.

   Hamming mismatch tries
       Typically,  reads  are  matched  to  a  set of indexes by calculating the hamming distance
       between the index, and the first l bases of a read for a index of length l. The  "correct"
       index  is  then selected by recording either the index with the lowest hamming distance to
       the read (competitive matching) or by simply accepting the  first  index  with  a  hamming
       distance  below  a  certain  threshold.   These  approaches  are both very computationally
       expensive, and can have  lower  accuracy  than  the  algorithm  I  propose.  Additionally,
       implementations   of   these  methods  rarely  handle  indexes  of  differing  length  and
       combinatorial indexing well, if at all.

       Central to Axe's algorithm is the concept of hamming-mismatch tries. A  trie  is  a  N-ary
       tree  for  an  N letter alphabet. In the case of high-throughput sequencing reads, we have
       the alphabet AGCT, corresponding to the four nucleotides of DNA, plus N, used to represent
       ambiguous  base  calls.  Instead of matching each index to each read, we pre-calculate all
       allowable sequences at each mismatch level, and  store  these  in  level-wise  tries.  For
       example,  to  match  to a hamming distance of 2, we create three tries: One containing all
       indexes, verbatim, and two tries where every sequence within a hamming distance of 1 and 2
       of  each  index respectively. Hereafter, these tries are referred to  as the 0, 1 and 2-mm
       tries, for a hamming distance (mismatch) of 0, 1 and 2. Then, we find the  longest  prefix
       in each sequence read in the 0mm trie. If this prefix is not a valid leaf in the 0mm trie,
       we find the longest prefix in the 1mm trie, and so on for all tries in ascending order. If
       no  prefix  of  the  read  is  a complete sequence in any trie, the read is assigned to an
       "non-indexed" output file.

       This algorithm ensures optimal index matching in many ways, but is also extremely fast. In
       situations  with  indexes of differing length, we ensure that the longest acceptable index
       at a given hamming distance is chosen; assuming that sequence is random after  the  index,
       the  probability  of false assignments using this method is low. We also ensure that short
       perfect matches are preferred to longer inexact  matches,  as  we  firstly  only  consider
       indexes  with no error, then 1 error, and so on. This ensures that reads with indexes that
       are followed by random sequence that happens to inexactly match a longer index in the  set
       are not falsely assigned to this longer index.

       The  speed  of  this algorithm is largely due to the constant time matching algorithm with
       respect to the number of  indexes  to  match.  The  time  taken  to  match  each  read  is
       proportional instead to the length of the indexes, as for a index of length l, at most l +
       1 trie level descents are required to find an  entry  in  the  trie.  As  this  length  is
       more-or-less  constant  and small, the overall complexity of axe's algorithm is O(n) for n
       reads, as opposed to O(nm) for n reads  and  m  indexes  as  is  typical  for  traditional
       matching algorithms

       • genindex

AUTHOR

       Kevin Murray

       2021, Kevin Murray