Provided by: swarm_2.2.2+dfsg-1_amd64 bug

NAME

       swarm — find clusters of nearly-identical nucleotide amplicons

SYNOPSIS

       swarm [ options ] filename

DESCRIPTION

       Environmental  or  clinical  molecular studies generate large volumes of amplicons (e.g., 16S or 18S SSU-
       rRNA sequences) that need to be clustered into  molecular  operational  taxonomic  units  (OTUs).  Common
       clustering  methods  are  based  on greedy, input-order dependent algorithms, with arbitrary selection of
       global cluster size and cluster centroids. To address that problem, we developed swarm, a fast and robust
       method that recursively groups amplicons with d or less differences (i.e.  substitutions,  insertions  or
       deletions).  swarm produces natural and stable clusters centered on local peaks of abundance, mostly free
       from input-order dependency induced by centroid selection.

       Exact clustering is impractical on large data sets when using a naïve all-vs-all approach (more precisely
       a 2-combination without repetitions), as it implies unrealistic numbers of pairwise comparisons. swarm is
       based on a maximum number of differences d between two amplicons, and focuses only on  very  close  local
       relationships.  For d = 1, the default value, swarm uses an algorithm of linear complexity that generates
       all possible single mutations and performs exact-string matching by comparing hash-values. For d =  2  or
       greater,  swarm  uses  an algorithm of quadratic complexity that performs pairwise string comparisons. An
       efficient k-mer-based filtering and an astute use of comparisons results obtained during  the  clustering
       process  allows  swarm  to avoid most of the amplicon comparisons needed in a naïve approach. To speed up
       the remaining amplicon comparisons, swarm implements an extremely fast Needleman-Wunsch algorithm  making
       use  of  the  Streaming  SIMD  Extensions  (SSE2)  of  modern  x86-64  CPUs. If SSE2 instructions are not
       available, swarm exits with an error message.

       swarm can read nucleotide amplicons in fasta format from a normal file or from the standard input  (using
       a  pipe  or  a  redirection).  The amplicon identifier is defined as the string comprised between the '>'
       symbol and the first space or the end of the line, whichever comes  first.  As  swarm  outputs  lists  of
       amplicon  identifiers,  amplicon identifiers must be unique to avoid ambiguity; swarm exits with an error
       message if identifiers are not unique.  Amplicon identifiers must end with a '_' followed by  a  positive
       integer representing the amplicon copy number (or abundance annotation; usearch/vsearch users can use the
       option  -z  to  use the ';size=' annotation). Abundance annotations play a crucial role in the clustering
       process, and swarm exits with an error message  if  that  information  is  not  available.  The  amplicon
       sequence  is  defined as a string of [ACGT] or [ACGU] symbols (case insensitive, 'U' is replaced with 'T'
       internally), starting after the end of the identifier line and ending before the next identifier line  or
       the  file  end;  swarm silently removes newline symbols ('\n' or '\r') and exits with an error message if
       any other symbol is present.

   General options
       -h, --help
                display this help and exit successfully.

       -t, --threads positive integer
                number of computation threads to use. Values between 1 and 256 are accepted, but we recommend to
                use a number of threads lesser or equal to the number of available CPU cores. Default number  of
                threads is 1.

       -v, --version
                output version information and exit successfully.

       --       delimit  the  option  list.  Later arguments, if any, are treated as operands even if they begin
                with '-'. For example, 'swarm -- -file.fasta' reads from the file '-file.fasta'.

   Clustering options
       -d, --differences zero or positive integer
                maximum number of differences allowed between two amplicons, meaning that two amplicons will  be
                grouped  if  they  have integer (or less) differences. This is swarm's most important parameter.
                The number of differences is calculated as the number of mismatches  (substitutions,  insertions
                or  deletions)  between  the  two  amplicons once the optimal pairwise global alignment has been
                found (see 'pairwise alignment advanced options' to influence that step).  Any integer from 0 to
                255 can be used, but high d values will decrease the taxonomical resolution  of  swarm  results.
                Commonly used d values are 1, 2 or 3, rarely higher. When using d = 0, swarm will output results
                corresponding  to  a  strict  dereplication  of  the  dataset, i.e. merging identical amplicons.
                Warning, whatever the d value, swarm requires fasta entries to present abundance values. Default
                number of differences d is 1.

       -n, --no-otu-breaking
                deactivate the built-in OTU refinement (not recommended). Amplicon abundance values are used  to
                identify  transitions  among  in-contact  OTUs  and to separate them, yielding higher-resolution
                clustering results. That option prevents that separation, and in practice, allows  the  creation
                of  a link between amplicons A and B, even if the abundance of B is higher than the abundance of
                A.

   Fastidious options
       -b, --boundary positive integer
                when using the option --fastidious (-f), define the minimum mass of a large OTU. By default,  an
                OTU  with  a mass of 3 or more is considered large. Conversely, an OTU is small if it has a mass
                of less than 3, meaning that it is composed of either  one  amplicon  of  abundance  2,  or  two
                amplicons  of  abundance  1.  Any  positive  value greater than 1 can be specified. Using higher
                boundary values will speed up the second pass, but also reduce  the  taxonomical  resolution  of
                swarm results. Default mass of a large OTU is 3.

       -c, --ceiling positive integer
                when using the option --fastidious (-f), define swarm's maximum memory footprint (in megabytes).
                swarm  will  adjust  the --bloom-bits (-y) value of the Bloom filter to fit within the specified
                amount of memory. The value must be at least 3.

       -f, --fastidious
                when working with d = 1, perform a second clustering pass to reduce the  number  of  small  OTUs
                (recommended  option). During the first clustering pass, an intermediate amplicon can be missing
                for purely stochastic reasons, interrupting the aggregation process. The fastidious option  will
                create  virtual  amplicons, allowing to graft small OTUs upon bigger ones. By default, an OTU is
                small if it has a mass of 2 or less (see the --boundary option to modify that value).  To  speed
                things  up,  swarm  uses  a  Bloom  filter  to  store  intermediate results. Warning, the second
                clustering pass can be 2 to 3 times slower than the first pass and requires much more memory  to
                store  the  virtual  amplicons  in Bloom filters. See the options --bloom-bits (-y) or --ceiling
                (-c) to control the memory footprint  of  the  Bloom  filter.  The  fastidious  option  modifies
                clustering  results:  the  output  files produced by the options --log (-l), --output-file (-o),
                --mothur (-r), --uclust-file, and --seeds (-w) are updated to reflect these  modifications;  the
                file  --statistics-file  (-s) is partially updated (columns 6 and 7 are not updated); the output
                file --internal-structure (-i) is partially updated (column 5 is not updated for amplicons  that
                belonged to the small OTU).

       -y, --bloom-bits positive integer
                when  using  the  option --fastidious (-f), define the size (in bits) of each entry in the Bloom
                filter. That option allows to balance the efficiency (i.e. speed) and the  memory  footprint  of
                the  Bloom  filter. Large values will make the Bloom filter more efficient but will require more
                memory. Any value between 2 and 64 can be used. Default value is  16.  See  the  --ceiling  (-c)
                option for an alternative way to control the memory footprint.

   Input/output options
       -a, --append-abundance positive integer
                set  abundance  value  to use when some or all amplicons in the input file lack abundance values
                (_integer, or ;size=integer; when using -z). Warning, it is not  recommended  to  use  swarm  on
                datasets  where  abundance  values  are  all  identical. We provide that option as a courtesy to
                advanced users, please use it carefully. swarm exits with an error message if  abundance  values
                are missing and if this option is not used.

       -i, --internal-structure filename
                output  all  pairs  of nearly-identical amplicons to filename using a five-columns tab-delimited
                format:

                       1.  amplicon A label.

                       2.  amplicon B label.

                       3.  number of differences between amplicons A and B (positive integer).

                       4.  OTU number (positive integer). OTUs are  numbered  in  their  order  of  delineation,
                           starting  from  1.  All pairs of amplicons belonging to the same OTU will receive the
                           same number.

                       5.  cummulated number of steps from the OTU seed to amplicon B (positive  integer).  When
                           using  the  option  --fastidious  (-f),  the  actual  number of steps between grafted
                           amplicons and the OTU seed cannot be re-computed efficiently and is always set  to  2
                           for  the  amplicon  pair  linking  the small OTU to the big OTU. Cummulated number of
                           steps in the small OTU (if any) are left unchanged.

       -l, --log filename
                output all messages to filename instead of standard error, with the exception of error  messages
                of  course.  That  option is useful in situations where writing to standard error is problematic
                (for example, with certain job schedulers).

       -o, --output-file filename
                output clustering results to filename. Results consist of a list of OTUs, one OTU per  line.  An
                OTU is a list of amplicon identifiers separated by spaces. That output format can be modified by
                the option --mothur (-r). Default is to write to standard output.

       -r, --mothur
                output  clustering  results  in  a  format  compatible with Mothur. That option modifies swarm's
                default output format.

       -s, --statistics-file filename
                output statistics to filename. The file is a tab-separated table with one OTU per row and  seven
                columns of information:

                       1.  number of unique amplicons in the OTU,

                       2.  total abundance of amplicons in the OTU,

                       3.  identifier of the initial seed,

                       4.  initial seed abundance,

                       5.  number of amplicons with an abundance of 1 in the OTU,

                       6.  maximum number of iterations before the OTU reached its natural limit,

                       7.  cummulated  number  of  steps  along  the  path  joining the seed and the furthermost
                           amplicon in the OTU. Please note that the actual number of  differences  between  the
                           seed  and  the  furthermost  amplicon  is usually much smaller. When using the option
                           --fastidious (-f), grafted amplicons are not taken into account.

       -u, --uclust-file filename
                output clustering results in filename using a tab-separated uclust-like format with  10  columns
                and  3 different type of entries (S, H or C). That option does not modify swarm's default output
                format. Each fasta sequence in the input file can be either a cluster centroid (S) or a hit  (H)
                assigned to a cluster. Cluster records (C) summarize information (size, centroid label) for each
                cluster. Column content varies with the type of entry (S, H or C):

                       1.  Record type: S, H, or C.

                       2.  Cluster number (zero-based).

                       3.  Centroid length (S), query length (H), or cluster size (C).

                       4.  Percentage of similarity with the centroid sequence (H), or set to '*' (S, C).

                       5.  Match orientation + or - (H), or set to '*' (S, C).

                       6.  Not used, always set to '*' (S, C) or to zero (H).

                       7.  Not used, always set to '*' (S, C) or to zero (H).

                       8.  set  to  '*' (S, C) or, for H, compact representation of the pairwise alignment using
                           the CIGAR format (Compact  Idiosyncratic  Gapped  Alignment  Report):  M  (match),  D
                           (deletion)  and  I  (insertion).  The  equal  sign  '='  indicates  that the query is
                           identical to the centroid sequence.

                       9.  Label of the query sequence (H), or of the centroid sequence (S, C).

                       10. Label of the centroid sequence (H), or set to '*' (S, C).

       -w, --seeds filename
                output OTU representatives to filename  in  fasta  format.  The  abundance  value  of  each  OTU
                representative is the sum of the abundances of all the amplicons in the OTU.

       -z, --usearch-abundance
                accept  amplicon  abundance  values  in  usearch/vsearch's  style (>label;size=integer[;]). That
                option influences the abundance annotation style used in output files.

   Pairwise alignment advanced options
       when using d > 1, swarm recognizes advanced command-line options modifying the pairwise global  alignment
       scoring parameters:

              -m, --match-reward positive integer
                       Default reward for a nucleotide match is 5.

              -p, --mismatch-penalty positive integer
                       Default penalty for a nucleotide mismatch is 4.

              -g, --gap-opening-penalty positive integer
                       Default gap opening penalty is 12.

              -e, --gap-extension-penalty positive integer
                       Default gap extension penalty is 4.

       As  swarm focuses on close relationships (e.g., d = 2 or 3), clustering results are resilient to pairwise
       alignment model parameters modifications. When  clustering  using  a  higher  d  value,  modifying  model
       parameters has a stronger impact.

EXAMPLES

       Clusterize  the data set myfile.fasta into OTUs with the finest resolution possible (1 difference, built-
       in breaking, fastidious option) using 4 computation threads. OTUs are written to the file  myfile.swarms,
       and OTU representatives are written to myfile.representatives.fasta.

              swarm -t 4 -f -w myfile.representatives.fasta < myfile.fasta > myfile.swarms

AUTHORS

       Concept by Frédéric Mahé, implementation by Torbjørn Rognes.

CITATION

       Mahé  F, Rognes T, Quince C, de Vargas C, Dunthorn M. (2014) Swarm: robust and fast clustering method for
       amplicon-based studies.  PeerJ 2:e593 <https://doi.org/10.7717/peerj.593>

       Mahé F, Rognes T, Quince C, de Vargas C, Dunthorn M. (2015) Swarm v2: highly-scalable and high-resolution
       amplicon clustering.  PeerJ 3:e1420 <https://doi.org/10.7717/peerj.1420>

REPORTING BUGS

       Submit suggestions and bug-reports at <https://github.com/torognes/swarm/issues>, send a pull request  on
       <https://github.com/torognes/swarm>,  or  compose  a  friendly  or  curmudgeonly  e-mail to Frédéric Mahé
       <mahe@rhrk.uni-kl.de> and Torbjørn Rognes <torognes@ifi.uio.no>.

AVAILABILITY

       Source code and binaries are available at <https://github.com/torognes/swarm>

COPYRIGHT

       Copyright (C) 2012-2017 Frédéric Mahé & Torbjørn Rognes

       This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero
       General Public License as published by the Free Software Foundation, either version 3 of the License,  or
       any later version.

       This  program  is  distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even
       the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU  Affero  General
       Public License for more details.

       You  should  have  received  a copy of the GNU Affero General Public License along with this program.  If
       not, see <http://www.gnu.org/licenses/>.

SEE ALSO

       swipe, an extremely  fast  Smith-Waterman  database  search  tool  by  Torbjørn  Rognes  (available  from
       <https://github.com/torognes/swipe>).

       vsearch,  an  open-source re-implementation of the classic uclust clustering method (by Robert C. Edgar),
       along with other amplicon filtering and searching tools. vsearch is implemented by  Torbjørn  Rognes  and
       documented by Frédéric Mahé, and is available at <https://github.com/torognes/vsearch>.

VERSION HISTORY

       New features and important modifications of swarm (short lived or minor bug releases are not mentioned):

              v2.2.2 released December 12, 2017
                     Version  2.2.2  fixes  a bug that would cause Swarm to wait forever in very rare cases when
                     multiple threads were used.

              v2.2.1 released October 27, 2017
                     Version 2.2.1 fixes a memory allocation bug for d = 1 and duplicated sequences.

              v2.2.0 released October 17, 2017
                     Version 2.2.0 fixes several problems and improves usability. Corrected output to  structure
                     and  uclust  files  when  using  fastidious mode. Corrected abundance output in some cases.
                     Added check for duplicated sequences and fixed check for duplicated  sequence  IDs.  Checks
                     for  empty  sequences.  Sorts sequences by additional fields to improve stability. Improves
                     compatibility with compilers and operating  systems.   Outputs  sequences  in  upper  case.
                     Allows  64-bit  abundances. Shows message when waiting for input from stdin. Improves error
                     messages and warnings. Improves checking of command line options.  Fixes  remaining  errors
                     reported by test suite. Updates documentation.

              v2.1.13 released March 8, 2017
                     Version 2.1.13 removes a bug with the progress bar when writing seeds.

              v2.1.12 released January 16, 2017
                     Version 2.1.12 removes a debugging message.

              v2.1.11 released January 16, 2017
                     Version  2.1.11  fixes  two bugs related to the SIMD implementation of alignment that might
                     result in incorrect alignments and scores.  The bug only applies when d > 1.

              v2.1.10 released December 22, 2016
                     Version 2.1.10 fixes two bugs related to gap penalties of alignments.  The  first  bug  may
                     lead  to  wrong  aligments  and  similarity percentages reported in UCLUST (.uc) files. The
                     second bug makes Swarm use a slightly higher gap  extension  penalty  than  specified.  The
                     default gap extension penalty used have actually been 4.5 instead of 4.

              v2.1.9 released July 6, 2016
                     Version 2.1.9 fixes errors when compiling with GCC version 6.

              v2.1.8 released March 11, 2016
                     Version  2.1.8  fixes  a  rare bug triggered when clustering extremely short undereplicated
                     sequences. Also, alignment parameters are not shown when d = 1.

              v2.1.7 released February 24, 2016
                     Version 2.1.7 fixes a bug in the output of seeds with the -w option when d > 1 that was not
                     properly fixed in version 2.1.6. It also handles ascii character #13 (CR)  in  FASTA  files
                     better. Swarm will now exit with status 0 if the -h or the -v option is specified. The help
                     text and some error messages have been improved.

              v2.1.6 released December 14, 2015
                     Version  2.1.6  fixes problems with older compilers that do not have the x86intrin.h header
                     file. It also fixes a bug in the output of seeds with the -w option when d > 1.

              v2.1.5 released September 8, 2015
                     Version 2.1.5 fixes minor bugs.

              v2.1.4 released September 4, 2015
                     Version 2.1.4 fixes minor bugs in the swarm algorithm used for d = 1.

              v2.1.3 released August 28, 2015
                     Version 2.1.3 adds checks of numeric option arguments.

              v2.1.1 released March 31, 2015
                     Version 2.1.1 fixes a bug with  the  fastidious  option  that  caused  it  to  ignore  some
                     connections between large and small OTUs.

              v2.1.0 released March 24, 2015
                     Version 2.1.0 marks the first official release of swarm v2.

              v2.0.7 released March 18, 2015
                     Version 2.0.7 writes abundance information in usearch style when using options -w (--seeds)
                     in combination with -z (--usearch-abundance).

              v2.0.6 released March 13, 2015
                     Version 2.0.6 fixes a minor bug.

              v2.0.5 released March 13, 2015
                     Version  2.0.5  improves  the  implementation  of the fastidious option and adds options to
                     control memory usage of the Bloom filter (-y and -c).  In addition, an option  (-w)  allows
                     to  output  OTU  representatives  sequences  with updated abundances (sum of all abundances
                     inside each OTU). This version also enables swarm to run with d = 0.

              v2.0.4 released March 6, 2015
                     Version 2.0.4 includes a fully parallelised implementation of the fastidious option.

              v2.0.3 released March 4, 2015
                     Version 2.0.3 includes a working implementation of the  fastidious  option,  but  only  the
                     initial clustering is parallelized.

              v2.0.2 released February 26, 2015
                     Version 2.0.2 fixes SSSE3 problems.

              v2.0.1 released February 26, 2015
                     Version  2.0.1  is  a  development  version  that  contains a partial implementation of the
                     fastidious option, but it is not usable yet.

              v2.0.0 released December 3, 2014
                     Version  2.0.0  is   faster   and   easier   to   use,   providing   new   output   options
                     (--internal-structure   and   --log),   new   control  options  (--boundary,  --fastidious,
                     --no-otu-breaking), and built-in OTU refinement (no need to use the python script anymore).
                     When using default parameters, a novel and  considerably  faster  algorithmic  approach  is
                     used, guaranteeing swarm's scalability.

              v1.2.21 released February 26, 2015
                     Version  1.2.21  is  supposed  to  fix  some  problems  related to the use of the SSSE3 CPU
                     instructions which are not always available.

              v1.2.20 released November 6, 2014
                     Version 1.2.20 presents a production-ready version of  the  alternative  algorithm  (option
                     -a), with optional built-in OTU breaking (option -n). That alternative algorithmic approach
                     (usable  only with d = 1) is considerably faster than currently used clustering algorithms,
                     and can deal with datasets of 100 million unique amplicons or  more  in  a  few  hours.  Of
                     course,  results  are  rigourously identical to the results previously produced with swarm.
                     That release also introduces new options to control swarm output (options -i and -l).

              v1.2.19 released October 3, 2014
                     Version 1.2.19  fixes  a  problem  related  to  abundance  information  when  the  sequence
                     identifier includes multiple underscore characters.

              v1.2.18 released September 29, 2014
                     Version 1.2.18 reenables the possibility of reading sequences from stdin if no file name is
                     specified on the command line. It also fixes a bug related to CPU features detection.

              v1.2.17 released September 28, 2014
                     Version 1.2.17 fixes a memory allocation bug introduced in version 1.2.15.

              v1.2.16 released September 27, 2014
                     Version 1.2.16 fixes a bug in the abundance sort introduced in version 1.2.15.

              v1.2.15 released September 27, 2014
                     Version  1.2.15  sorts the input sequences in order of decreasing abundance unless they are
                     detected to be sorted already. When using the alternative algorithm for d = 1 it also sorts
                     all subseeds in order of decreasing abundance.

              v1.2.14 released September 27, 2014
                     Version 1.2.14 fixes a bug in the output with the --swarm_breaker option  (-b)  when  using
                     the alternative algorithm (-a).

              v1.2.12 released August 18, 2014
                     Version  1.2.12  introduces  an  option  --alternative-algorithm  to use an extremely fast,
                     experimental clustering algorithm for the special case d = 1. Multithreading scalability of
                     the default algorithm has been noticeably improved.

              v1.2.10 released August 8, 2014
                     Version 1.2.10 allows amplicon abundances to be specified using the usearch  style  in  the
                     sequence header (e.g. '>id;size=1') when the -z option is chosen.

              v1.2.8 released August 5, 2014
                     Version  1.2.8  fixes an error with the gap extension penalty. Previous versions used a gap
                     penalty twice as large as intended. That bug correction induces small changes in clustering
                     results.

              v1.2.6 released May 23, 2014
                     Version 1.2.6 introduces an option --mothur  to  output  clustering  results  in  a  format
                     compatible   with   the   microbial   ecology  community  analysis  software  suite  Mothur
                     (<http://www.mothur.org/>).

              v1.2.5 released April 11, 2014
                     Version 1.2.5 removes the need for a POPCNT hardware instruction to be present.  swarm  now
                     automatically  checks  whether  POPCNT  is  available  and  uses a slightly slower software
                     implementation if not. Only basic SSE2 instructions are now required to run swarm.

              v1.2.4 released January 30, 2014
                     Version 1.2.4 introduces an option --break-swarms to output all pairs of amplicons  with  d
                     differences   to   standard   error.   That   option   is  used  by  the  companion  script
                     `swarm_breaker.py` to refine swarm results. The syntax  of  the  inline  assembly  code  is
                     changed for compatibility with more compilers.

              v1.2 released May 16, 2013
                     Version  1.2  greatly improves speed by using alignment-free comparisons of amplicons based
                     on k-mer word content. For each amplicon, the presence-absence of all  possible  5-mers  is
                     computed  and  recorded  in  a  1024-bits vector. Vector comparisons are extremely fast and
                     drastically reduce the number of costly  pairwise  alignments  performed  by  swarm.  While
                     remaining  exact,  swarm 1.2 can be more than 100-times faster than swarm 1.1, when using a
                     single thread with a large set of  sequences.  The  minor  version  1.1.1,  published  just
                     before,  adds  compatibility  with  Apple  computers, and corrects an issue in the pairwise
                     global alignment step that could lead to sub-optimal alignments.

              v1.1 released February 26, 2013
                     Version 1.1 introduces two new important options:  the  possibility  to  output  clustering
                     results  using  the uclust output format, and the possibility to output detailed statistics
                     on each OTU. swarm 1.1 is also faster: new filterings based on pairwise  amplicon  sequence
                     lengths  and  composition  comparisons  reduce the number of pairwise alignments needed and
                     speed up the clustering.

              v1.0 released November 10, 2012
                     First public release.

version 2.2.2                                   December 12, 2017                                       swarm(1)