Provided by: swarm_2.1.6-1_amd64 bug

NAME

       swarm — find clusters of nearly-identical nucleotide amplicons

SYNOPSIS

       swarm [ options ] filename

DESCRIPTION

       Environmental  or  clinical  molecular studies generate large volumes of amplicons (e.g., 16S or 18S SSU-
       rRNA sequences) that need to be clustered into  molecular  operational  taxonomic  units  (OTUs).  Common
       clustering  methods  are  based  on greedy, input-order dependent algorithms, with arbitrary selection of
       global cluster size and cluster centroids. To address that problem, we developed swarm, a fast and robust
       method that recursively groups amplicons with d or less differences. swarm produces  natural  and  stable
       clusters  centered  on  local  peaks  of  abundance,  free  from  centroid  selection induced input-order
       dependency.

       Exact clustering is impractical on large data sets when using a naïve all-vs-all approach (more precisely
       a 2-combination without repetitions), as it implies unrealistic numbers of pairwise comparisons. swarm is
       based on a maximum number of differences d between two amplicons, and focuses only on  very  close  local
       relationships.  For  d  =  1  (default value), swarm uses an algorithm of linear complexity that performs
       exact-string matching by comparing hash-values. For d  =  2  or  greater,  swarm  uses  an  algorithm  of
       quadratic complexity that performs pairwise string comparisons. An efficient k-mer-based filtering and an
       astute  use  of  comparisons  results  obtained during the clustering process allows to avoid most of the
       amplicon comparisons needed in a naïve approach. To speed up the remaining  amplicon  comparisons,  swarm
       implements  an  extremely  fast  Needleman-Wunsch  algorithm  making use of the Streaming SIMD Extensions
       (SSE2) of modern x86-64 CPUs. If SSE2 instructions are not available, swarm exits with an error message.

       swarm reads the named input filename, a fasta file of nucleotide amplicons. The  amplicon  identifier  is
       defined  as  the  string  comprised  between  the  ">" symbol and the first space or the end of the line,
       whichever comes first. As swarm outputs lists of  amplicon  identifiers,  amplicon  identifiers  must  be
       unique  to  avoid  ambiguity;  swarm  exits with an error message if identifiers are not unique. Amplicon
       identifiers must end with a "_" followed by a positive integer representing the amplicon copy number  (or
       abundance  annotation;  usearch/vsearch  users  can use the option -z to change that behavior). Abundance
       annotations play a crucial role in the clustering process, and swarm exits with an error message if  that
       information  is  not  available. The amplicon sequence is defined as a string of [acgt] or [acgu] symbols
       (case insensitive), starting after the end of the identifier line and ending before the  next  identifier
       line or the file end; swarm exits with an error message if any other symbol is present.

   General options
       -b, --boundary positive integer
                when  using  the  option --fastidious (-f), define the minimum mass of a large OTU as the number
                given with this option. The default value is 3, indicating that any OTU with mass 3 or  more  is
                considered  "large".   By default, an OTU is "small" if it has a mass of 2 or less, meaning that
                it is composed of either one amplicon of abundance 2, or  two  amplicons  of  abundance  1.  Any
                positive  value  greater than 1 can be specified. Using higher boundary values will speed up the
                second pass, but also reduce the taxonomical resolution of swarm results.

       -c, --ceiling positive integer
                when using the option --fastidious (-f), define swarm's maximum memory footprint (in megabytes).
                swarm will adjust the --bloom-bits (-y) value of the Bloom filter to fit  within  the  specified
                amount of memory. That option is not active by default.

       -d, --differences zero or positive integer
                maximum  number of differences allowed between two amplicons, meaning that two amplicons will be
                grouped if they have integer (or less) differences. This is swarm's  most  important  parameter.
                The  number  of differences is calculated as the number of mismatches (substitutions, insertions
                or deletions) between the two amplicons once the optimal  pairwise  global  alignment  has  been
                found  (see "pairwise alignment advanced options" to influencing that step). Any integer between
                0 and 256 can be used, but high d values will  decrease  the  taxonomical  resolution  of  swarm
                results.  Commonly  used  d  values  are  1, 2 or 3, rarely higher. When using d = 0, swarm will
                output results corresponding to a strict dereplication of the dataset,  i.e.  merging  identical
                amplicons.  Warning,  swarm  still  requires  fasta entries to present abundance values. Default
                number of differences is 1.

       -f, --fastidious
                when working with d = 1, perform a second clustering pass to reduce the  number  of  small  OTUs
                (recommended  option). During the clustering process with d = 1, an intermediate amplicon can be
                missing for purely stochastic reasons, interrupting the aggregation process.  That  option  will
                create  virtual  amplicons, allowing to graft small OTUs upon bigger ones. By default, an OTU is
                "small" if it has a mass of 2 or less (see the --boundary option to  increase  that  value).  To
                speed  things  up, swarm uses a Bloom filter to store intermediate results. Warning, that second
                pass can be 2 to 3 times slower than the first pass and  requires  much  more  memory.  See  the
                options --bloom-bits (-y) or --ceiling (-c) to control the memory footprint of the Bloom filter.
                Warning,  the  fastidious  option  modifies clustering results. The output files produced by the
                options --log (-l), --output-file (-o), --mothur  (-r),  --uclust-file,  and  --seeds  (-w)  are
                updated  to  reflect  these  modifications; the file --statistics-file (-s) is partially updated
                (columns 6 and 7 are not updated); the output file --internal-structure (-i) is not updated.

       -h, --help
                display this help and exit.

       -n, --no-otu-breaking
                deactivate the built-in OTU refinement (not recommended). Amplicon abundance values are used  to
                identify  transitions  among  in-contact  OTUs  and to separate them, yielding higher-resolution
                clustering results. That option prevents that separation, and in practice, allows  the  creation
                of  a link between amplicons A and B, even if the abundance of B is higher than the abundance of
                A.

       -t, --threads positive integer
                number of computation threads to use. The number of threads should be lesser  or  equal  to  the
                number of available CPU cores. Default number of threads is 1.

       -v, --version
                output version information and exit.

       -y, --bloom-bits positive integer
                when  using  the  option --fastidious (-f), define the size (in bits) of each entry in the Bloom
                filter. That option allows to balance the efficiency (i.e. speed) and the  memory  footprint  of
                the  Bloom  filter. Large values will make the Bloom filter more efficient but will require more
                memory. Any value between 4 and 20 can be used. Default value is  16.  See  the  --ceiling  (-c)
                option for an alternative way to control the memory footprint.

   Input/output options
       -a, --append-abundance positive integer
                set  abundance  value to use when some or all amplicons in the input file lack abundance values.
                Warning, it is not recommended  to  use  swarm  on  datasets  where  abundance  values  are  all
                identical.  We  provide  that  option  as a courtesy to advanced users, please use it carefully.
                swarm exits with an error message if abundance values are missing and  if  this  option  is  not
                used.

       -i, --internal-structure filename
                output  all  pairs  of nearly-identical amplicons to filename using a five-columns tab-delimited
                format:

                       1.  amplicon A label.

                       2.  amplicon B label.

                       3.  number of differences between amplicons A and B (positive integer).

                       4.  OTU number (positive integer). OTUs are  numbered  in  their  order  of  delineation,
                           starting  from  1.  All pairs of amplicons belonging to the same OTU will receive the
                           same number.

                       5.  number of steps from the OTU seed to amplicon B (positive integer).

       -l, --log filename
                output all messages to filename instead of standard error, with the exception of error  messages
                of  course.  That  option is useful in situations where writing to standard error is problematic
                (for example, with certain job schedulers).

       -o, --output-file filename
                output clustering results to filename. Results consist of a list of OTUs, one OTU per  line.  An
                OTU  is  a  list  of  amplicon  identifiers separated by spaces. Default is to write to standard
                output.

       -r, --mothur
                output clustering results in a format compatible  with  Mothur.  That  option  modifies  swarm's
                default output format.

       -s, --statistics-file filename
                output  statistics to filename. The file is a tab-separated table with one OTU per row and seven
                columns of information:

                       1.  number of unique amplicons in the OTU,

                       2.  total copy number of amplicons in the OTU,

                       3.  identifier of the initial seed,

                       4.  initial seed copy number,

                       5.  number of amplicons with a copy number of 1 in the OTU,

                       6.  maximum number of iterations before the OTU reached its natural limits),

                       7.  theoretical maximum radius of the OTU (i.e., number of cummulated differences between
                           the seed and the furthermost amplicon in the OTU). The actual maximum radius  of  the
                           OTU is often much smaller.

       -u, --uclust-file filename
                output clustering results in uclust-like file format to the specified file. That option does not
                modify swarm's default output format.

       -w, --seeds filename
                output   OTU  representatives  to  filename  in  fasta  format.  The  abundance  value  of  each
                representative is the sum of the abundances of all the amplicons in the OTU.

       -z, --usearch-abundance
                accept amplicon abundance  values  in  usearch/vsearch's  style  (>label;size=integer[;]).  That
                option influences the abundance annotation style used in output files.

   Pairwise alignment advanced options
       when  using d > 1, swarm recognizes advanced command-line options modifying the pairwise global alignment
       scoring parameters:

              -m, --match-reward positive integer
                       set the reward for a nucleotide match. Default is 5.

              -p, --mismatch-penalty positive integer
                       set the penalty for a nucleotide mismatch. Default is 4.

              -g, --gap-opening-penalty positive integer
                       set the gap open penalty. Default is 12.

              -e, --gap-extension-penalty positive integer
                       set the gap extension penalty. Default is 4.

       As swarm focuses on close relationships (i.e. d = 2 or 3), clustering results are resilient  to  pairwise
       alignment  model  parameters  modifications.  Modifying  model  parameters  has  a  stronger  impact when
       clustering using a higher d value.

EXAMPLES

       Clusterize the data set myfile.fasta into OTUs with the finest resolution possible (1 difference,  built-
       in  breaking, fastidious option) using 4 computation threads. OTUs are written to the file myfile.swarms,
       and OTU representatives are written to myfile.representatives.fasta.

              swarm -t 4 -f -w myfile.representatives.fasta < myfile.fasta > myfile.swarms

AUTHORS

       Concept by Frédéric Mahé, implementation by Torbjørn Rognes.

CITATION

       Mahé F, Rognes T, Quince C, de Vargas C, Dunthorn M. (2014) Swarm: robust and fast clustering method  for
       amplicon-based studies.  PeerJ 2:e593 <http://dx.doi.org/10.7717/peerj.593>

       Mahé F, Rognes T, Quince C, de Vargas C, Dunthorn M. (2015) Swarm v2: highly-scalable and high-resolution
       amplicon clustering.  PeerJ 3:e1420 <http://dx.doi.org/10.7717/peerj.1420>

REPORTING BUGS

       Submit  suggestions and bug-reports at <https://github.com/torognes/swarm/issues>, send a pull request on
       <https://github.com/torognes/swarm>, or compose a  friendly  or  curmudgeonly  e-mail  to  Frédéric  Mahé
       <mahe@rhrk.uni-kl.de> and Torbjørn Rognes <torognes@ifi.uio.no>.

AVAILABILITY

       The software is available from <https://github.com/torognes/swarm>

COPYRIGHT

       Copyright (C) 2012, 2013, 2014, 2015 Frédéric Mahé & Torbjørn Rognes

       This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero
       General  Public License as published by the Free Software Foundation, either version 3 of the License, or
       any later version.

       This program is distributed in the hope that it will be useful, but WITHOUT ANY  WARRANTY;  without  even
       the  implied  warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General
       Public License for more details.

       You should have received a copy of the GNU Affero General Public License along  with  this  program.   If
       not, see <http://www.gnu.org/licenses/>.

SEE ALSO

       swipe,  an  extremely  fast  Smith-Waterman  database  search  tool  by  Torbjørn  Rognes (available from
       <https://github.com/torognes/swipe>).

       vsearch, an open-source re-implementation of the classic uclust clustering method (by Robert  C.  Edgar),
       along  with  other  amplicon filtering and searching tools. vsearch is implemented by Torbjørn Rognes and
       documented by Frédéric Mahé, and is available at <https://github.com/torognes/vsearch>.

VERSION HISTORY

       New features and important modifications of swarm (short lived or minor bug releases are not mentioned):

              v2.1.6 released December 14, 2015
                     Version 2.1.6 fixes problems with older compilers that do not have the  x86intrin.h  header
                     file. It also fixes a bug in the output of seeds with the `-w` option when d>1.

              v2.1.5 released September 8, 2015
                     Version 2.1.5 fixes minor bugs.

              v2.1.4 released September 4, 2015
                     Version 2.1.4 fixes minor bugs in the swarm algorithm used for d = 1.

              v2.1.3 released August 28, 2015
                     Version 2.1.3 adds checks of numeric option arguments.

              v2.1.1 released March 31, 2015
                     Version  2.1.1  fixes  a  bug  with  the  fastidious  option  that caused it to ignore some
                     connections between large and small OTUs.

              v2.1.0 released March 24, 2015
                     Version 2.1.0 marks the first official release of swarm v2.

              v2.0.7 released March 18, 2015
                     Version 2.0.7 writes abundance information in usearch style when using options -w (--seeds)
                     in combination with -z (--usearch-abundance).

              v2.0.6 released March 13, 2015
                     Version 2.0.6 fixes a minor bug.

              v2.0.5 released March 13, 2015
                     Version 2.0.5 improves the implementation of the fastidious  option  and  adds  options  to
                     control  memory  usage of the Bloom filter (-y and -c).  In addition, an option (-w) allows
                     to output OTU representatives sequences with updated  abundances  (sum  of  all  abundances
                     inside each OTU). This version also enables swarm to run with d = 0.

              v2.0.4 released March 6, 2015
                     Version 2.0.4 includes a fully parallelised implementation of the fastidious option.

              v2.0.3 released March 4, 2015
                     Version  2.0.3  includes  a  working  implementation of the fastidious option, but only the
                     initial clustering is parallelized.

              v2.0.2 released February 26, 2015
                     Version 2.0.2 fixes SSSE3 problems.

              v2.0.1 released February 26, 2015
                     Version 2.0.1 is a development version  that  contains  a  partial  implementation  of  the
                     fastidious option, but it is not usable yet.

              v2.0.0 released December 3, 2014
                     Version   2.0.0   is   faster   and   easier   to   use,   providing   new  output  options
                     (--internal-structure  and  --log),  new   control   options   (--boundary,   --fastidious,
                     --no-otu-breaking), and built-in OTU refinement (no need to use the python script anymore).
                     When  using  default  parameters,  a  novel and considerably faster algorithmic approach is
                     used, guaranteeing swarm's scalability.

              v1.2.21 released February 26, 2015
                     Version 1.2.21 is supposed to fix some problems  related  to  the  use  of  the  SSSE3  CPU
                     instructions which are not always available.

              v1.2.20 released November 6, 2014
                     Version  1.2.20  presents  a  production-ready version of the alternative algorithm (option
                     -a), with optional built-in OTU breaking (option -n). That alternative algorithmic approach
                     (usable only with d = 1) is considerably faster than currently used clustering  algorithms,
                     and  can  deal  with  datasets  of  100 million unique amplicons or more in a few hours. Of
                     course, results are rigourously identical to the results previously  produced  with  swarm.
                     That release also introduces new options to control swarm output (options -i and -l).

              v1.2.19 released October 3, 2014
                     Version  1.2.19  fixes  a  problem  related  to  abundance  information  when  the sequence
                     identifier includes multiple underscore characters.

              v1.2.18 released September 29, 2014
                     Version 1.2.18 reenables the possibility of reading sequences from stdin if no file name is
                     specified on the command line. It also fixes a bug related to CPU features detection.

              v1.2.17 released September 28, 2014
                     Version 1.2.17 fixes a memory allocation bug introduced in version 1.2.15.

              v1.2.16 released September 27, 2014
                     Version 1.2.16 fixes a bug in the abundance sort introduced in version 1.2.15.

              v1.2.15 released September 27, 2014
                     Version 1.2.15 sorts the input sequences in order of decreasing abundance unless  they  are
                     detected to be sorted already. When using the alternative algorithm for d = 1 it also sorts
                     all subseeds in order of decreasing abundance.

              v1.2.14 released September 27, 2014
                     Version  1.2.14  fixes  a bug in the output with the --swarm_breaker option (-b) when using
                     the alternative algorithm (-a).

              v1.2.12 released August 18, 2014
                     Version 1.2.12 introduces an option  --alternative-algorithm  to  use  an  extremely  fast,
                     experimental clustering algorithm for the special case d = 1. Multithreading scalability of
                     the default algorithm has been noticeably improved.

              v1.2.10 released August 8, 2014
                     Version  1.2.10  allows  amplicon abundances to be specified using the usearch style in the
                     sequence header (e.g. ">id;size=1") when the -z option is chosen.

              v1.2.8 released August 5, 2014
                     Version 1.2.8 fixes an error with the gap extension penalty. Previous versions used  a  gap
                     penalty twice as large as intended. That bug correction induces small changes in clustering
                     results.

              v1.2.6 released May 23, 2014
                     Version  1.2.6  introduces  an  option  --mothur  to  output clustering results in a format
                     compatible  with  the  microbial  ecology  community   analysis   software   suite   Mothur
                     (<http://www.mothur.org/>).

              v1.2.5 released April 11, 2014
                     Version  1.2.5  removes the need for a POPCNT hardware instruction to be present. swarm now
                     automatically checks whether POPCNT is  available  and  uses  a  slightly  slower  software
                     implementation if not. Only basic SSE2 instructions are now required to run swarm.

              v1.2.4 released January 30, 2014
                     Version  1.2.4  introduces an option --break-swarms to output all pairs of amplicons with d
                     differences  to  standard  error.  That  option   is   used   by   the   companion   script
                     `swarm_breaker.py`  to  refine  swarm  results.  The  syntax of the inline assembly code is
                     changed for compatibility with more compilers.

              v1.2 released May 16, 2013
                     Version 1.2 greatly improves speed by using alignment-free comparisons of  amplicons  based
                     on  k-mer  word  content. For each amplicon, the presence-absence of all possible 5-mers is
                     computed and recorded in a 1024-bits vector. Vector  comparisons  are  extremely  fast  and
                     drastically  reduce  the  number  of  costly  pairwise alignments performed by swarm. While
                     remaining exact, swarm 1.2 can be more than 100-times faster than swarm 1.1, when  using  a
                     single  thread  with  a  large  set  of  sequences. The minor version 1.1.1, published just
                     before, adds compatibility with Apple computers, and corrects  an  issue  in  the  pairwise
                     global alignment step that could lead to sub-optimal alignments.

              v1.1 released February 26, 2013
                     Version  1.1  introduces  two  new  important options: the possibility to output clustering
                     results using the uclust output format, and the possibility to output  detailed  statistics
                     on  each  OTU. swarm 1.1 is also faster: new filterings based on pairwise amplicon sequence
                     lengths and composition comparisons reduce the number of  pairwise  alignments  needed  and
                     speed up the clustering.

              v1.0 released November 10, 2012
                     First public release.

version 2.1.6                                   December 14, 2015                                       swarm(1)