bionic (1) vsearch.1.gz

Provided by: vsearch_2.7.1-1_amd64 bug

NAME

       vsearch  —  chimera  detection, clustering, dereplication and rereplication, FASTA/FASTQ file processing,
       masking, pairwise alignment, searching, shuffling, sorting and subsampling of amplicons for metagenomics,
       genomics, and population genetics.

SYNOPSIS

       Chimera detection:
              vsearch (--uchime_denovo | --uchime2_denovo | --uchime3_denovo) fastafile (--chimeras |
              --nonchimeras | --uchimealns | --uchimeout) outputfile [options]

              vsearch --uchime_ref fastafile (--chimeras | --nonchimeras | --uchimealns | --uchimeout)
              outputfile --db fastafile [options]

       Clustering:
              vsearch (--cluster_fast | --cluster_size | --cluster_smallmem | --cluster_unoise) fastafile
              (--alnout | --biomout | --blast6out | --centroids | --clusters | --mothur_shared_out | --msaout |
              --otutabout | --profile | --samout | --uc | --userout) outputfile --id real [options]

       Dereplication and rereplication:
              vsearch (--derep_fulllength | --derep_prefix) fastafile (--output | --uc) outputfile [options]

              vsearch --rereplicate fastafile --output outputfile [options]

       FASTA/FASTQ file processing:
              vsearch --fastq_chars fastqfile [options]

              vsearch --fastq_convert fastqfile --fastqout outputfile [options]

              vsearch (--fastq_eestats | --fastq_eestats2) fastqfile --output outputfile [options]

              vsearch --fastq_filter fastqfile (--fastaout | --fastaout_discarded | --fastqout |
              --fastqout_discarded) outputfile [options]

              vsearch --fastq_mergepairs fastqfile --reverse fastqfile (--fastaout | --fastqout |
              --fastaout_notmerged_fwd | --fastaout_notmerged_rev | --fastqout_notmerged_fwd |
              --fastqout_notmerged_rev | --eetabbedout) outputfile [options]

              vsearch --fastq_stats fastqfile [--log logfile] [options]

              vsearch --fastx_revcomp fastxfile (--fastaout | --fastqout) outputfile [options]

       Masking:
              vsearch --fastx_mask fastxfile (--fastaout | --fastqout) outputfile [options]

              vsearch --maskfasta fastafile --output outputfile [options]

       Pairwise alignment:
              vsearch --allpairs_global fastafile (--alnout | --blast6out | --matched | --notmatched | --samout
              | --uc | --userout) outputfile (--acceptall | --id real) [options]

       Searching:
              vsearch --search_exact fastafile --db fastafile (--alnout | --biomout | --blast6out |
              --mothur_shared_out | --otutabout | --samout | --uc | --userout) outputfile [options]

              vsearch --usearch_global fastafile --db fastafile (--alnout | --biomout | --blast6out |
              --mothur_shared_out | --otutabout | --samout | --uc | --userout) outputfile --id real [options]

       Shuffling and sorting:
              vsearch (--shuffle | --sortbylength | --sortbysize) fastafile --output outputfile [options]

       Subsampling:
              vsearch --fastx_subsample fastafile (--fastaout | --fastqout) outputfile (--sample_pct real |
              --sample_size positive integer) [options]

       UDB database handling:
              vsearch --makeudb_usearch fastafile --output outputfile [options]

              vsearch --udb2fasta udbfile --output outputfile [options]

              vsearch (--udbinfo | --udbstats) udbfile [options]

DESCRIPTION

       Environmental or clinical molecular diversity studies generate large volumes of amplicons (e.g.; SSU-rRNA
       sequences)  that  need  to  be checked for chimeras, dereplicated, masked, sorted, searched, clustered or
       compared to reference sequences. The aim of vsearch is to offer a all-in-one open source tool to  perform
       these  tasks,  using  optimized  algorithm  implementations  and  harvesting the full potential of modern
       computers, thus providing fast and accurate data processing.

       Comparing nucleotide sequences is at the core of vsearch. To speed up comparisons, vsearch implements  an
       extremely  fast  Needleman-Wunsch  algorithm,  making  use  of  the  Streaming  SIMD Extensions (SSE2) of
       post-2003 x86-64 CPUs.  If SSE2 instructions are not available, vsearch exits with an error  message.  On
       Power8  CPUs  it  will  use  AltiVec/VSX/VMX  instructions.  Memory usage increases rapidly with sequence
       length: for example comparing two sequences of length 1 kb requires  8  MB  of  memory  per  thread,  and
       comparing  two  10 kb sequences requires 800 MB of memory per thread. For comparisons involving sequences
       with a length product greater than 25 million (for example two sequences of length 5 kb), vsearch uses  a
       slower  alignment  method  described  by Hirschberg (1975) and Myers and Miller (1988), with much smaller
       memory requirements.

   Input
       vsearch accept as input fasta or fastq files containing one or  several  nucleotidic  entries.  In  fasta
       files,  each  nucleotidic  entry  is made of a header and a sequence. The header is defined as the string
       comprised between the '>' symbol and the first space, tab or the end of the line, whichever comes  first.
       Additionally,  if  the header matches integer as the number of occurrences (or abundance) of the sequence
       in the study. That abundance information  is  used  or  created  during  chimera  detection,  clustering,
       dereplication, sorting and searching.

       The  sequence  is  defined as a string of IUPAC symbols (ACGTURYSWKMDBHVN), starting after the end of the
       identifier line and ending before the next identifier line, or the file  end.  vsearch  silently  ignores
       ascii  characters  9  to 13, and exits with an error message if ascii characters 0 to 8, 14 to 31, '.' or
       '-' are present. All other ascii or non-ascii characters are stripped and complained about in  a  warning
       message.

       In  fastq files, each entry is made of sequence header starting with a symbol '@', a nucleotidic sequence
       (same rules as for fasta sequences), a quality header starting with a symbol '+' and a  string  of  ASCII
       characters  (offset  33  or 64), each one encoding the quality value of the corresponding position in the
       nucleotidic sequence.

       vsearch operations are case insensitive, except when soft masking is activated. Masking is  automatically
       applied  during chimera detection, clustering, masking, pairwise alignment and searching. Soft masking is
       specified with the options '--dbmask soft' (for searching and chimera  detection  with  a  reference)  or
       '--qmask  soft'  (for  searching,  de  novo  chimera  detection, clustering and masking). When using soft
       masking, lower case letters indicate masked symbols, while upper case letters indicate  regular  symbols.
       Masked symbols are never included in the unique index words used for sequence comparisons, otherwise they
       are treated as normal symbols.

       When comparing sequences during chimera detection, dereplication, searching and clustering, T and  U  are
       considered  identical, regardless of their case. If two symbols are not identical, their alignment result
       in a negative mismatch score  (default  -4),  except  if  one  or  both  of  the  symbols  are  ambiguous
       (RYSWKMDBHVN) in which case the score is zero. Alignment of two identical ambiguous symbols (for example,
       R vs R) also receives a score of zero.

       vsearch can read data from standard files and write to standard files, but it can also  read  from  pipes
       and  write to pipes! For example, multiple fasta files can be piped into vsearch for dereplication. To do
       so, file names can be replaced with:

              - the symbol '-', representing '/dev/stdin' for input files or '/dev/stdout' for output files,

              - a named pipe created with the command mkfifo,

              - a process substitution '<(command)' as input or '>(command)' as output.

       vsearch can automatically read compressed gzip or bzip2 files if the appropriate  libraries  are  present
       during  the  compilation.  vsearch  can  also  read  pipes streaming compressed gzip or bzip2 data if the
       options --gzip_decompress or --bzip2_decompress are selected. When reading  from  a  pipe,  the  progress
       indicator is not updated.

   Options
       vsearch  recognizes  a  large  number of command-line options. For easier navigation, options are grouped
       below by  theme  (chimera  detection,  clustering,  dereplication  and  rereplication,  FASTA/FASTQ  file
       processing,  masking,  pairwise alignment, searching, shuffling, sorting, and subsampling). We start with
       the general options that apply to all themes. Options may start with a single (-) or  double  dash  (--).
       Option names may be shortened as long as they are not ambiguous (e.g. --derep_f).

       General options:

              --bzip2_decompress
                       When  reading  from  a  pipe  streaming  bzip2-compressed data, decompress the data. That
                       option is not needed when reading from a standard bzip2-compressed file.

              --fasta_width positive integer
                       Fasta files produced by vsearch are wrapped (sequences are written on  lines  of  integer
                       nucleotides, 80 by default). Set that value to zero to eliminate the wrapping.

              --gzip_decompress
                       When reading from a pipe streaming gzip-compressed data, decompress the data. That option
                       is not needed when reading from a standard gzip-compressed file.

              --help | -h
                       Display help text and exit.

              --log filename
                       Write messages to the specified log file. Information written includes  program  version,
                       amount  of  memory  available,  number of cores and command line options, and if need be,
                       informational messages, warnings and fatal errors. The start and finish  times  are  also
                       recorded  as  well  as  the  elapsed  time and the maximum amount of memory consumed. The
                       different vsearch commands can also write additional informations to the log file.

              --maxseqlength positive integer
                       All vsearch operations discard sequences of length equal or greater than integer  (50,000
                       nucleotides by default).

              --minseqlength positive integer
                       All  vsearch operations discard sequences of length smaller than integer: 1 nucleotide by
                       default for sorting  or  shuffling,  32  nucleotides  for  clustering,  dereplication  or
                       searching.

              --no_progress
                       Do not show the gradually increasing progress indicator.

              --notrunclabels
                       Do  not  truncate  sequence  labels  at first space or tab, use the full header in output
                       files.

              --quiet  Suppress all messages to stdout and stderr except for warnings and fatal error messages.

              --threads positive integer
                       Number of computation threads to use (1 to 256). The number of threads should  be  lesser
                       or  equal  to  the  number  of  available  CPU cores. The default is to use all available
                       resources and to launch one thread per logical core. The following  commands  are  multi-
                       threaded:      allpairs_global,     cluster_fast,     cluster_size,     cluster_smallmem,
                       fastq_mergepairs, maskfasta,  search_exact,  uchime_ref,  and  usearch_global.  Only  one
                       thread is used for the other commands.

              --version | -v
                       Output version information and exit.

       Chimera detection options:

              Chimera  detection  is  based  on a scoring function controlled by five options (--dn, --mindiffs,
              --mindiv, --minh, --xn). Sequences are first sorted by decreasing  abundance,  if  available,  and
              compared on their plus strand only (case insensitive).

              Input  sequences  are  masked as specified with the --qmask and --hardmask options. Masking of the
              database for reference based chimera detection is specified with the --dbmask option.

              In de  novo  mode,  input  fasta  file  should  present  abundance  annotations  (i.e.  a  pattern
              [;]size=integer[;]  in  the  fasta  header).  Input  order  matters  for  chimera detection, so we
              recommend to sort sequences by decreasing abundance (default of  --derep_fulllength  command).  If
              your sequence set needs to be sorted, please see the --sortbysize command in the sorting section.

              --abskew real
                       When  using  --uchime_denovo,  the  abundance  skew is used to distinguish in a three-way
                       alignment which sequence is the chimera and which are the parents. The assumption is that
                       chimeras  appear  later  in the PCR amplification process and are therefore less abundant
                       than their parents. For --uchime3_denovo  the  default  value  is  16.0.  For  the  other
                       commands,  the  default  value  is 2.0, which means that the parents should be at least 2
                       times more abundant than their chimera. Any positive value equal or greater than 1.0  can
                       be used.

              --alignwidth positive integer
                       When  using  --uchimealns,  set  the width of the three-way alignments (80 nucleotides by
                       default). Set to zero to eliminate wrapping.

              --borderline filename
                       Output borderline chimeric sequences to filename, in fasta  format.  Borderline  chimeric
                       sequences  are  sequences  that  have  a high enough score but which are not sufficiently
                       different from their closest parent.

              --chimeras filename
                       Output chimeric sequences to filename, in fasta format. Output order may vary when  using
                       multiple threads.

              --db filename
                       When  using  --uchime_ref,  detect chimeras using the fasta-formatted reference sequences
                       contained in filename. Reference sequences  are  assumed  to  be  chimera-free.  Chimeras
                       cannot  be detected if their parents, or sufficiently close relatives, are not present in
                       the database.

              --dn real
                       No vote pseudo-count, corresponding to the parameter n in the  chimera  scoring  function
                       (default value is 1.4).

              --fasta_score
                       Add the chimera score to the headers in the fasta output files for chimeras, non-chimeras
                       and borderline sequences, using the format

              --mindiffs positive integer
                       Minimum number of differences per segment (default value is 3). The parameter is  ignored
                       with --uchime2_denovo and --uchime3_denovo.

              --mindiv real
                       Minimum  divergence  from closest parent (default value is 0.8). The parameter is ignored
                       with --uchime2_denovo and --uchime3_denovo.

              --minh real
                       Minimum score (h). Increasing this value tends to reduce the number  of  false  positives
                       and  to  decrease  sensitivity. Default value is 0.28, and values ranging from 0.0 to 1.0
                       included  are  accepted.   The   parameter   is   ignored   with   --uchime2_denovo   and
                       --uchime3_denovo.

              --nonchimeras filename
                       Output  non-chimeric  sequences  to filename, in fasta format. Output order may vary when
                       using multiple threads.

              --relabel string
                       Relabel sequences using the prefix string and a ticker (1, 2, 3, etc.) to  construct  the
                       new headers. Use --sizeout to conserve the abundance annotations.

              --relabel_keep
                       When relabelling, keep the old identifier in the header after a space.

              --relabel_md5
                       Relabel sequences using the MD5 message digest algorithm applied to each sequence. Former
                       sequence headers are discarded. The sequence is converted to upper case and each  'U'  is
                       replaced  by  a  'T'  before computation of the digest. The MD5 digest is a cryptographic
                       hash function designed to minimize the probability that two  different  inputs  give  the
                       same  output,  even  for  very  similar, but non-identical inputs. Still, there is a very
                       small, but non-zero, probability that two different inputs give the same digest  (i.e.  a
                       collision).  MD5  generates  a  128-bit  (16-byte)  digest  that  is  represented  by  16
                       hexadecimal numbers (using 32 symbols among 0123456789abcdef). Use --sizeout to  conserve
                       the abundance annotations.

              --relabel_sha1
                       Relabel sequences using the SHA1 message digest algorithm applied to each sequence. It is
                       similar to the --relabel_md5 option but uses  the  SHA1  algorithm  instead  of  the  MD5
                       algorithm.  SHA1  generates  a  160-bit  (20-byte)  digest  that  is  represented  by  20
                       hexadecimal numbers (40 symbols). The  probability  of  a  collision  (two  non-identical
                       sequences  resulting in the same digest) is smaller for the SHA1 algorithm than it is for
                       the MD5 algorithm.

              --self   When using --uchime_ref, ignore a reference sequence when its label matches the label  of
                       the query sequence (useful to estimate false-positive rate in reference sequences).

              --selfid When  using  --uchime_ref,  ignore  a  reference sequence when its nucleotide sequence is
                       strictly identical to the nucleotidic sequence of the query.

              --sizeout
                       When  relabelling,  add  abundance  annotations  to  fasta  headers  (using  the   format
                       ';size=integer;').

              --uchime_denovo filename
                       Detect  chimeras  present  in  the  fasta-formatted filename, without external references
                       (i.e. de novo). Automatically sort the sequences  in  filename  by  decreasing  abundance
                       beforehand (see the sorting section for details). Multithreading is not supported.

              --uchime2_denovo filename
                       Detect  chimeras  present  in  the fasta-formatted filename, using the UCHIME2 algorithm.
                       This algorithm is designed for denoised amplicons (see  --cluster_unoise).  Automatically
                       sort  the  sequences  in  filename  by  decreasing  abundance beforehand (see the sorting
                       section for details).  Multithreading is not supported.

              --uchime3_denovo filename
                       Detect chimeras present in the fasta-formatted filename, using the UCHIME2 algorithm. The
                       only  difference  from  --uchime2_denovo  is  that  the  default  minimum  abundance skew
                       (--abskew) is set to 16.0 rather than 2.0.

              --uchime_ref filename
                       Detect chimeras present in the fasta-formatted filename by comparing them with  reference
                       sequences (option --db). Multithreading is supported.

              --uchimealns filename
                       Write  the  three-way  global  alignments (parentA, parentB, chimera) to filename using a
                       human-readable format. Use --alignwidth to modify alignment length. Output order may vary
                       when  using multiple threads. All sequences are converted to upper case before alignment.
                       Lower case letters indicate disagreement in the alignment.

              --uchimeout filename
                       Write chimera detection results to filename using a 18-field,  tab-separated  uchime-like
                       format. Use --uchimeout5 to use a format compatible with usearch v5 and earlier versions.
                       Rows output order may vary when using multiple threads.

                              1.  score: higher score means a more likely chimeric alignment.

                              2.  Q: query sequence label.

                              3.  A: parent A sequence label.

                              4.  B: parent B sequence label.

                              5.  T: top parent sequence label (i.e. parent most similar  to  the  query).  That
                                  field is removed when using --uchimeout5.

                              6.  idQM:  percentage  of  similarity  of query (Q) and model (M) constructed as a
                                  part of parent A and a part of parent B.

                              7.  idQA: percentage of similarity of query (Q) and parent A.

                              8.  idQB: percentage of similarity of query (Q) and parent B.

                              9.  idAB: percentage of similarity of parent A and parent B.

                              10. idQT: percentage of similarity of query (Q) and top parent (T).

                              11. LY: yes votes in the left part of the model.

                              12. LN: no votes in the left part of the model.

                              13. LA: abstain votes in the left part of the model.

                              14. RY: yes votes in the right part of the model.

                              15. RN: no votes in the right part of the model.

                              16. RA: abstain votes in the right part of the model.

                              17. div: divergence, defined as (idQM - idQT).

                              18. YN: query is chimeric (Y), or not (N), or is a borderline case (?).

              --uchimeout5
                       When using --uchimeout, write chimera detection results using a  17-field,  tab-separated
                       uchime-like format (drop the 5th field of --uchimeout), compatible with usearch version 5
                       and earlier versions.

              --xn real
                       No vote weight, corresponding to the parameter beta  in  the  scoring  function  (default
                       value is 8.0).

              --xsize  Strip abundance information from the headers when writing the output file.

       Clustering options:

              vsearch  implements  a  single-pass,  greedy  centroid-based  clustering algorithm, similar to the
              algorithms implemented in usearch, DNAclust and sumaclust for example.  Important  parameters  are
              the global clustering threshold (--id) and the pairwise identity definition (--iddef).

              Input sequences are masked as specified with the --qmask and --hardmask options.

              --biomout filename
                       Generate  an  OTU  table  in  the  biom  version  1.0  JSON  file  format as specified at
                       http://biom-format.org/documentation/format_versions/biom-1.0.html.  The format describes
                       how  to  store  a  sparse  matrix  containing the abundances of the OTUs in the different
                       samples. This format is much more efficient than the classic and mothur OTU table formats
                       available  with  the  --otutabout  and  --mothur_shared_out options, respectively, and is
                       recommended at least for large tables. The OTUs are represented by the cluster centroids.
                       Taxonomy  information will be included for the OTUs if available. Sample identifiers will
                       be extracted from the headers of all sequences in the input file. If the header  contains
                       ';sample=abc123;'  or  ';barcodelabel=abc123;'  or  a  similar string somewhere, then the
                       given sample identifier (here 'abc123') will be used. The semicolon is not  mandatory  at
                       the  beginning  or  end  of  the  header. The sample identifier may contain any printable
                       character except semicolons. If no such sample label is  found,  the  identifier  in  the
                       initial  part  of  the  header will be used, but only letters, digits and underscores are
                       allowed. OTU identifiers will be extracted from  the  headers  of  the  cluster  centroid
                       sequences.  If the header contains ';otu=def789;' or a similar string somewhere, then the
                       given OTU identifier (here 'def789') will be used. The semicolon is not mandatory at  the
                       beginning  or  end  of the header. The OTU identifier may contain any printable character
                       except semicolons. If no such OTU label is found, the identifier in the initial  part  of
                       the header will be used, and all characters except semicolons are allowed. Alternatively,
                       OTU identifers can be generated using the relabelling options (--relabel,  --relabel_sha1
                       or  --relabel_md5).  Taxonomy  information,  if  present, will also be extracted from the
                       headers of the centroid sequences. If  the  header  contains  ';tax=Homo_sapiens;'  or  a
                       similar  string somewhere, then the given taxonomy information (here 'Homo_sapiens') will
                       be used. The semicolon is not mandatory at the  beginning  or  end  of  the  header.  The
                       taxonomy  information  may  contain  any printable character except semicolons. If an OTU
                       table in the biom version 2.1 HDF5 file format is required, the biom utility may be  used
                       as described at http://biom-format.org/documentation/biom_conversion.html.

              --centroids filename
                       Output  cluster  centroid  sequences  to  filename,  in fasta format. The centroid is the
                       sequence that seeded the cluster (i.e. the first sequence of the cluster).

              --clusterout_id
                       Add cluster identifier information to the output  files  when  using  the  --consout  and
                       --profile options.

              --clusterout_sort
                       Sort  output  files  by  decreasing  abundance  when  using  the  --consout, --msaout and
                       --profile options.

              --cluster_fast filename
                       Clusterize the fasta sequences in filename, automatically  sort  by  decreasing  sequence
                       length beforehand.

              --cluster_size filename
                       Clusterize  the  fasta  sequences  in filename, automatically sort by decreasing sequence
                       abundance beforehand.

              --cluster_smallmem filename
                       Clusterize the fasta sequences in filename without automatically  modifying  their  order
                       beforehand.  Sequence  are  expected  to  be sorted by decreasing sequence length, unless
                       --usersort is used.

              --cluster_unoise filename
                       Perform denoising of the fasta sequences in filename according to the  UNOISE  version  3
                       algorithm  by  Robert  Edgar, but without the chimera removal step. The options --minsize
                       (default 8) and --unoise_alpha (default 2.0) may be specified. Chimera removal (de  novo)
                       should be performed afterwards with --uchime3_denovo.

              --clusters string
                       Output  each cluster to a separate fasta file using the prefix string and a ticker (0, 1,
                       2, etc.) to construct the path and filenames.

              --consout filename
                       Output cluster consensus sequences to filename. For each cluster, a multiple alignment is
                       computed,  and  a  consensus  sequence  is  constructed  by  taking  the  majority symbol
                       (nucleotide or gap) from each column of the alignment. Columns containing a  majority  of
                       gaps are skipped, except for terminal gaps.

              --cons_truncate
                       This command is ignored. A warning is issued.

              --id real
                       Do  not add the target to the cluster if the pairwise identity with the centroid is lower
                       than real (value ranging from 0.0 to 1.0 included). The pairwise identity is  defined  as
                       the  number  of  (matching columns) / (alignment length - terminal gaps). That definition
                       can be modified by --iddef.

              --iddef 0|1|2|3|4
                       Change the pairwise identity definition used in --id. Values accepted are:

                              0.  CD-HIT definition: (matching columns) / (shortest sequence length).

                              1.  edit distance: (matching columns) / (alignment length).

                              2.  edit distance excluding terminal gaps (same as --id).

                              3.  Marine Biological Lab  definition  counting  each  gap  opening  (internal  or
                                  terminal)  as  a  single  mismatch, whether or not the gap was extended: 1.0 -
                                  [(mismatches + gap openings)/(longest sequence length)]

                              4.  BLAST definition, equivalent to --iddef 1 in  a  context  of  global  pairwise
                                  alignment.

              --minsize positive integer
                       Specify  the  minimum  abundance  of  sequences for denoising using --cluster_unoise. The
                       default is 8.

              --msaout filename
                       Output a multiple sequence alignment  and  a  consensus  sequence  for  each  cluster  to
                       filename,  in fasta format. Be warned that vsearch computes center star multiple sequence
                       alignments using a fast method whose accuracy can decrease significantly when  using  low
                       pairwise  identity  thresholds.  The  consensus  sequence  is  constructed  by taking the
                       majority symbol (nucleotide or gap) from each column of the alignment. Columns containing
                       a majority of gaps are skipped, except for terminal gaps.

              --mothur_shared_out filename
                       Output  an  OTU table in the mothur 'shared' tab-separated plain text format as described
                       at http://www.mothur.org/wiki/Shared_file. The format describes how a  matrix  containing
                       the  abundances of the OTUs in the different samples is stored. The first line will start
                       with the strings 'label', 'group' and 'numOtus' and is followed by  a  list  of  all  OTU
                       identifiers.  The  following lines, one for each sample, starts with the string 'vsearch'
                       followed by the sample identifier, the total number of OTUs, and a list of abundances for
                       each  OTU  in  that  sample,  in  the  order  given on the first line. The OTU and sample
                       identifiers are extracted  from  the  FASTA  headers  of  the  sequences.  The  OTUs  are
                       represented by the cluster centroids. See the --biomout option for further details.

              --otutabout filename
                       Output an OTU table in the classic tab-separated plain text format as a matrix containing
                       the abundances of the OTUs in the different samples. The first line will start  with  the
                       string  '#OTU  ID' and is followed by a tab-separated list of all sample identifiers. The
                       following lines, one for each OTU, starts with the OTU identifier and is  followed  by  a
                       tab-separated  list  of abundances for that OTU in each sample, in the order given on the
                       first line. The OTU and sample identifiers are extracted from the FASTA  headers  of  the
                       sequences. The OTUs are represented by the cluster centroids. An extra column is added to
                       the right of the table if taxonomy information is available for at least one of the OTUs.
                       This  column  will  be  labelled  'taxonomy'  and each row will then contain the taxonomy
                       information extracted for that OTU. See the --biomout option for further details.

              --profile filename
                       Output a sequence profile to a text file with the frequency of each  nucleotide  in  each
                       position  in  the  multiple alignment for each cluster. There is a FASTA-like header line
                       for each cluster, followed by the profile information  in  a  tab-separated  format.  The
                       eight  columns are: position (0-based), consensus nucleotide, number of As, number of Cs,
                       number of Gs, number of Ts or Us, number of gap symbols, and finally the total number  of
                       ambiguous  nucleotide  symbols  (B,  D,  H,  K,  M,  N, R, S, Y, V or W). All numbers are
                       integers.

              --qmask none|dust|soft
                       Mask regions in sequences using the dust or the soft methods,  or  do  not  mask  (none).
                       Warning,  when  using  soft masking, clustering becomes case sensitive. The default is to
                       mask using dust.

              --relabel string
                       Relabel sequence identifiers in the output files produced  by  --consout,  --profile  and
                       --centroids  options.  Please  see  the  description  of  the  same  option under Chimera
                       detection for details.

              --relabel_keep
                       When relabelling, keep the old identifier in the header after a space.

              --relabel_md5
                       Relabel sequence identifiers in the output files produced  by  --consout,  --profile  and
                       --centroids  options.  Please  see  the  description  of  the  same  option under Chimera
                       detection for details.

              --relabel_sha1
                       Relabel sequence identifiers in the output files produced  by  --consout,  --profile  and
                       --centroids  options.  Please  see  the  description  of  the  same  option under Chimera
                       detection for details.

              --sizein Take into account the abundance annotations present in the input fasta file  (search  for
                       the pattern '[>;]size=integer[;]' in sequence headers).

              --sizeorder
                       When an amplicon is close to 2 or more centroids, both within the distance specified with
                       the --id option, resolve the ambiguity by clustering it  with  the  centroid  having  the
                       highest  abundance,  not necessarily the closest one. The option only has effect when the
                       value specified with --maxaccepts is higher than one. The  --sizeorder  option  turns  on
                       what  is sometimes referred to as abundance-based greedy clustering (AGC), in contrast to
                       the default distance-based greedy clustering (DGC).

              --sizeout
                       Add abundance annotations to the output fasta files (add the pattern specified, abundance
                       annotations  are  reported  to  output  files,  and  each cluster centroid receives a new
                       abundance value corresponding to the total abundance of the  amplicons  included  in  the
                       cluster (--centroids option). If --sizein is not specified, input abundances are set to 1
                       for amplicons, and to the number of amplicons per cluster for centroids.

              --strand plus|both
                       When comparing sequences with the cluster seed, check the plus strand only  (default)  or
                       check both strands.

              --uc filename
                       Output  clustering  results  in filename using a tab-separated uclust-like format with 10
                       columns and 3 different type of entries (S, H or C). Each fasta  sequence  in  the  input
                       file  can  be  either  a cluster centroid (S) or a hit (H) assigned to a cluster. Cluster
                       records (C) summarize information (size, centroid label) for each cluster. In the context
                       of  clustering,  the option --uc_allhits has no effect on the --uc output. Column content
                       varies with the type of entry (S, H or C):

                              1.  Record type: S, H, or C.

                              2.  Cluster number (zero-based).

                              3.  Centroid length (S), query length (H), or cluster size (C).

                              4.  Percentage of similarity with the centroid sequence (H), or set to '*' (S, C).

                              5.  Match orientation + or - (H), or set to '*' (S, C).

                              6.  Not used, always set to '*' (S, C) or to zero (H).

                              7.  Not used, always set to '*' (S, C) or to zero (H).

                              8.  set to '*' (S, C) or, for H, compact representation of the pairwise  alignment
                                  using  the  CIGAR  format  (Compact  Idiosyncratic Gapped Alignment Report): M
                                  (match), D (deletion) and I (insertion). The equal sign '=' indicates that the
                                  query is identical to the centroid sequence.

                              9.  Label of the query sequence (H), or of the centroid sequence (S, C).

                              10. Label of the centroid sequence (H), or set to '*' (S, C).

              --unoise_alpha real
                       Specify the alpha parameter to the --cluster_unoise command. The default i 2.0.

              --usersort
                       When  using  --cluster_smallmem,  allow  any  sequence input order, not just a decreasing
                       length ordering.

              --xsize  Strip abundance information from the headers when writing the output file.

              ...      Most searching options as well as score filtering, gap penalties and masking  also  apply
                       to  clustering  (see  the  Searching  section  for  definitions):  --alnout, --blast6out,
                       --fastapairs, --matched, --notmatched,  --maxaccept,  --maxreject,  --samout,  --userout,
                       --userfields

       Dereplication and rereplication options:

              --derep_fulllength filename
                       Merge strictly identical sequences contained in filename. Identical sequences are defined
                       as having the same length and the same string of nucleotides (case insensitive, T  and  U
                       are considered the same).

              --derep_prefix filename
                       Merge  sequences  with  identical  prefixes  contained  in  filename.   A  short sequence
                       identical to an initial segment (prefix) of another sequence is considered a replicate of
                       the  longer  sequence.  If  a  sequence  is identical to the prefix of two or more longer
                       sequences, it is clustered with the shortest of them. If they are  equally  long,  it  is
                       clustered  with  the  most abundant. Remaining ties are solved using sequence headers and
                       sequence input order. Sequence  comparisons  are  case  insensitive,  and  T  and  U  are
                       considered identical.

              --maxuniquesize positive integer
                       Discard sequences with a post-dereplication abundance value greater than integer.

              --minuniquesize positive integer
                       Discard sequences with a post-dereplication abundance value smaller than integer.

              --output filename
                       Write  the  dereplicated  sequences to filename, in fasta format and sorted by decreasing
                       abundance. Identical sequences receive the header of the first sequence of  their  group.
                       If  --sizeout  is  used,  the  number of occurrences (i.e. abundance) of each sequence is
                       indicated at the end of their fasta header using the pattern

              --relabel string
                       Please see the description of the same option under Chimera detection for details.

              --relabel_keep
                       When relabelling, keep the old identifier in the header after a space.

              --relabel_md5
                       Please see the description of the same option under Chimera detection for details.

              --relabel_sha1
                       Please see the description of the same option under Chimera detection for details.

              --rereplicate filename
                       Duplicate each sequence the number of times indicated by the abundance of  each  sequence
                       in  the  specified  file. The sequence labels are identical for the same sequence, unless
                       --relabel, --relabel_sha1 or --relabel_md5 is used to create  unique  labels.  Output  is
                       written  to the file specified with the --output option, in FASTA format. The output file
                       does not contain abundance information unless --sizeout is specified, in  which  case  an
                       abundance of 1 is used.

              --sizein Take  into  account the abundance annotations present in the input fasta file (search for
                       the pattern '[>;]size=integer[;]' in sequence headers).

              --sizeout
                       Add abundance annotations to the output fasta  file  (add  the  pattern  specified,  each
                       unique  sequence receives a new abundance value corresponding to its total abundance (sum
                       of the abundances of its occurrences). If --sizein is not specified, input abundances are
                       set  to  1,  and each unique sequence receives a new abundance value corresponding to its
                       number of occurrences in the input file.

              --strand plus|both
                       When searching for strictly identical sequences, check the plus strand only (default)  or
                       check both strands.

              --topn positive integer
                       Output only the top integer sequences (i.e. the most abundant).

              --uc filename
                       Output  full-length  or  prefix-dereplication  results  in filename using a tab-separated
                       uclust-like format with 10 columns and 3 different type of entries  (S,  H  or  C).  Each
                       fasta  sequence  in  the  input  file  can  be either a cluster centroid (S) or a hit (H)
                       assigned to a cluster. Cluster records (C) summarize information (size,  centroid  label)
                       for  each cluster. In the context of dereplication, the option --uc_allhits has no effect
                       on the --uc output. Column content varies with the type of entry (S, H or C):

                              1.  Record type: S, H, or C.

                              2.  Cluster number (zero-based).

                              3.  Sequence length (S, H), or cluster size (C).

                              4.  Percentage of similarity with the centroid sequence (H), or set to '*' (S, C).

                              5.  Match orientation + or - (H), or set to '*' (S, C).

                              6.  Not used, always set to '*' (S, C) or 0 (H).

                              7.  Not used, always set to '*' (S, C) or 0 (H).

                              8.  Not used, always set to '*'.

                              9.  Label of the query sequence (H), or of the centroid sequence (S, C).

                              10. Label of the centroid sequence (H), or set to '*' (S, C).

              --xsize  Strip abundance information from the headers when writing the output file.

       FASTA/FASTQ file processing options:

              Analyse, shorten, filter, convert or  merge  sequences  in  FASTQ  files,  or  reverse  complement
              sequences in FASTA or FASTQ files. The --fastq_chars command can be used to analyse FASTQ files to
              identify the quality encoding and the range of quality  score  values  used.  To  convert  between
              different  FASTQ  file  variants,  use  the  --fastq_convert  command. Statistical analysis of the
              quality and length of the sequences in a FASTQ file  may  be  performed  with  the  --fastq_stats,
              --fastq_eestats, and --fastq_eestats2 commands. Sequences may be shortened, filtered and converted
              by the --fastq_filter or --fastx_filter  commands.  Paired-end  reads  can  be  merged  using  the
              --fastq_mergepairs command. Finally, the --fastx_revcomp command reverse-complements sequences.

              --eeout  When  using  --fastq_filter  or --fastq_mergepairs, include the number of expected errors
                       (ee) in the sequence header of FASTQ and FASTA files. This option is  a  synonym  of  the
                       --fastq_eeout option.

              --eetabbedout filename
                       When specified with the --fastq_mergepairs command, write statistics with expected errors
                       of each merged read to the given file. The  file  is  a  tab  separated  file  with  four
                       columns: The number of errors expected in the forward read, the number of expected errors
                       in the reverse read, the number of observed errors in the forward read, and the number of
                       observed  errors  in  the  reverse  read. The observed number of errors are the number of
                       differences in the overlap region of the merged sequence relative to each of the reads in
                       the pair.

              --fastaout filename
                       When  using  --fastq_filter,  --fastq_mergepairs  or  --fastx_filter,  write to the given
                       FASTA-formatted file the sequences passing the filter, or the merged sequences.

              --fastaout_notmerged_fwd filename
                       When using --fastq_mergepairs, write forward reads not  merged  to  the  specified  FASTA
                       file.

              --fastaout_notmerged_rev filename
                       When  using  --fastq_mergepairs,  write  reverse  reads not merged to the specified FASTA
                       file.

              --fastaout_discarded filename
                       Write sequences that do not pass the  filter  of  the  --fastq_filter  or  --fastx_filter
                       command to the given FASTA-formatted file.

              --fastq_allowmergestagger
                       When  using  --fastq_mergepairs, allow to merge staggered read pairs. Staggered pairs are
                       pairs where the 3' end of the reverse read has an overhang to the left of the 5'  end  of
                       the  forward  read. This situation can occur when a very short fragment is sequenced. The
                       3' overhang of the reverse read is not included in  the  merged  sequence.  The  opposite
                       option is the --fastq_nostagger option. The default is to discard staggered pairs.

              --fastq_ascii positive integer
                       Define  the  ASCII  character  number  used as the basis for the FASTQ quality score. The
                       default is 33, which is used by the Sanger / Illumina 1.8+ FASTQ format  (phred+33).  The
                       value 64 is used by the Solexa, Illumina 1.3+ and Illumina 1.5+ formats (phred+64).

              --fastq_asciiout positive integer
                       When  using  --fastq_convert, define the ASCII character number used as the basis for the
                       FASTQ quality score when writing FASTQ output files. The default is 33.

              --fastq_chars filename
                       Summarize the composition of sequence and quality strings contained in  the  input  FASTQ
                       file.  For each of the four DNA letters, --fastq_chars gives the number of occurrences of
                       the letter, its relative frequency and the length of the longest run of that letter.  For
                       each character present in the quality strings, --fastq_chars gives the ASCII value of the
                       character, its relative frequency, and the number of times  a  k-mer  of  that  character
                       appears  at  the  end  of  quality  strings.  The  length  of  the k-mer can be set using
                       --fastq_tail (4 by default). The command --fastq_chars tries to automatically detect  the
                       quality  encoding  (Solexa,  Illumina  1.3+,  Illumina  1.5+  or Illumina 1.8+/Sanger) by
                       analyzing the range of observed quality score values. In case of  success,  --fastq_chars
                       suggests  values  for the --fastq_ascii (33 or 64), --fastq_qmin and --fastq_qmax options
                       to be used with the other commands that require a FASTQ input file.

              --fastq_convert filename
                       Convert between the different variants of the FASTQ file format. The quality encoding  of
                       the  input  file  must  be  specified with the --fastq_ascii option (either 33 or 64, the
                       default  is  33),  and  the  output  quality  encoding  must  be   specified   with   the
                       --fastq_asciiout  option  (default 33). The mimimum and maximum output quality scores may
                       be limited using the --fastq_qminout and --fastq_qmaxout  options.  The  output  file  is
                       specified with the --fastqout option.

              --fastq_eeout
                       When  using  --fastq_filter  or --fastq_mergepairs, include the number of expected errors
                       (ee) in the sequence header of FASTQ and FASTA files. This option is  a  synonym  of  the
                       --eeout option.

              --fastq_eestats filename
                       Analyze  a FASTQ file and report statistics on the distributions of quality scores, error
                       probabilities and expected accumulated errors. The report, a table  of  21  tab-separated
                       columns,  is  written  to  the  file specified with the --output option. The first column
                       corresponds to the position in the reads (Pos). The second and third  columns  correspond
                       to  the  number  of  reads  (Reads)  and  percentage of reads (PctRecs) that include this
                       position. The remaining columns include information about  the  distribution  of  quality
                       scores  in  this position (Q), error probabilities in this position (Pe), and finally the
                       expected number of accumulated errors from the beginning  of  the  reads  and  until  the
                       current  position  (EE).  For  each  of  the  Q,  Pe  and EE distributions, the following
                       statistics are included: minimum value (Min), lower quartile (Low),  median  (Med),  mean
                       (Mean),  upper quartile (Hi), and maximum value (Max). The quality encoding and the range
                       of quality values may be specified with --fastq_ascii --fastq_qmin and --fastq_qmax.

              --fastq_eestats2 filename
                       Analyze the specified FASTQ file and report statistics on the number  of  sequences  that
                       would  be retained at a combination of selected cutoffs for length truncation and maximum
                       expected errors, that could potentially be used as arguments to the --fastq_trunclen  and
                       --fastq_maxee  options to the --fastq_filter command.  The result, a table of two or more
                       columns, is written to the file specified with the --output option. There is a  line  for
                       each  length  truncation  cutoff.  The  first  column  on each line contains the selected
                       truncation length, while the following columns contain the number of  sequences  and,  in
                       parenthesis,  the  percentage  of  sequences  that  would  be retained at the selected EE
                       levels.  The truncation length cutoffs may be specified with the --length_cutoffs  option
                       and requires a list of three comma-separated integers indicating the shortest cutoff, the
                       longest cutoff, and the increment between cutoffs. The longest cutoff  may  be  specified
                       with  a  star  (*) which indicates that the limit is equal to the longest sequence in the
                       input file. The default setting is "50,*,50" meaning that truncation lengths of 50,  100,
                       150  and  so  on  up to the longest sequence length should be used.  The maximum expected
                       error (EE) cutoffs may be specified with the --ee_cutoffs option which requires a  comma-
                       separated  list  of  floating  point  numbers  as  its  argument.  The default setting is
                       "0.5,1.0,2.0" that indicates that expected error levels of 0.5, 1.0  and  2.0  should  be
                       used.

              --fastq_filter filename
                       Shorten  and/or  filter  sequences in the given FASTQ file. Similar to the --fastx_filter
                       command, but works only on FASTQ files. See --fastx_filter for details.

              --fastq_maxdiffs positive integer
                       When using --fastq_mergepairs, specify the maximum  number  of  non-matching  nucleotides
                       allowed  in the overlap region. That option has a strong influence on the merging success
                       rate. The default value is 10.

              --fastq_maxee real
                       When using --fastq_filter, --fastq_mergepairs or --fastx_filter, discard  sequences  with
                       more than the specified number of expected errors.

              --fastq_maxee_rate real
                       When  using  --fastq_filter  or  --fastx_filter,  discard  sequences  with  more than the
                       specified number of expected errors per base.

              --fastq_maxlen positive integer
                       When using --fastq_filter, --fastq_mergepairs or --fastx_filter, discard  sequences  with
                       more than the specified number of bases.

              --fastq_maxmergelen positive integer
                       When  using  --fastq_mergepairs,  specify  the  maximum length of the merged sequence. By
                       default there is no limit.

              --fastq_maxns positive integer
                       When using --fastq_filter, --fastq_mergepairs or --fastx_filter, discard  sequences  with
                       more than the specified number of N's.

              --fastq_mergepairs filename
                       Merge paired-end sequence reads into one sequence. The forward reads are specified as the
                       argument to this option and the reverse reads are specified with  the  --reverse  option.
                       The  merged  sequences  are  output  to  the  file(s)  specified  with  the --fastaout or
                       --fastqout options. The non-merged reads can be output to the files  specified  with  the
                       --fastaout_notmerged_fwd,    --fastaout_notmerged_rev,    --fastqout_notmerged_fwd    and
                       --fastqout_notmerged_rev options. Statistics may be output to the file specified with the
                       --eetabbedout  option.  Sequences  are  truncated as specified with the --fastq_truncqual
                       option to remove low-quality bases in the 3' end. Sequences shorter than  specified  with
                       --fastq_minlen  (after  truncation) are discarded (1 by default). Sequences with too many
                       ambiguous bases (N's), as specified with the --fastq_maxns are also discarded  (no  limit
                       by  default).  Staggered reads are not merged unless the --fastq_allowmergestagger option
                       is specified. The minimum length of the overlap region between the reads may be specified
                       with  the  --fastq_minovlen  option  (default 10), and the overlap region may not include
                       more mismatches  than  specified  with  the  --fastq_maxdiffs  option  (10  by  default),
                       otherwise  the read pair is discarded.  Additional rules will avoid merging of reads that
                       cannot be aligned reliably and unambiguously. The  mimimum  and  maximum  length  of  the
                       merged  sequence  may  be  specified with the --fastq_minmergelen and --fastq_maxmergelen
                       options,  respectively.  Other  relevant  options  are:   --fastq_ascii,   --fastq_maxee,
                       --fastq_nostagger,  --fastq_qmax,  --fastq_qmaxout,  --fastq_qmin,  --fastq_qminout,  and
                       --label_suffix.

              --fastq_minlen positive integer
                       When using --fastq_filter, --fastq_mergepairs or --fastx_filter, discard  sequences  with
                       less than the specified number of bases (default 1).

              --fastq_minmergelen positive integer
                       When  using  --fastq_mergepairs,  specify  the minimum length of the merged sequence. The
                       default is 1.

              --fastq_minovlen positive integer
                       When using --fastq_mergepairs, specify the minimum overlap between the merged reads.  The
                       default is 10.

              --fastq_nostagger
                       When  using  --fastq_mergepairs,  forbid the merging of staggered read pairs. This is the
                       default  behaviour  of  --fastq_mergepairs.   To   change   that   behaviour,   see   the
                       --fastq_allowmergestagger option.

              --fastq_qmax positive integer
                       Specify  the  maximum quality score accepted when reading FASTQ files. The default is 41,
                       which is usual for recent Sanger/Illumina 1.8+ files.

              --fastq_qmaxout positive integer
                       When using --fastq_convert, specify the maximum quality score  used  when  writing  FASTQ
                       files.  The  default  is  41, which is usual for recent Sanger/Illumina 1.8+ files. Older
                       formats may use a maximum quality score of 40.

              --fastq_qmin positive integer
                       Specify the minimum quality score accepted for FASTQ files. The default is  0,  which  is
                       usual  for recent Sanger/Illumina 1.8+ files. Older formats may use scores between -5 and
                       2.

              --fastq_qminout positive integer
                       When using --fastq_convert, specify the minimum quality score  used  when  writing  FASTQ
                       files. The default is 0, which is usual for Sanger/Illumina 1.8+ files. Older versions of
                       the format may use scores between -5 and 2.

              --fastq_stats filename
                       Analyze a FASTQ file and report the number of reads it contains. The quality encoding and
                       the  range  of  quality  values  may  be  specified  with  --fastq_ascii --fastq_qmin and
                       --fastq_qmax. That command requires the --log option and outputs the  following  detailed
                       statistics  on read length, quality score, length vs. quality distributions, and length /
                       quality filtering:

                       Read length distribution:

                              1.  L: read length.

                              2.  N: number of reads.

                              3.  Pct: fraction of reads with this length.

                              4:  AccPct: fraction of reads with this length or longer.

                       Quality score distribution:

                              1.  ASCII: character encoding the quality score.

                              2.  Q: Phred quality score.

                              3.  Pe: probability of error associated with the quality score.

                              4.  N: number of bases with this quality score.

                              5.  Pct: fraction of bases with this quality score.

                              6:  AccPct: fraction of bases with this quality score or higher.

                       Length vs. quality distribution:

                              1.  L: position in reads (starting from position 2).

                              2.  PctRecs: fraction of reads with at least this length.

                              3.  AvgQ: average quality score over all reads up to this position.

                              4.  P(AvgQ): error probability corresponding to AvgQ.

                              5.  AvgP: average error probability.

                              6:  AvgEE: average expected error over all reads up to this position.

                              7:  Rate: growth rate of AvgEE between this position and position - 1.

                              8:  RatePct: Rate (as explained above) expressed as a percentage.

                       Effect of expected error and length filtering:
                              The first column indicates read lengths (L). The next four  columns  indicate  the
                              number  of reads that would be retained by the --fastq_filter command if the reads
                              were truncated at length L (option --fastq_trunclen L)  and  filtered  to  have  a
                              maximum  expected  error  of  1.0, 0.5, 0.25 or 0.1 (with the option --fastq_maxee
                              float). The last four columns  indicate  the  fraction  of  reads  that  would  be
                              retained  by the --fastq_filter command using the same length and maximum expected
                              error parameters.

                       Effect of minimum quality and length filtering:
                              The first column indicates read lengths (Len). The next four columns indicate  the
                              fraction  of  reads  that  would  be retained by the --fastq_filter command if the
                              reads were truncated at length Len (option --fastq_trunclen Len) or at  the  first
                              position with a quality Q below 5, 10, 15 or 20 (option --fastq_truncqual Q).

              --fastq_stripleft positive integer
                       When using --fastq_filter or --fastx_filter, strip the specified number of bases from the
                       left end of the reads.

              --fastq_stripright positive integer
                       When using --fastq_filter or --fastx_filter, strip the specified number of bases from the
                       right end of the reads.

              --fastq_tail positive integer
                       When  using  --fastq_chars,  count the number of times a series of characters of length k
                       appears at the end of quality strings. By default, k = 4.

              --fastq_truncee real
                       When using --fastq_filter or --fastx_filter,  truncate  sequences  so  that  their  total
                       expected error is not higher than the specified value.

              --fastq_trunclen positive integer
                       When  using --fastq_filter or --fastx_filter, truncate sequences to the specified length.
                       Shorter sequences are discarded.

              --fastq_trunclen_keep positive integer
                       When using --fastq_filter or --fastx_filter, truncate sequences to the specified  length.
                       Shorter sequences are not discarded.

              --fastq_truncqual positive integer
                       When  using  --fastq_filter or --fastx_filter, truncate sequences starting from the first
                       base with the specified base quality score value or lower.

              --fastqout filename
                       When using --fastq_filter, --fastq_mergepairs  or  --fastx_filter,  write  to  the  given
                       FASTQ-formatted file the sequences passing the filter, or the merged sequences.

              --fastqout_discarded filename
                       When  using --fastq_filter or --fastx_filter, write sequences that do not pass the filter
                       to the given FASTQ-formatted file.

              --fastqout_notmerged_fwd filename
                       When using --fastq_mergepairs, write forward reads not  merged  to  the  specified  FASTQ
                       file.

              --fastqout_notmerged_rev filename
                       When  using  --fastq_mergepairs,  write  reverse  reads not merged to the specified FASTQ
                       file.

              --fastx_filter filename
                       Shorten and/or filter the sequences in the given FASTA  or  FASTQ  file  and  output  the
                       remaining  sequences  to  the  FASTQ file specified with the --fastqout option and to the
                       FASTA file specified with the --fastaout option. The discarded sequences are  written  to
                       the  files  specified with the --fastaout_discarded and --fastqout_discarded options. The
                       input format (FASTA or FASTQ) is automatically detected. Output can  not  be  written  to
                       FASTQ files if the input is in FASTA format. Sequences may be shortened using the options
                       --fastq_stripleft,      --fastq_stripright,      --fastq_truncee,       --fastq_trunclen,
                       --fastq_trunclen_keep  and  --fastq_truncqual.  The  sequences  may be filtered using the
                       options --fastq_maxee, --fastq_maxee_rate, --fastq_maxlen, --fastq_maxns, --fastq_minlen,
                       --fastq_trunclen,  --maxsize,  and --minsize. If shortening results in an empty sequence,
                       it is discarded. The sequences are  first  shortened  and  then  filtered  based  on  the
                       remaining  bases.  If  no  shortening  or  filtering options are given, all sequences are
                       written to the output files, possibly after conversion from FASTQ to  FASTA  format.  The
                       --relabel  option may be used to relabel the output sequences. The --eeout may be used to
                       output the expected number of errors in each sequence.

              --fastx_revcomp filename
                       Reverse-complement the sequences in the given FASTA or FASTQ file  to  a  file  specified
                       with  the --fastaout and/or --fastqout options. If the input file is in FASTA format, the
                       output can not be written back to a FASTQ file due to missing base quality scores.

              --label_suffix string
                       When using --fastx_revcomp or --fastq_mergepairs,  add  the  suffix  string  to  sequence
                       headers.

              --maxsize positive integer
                       When  using  --fastq_filter or --fastx_filter, discard sequences with an abundance higher
                       than the specified value.

              --minsize positive integer
                       When using --fastq_filter or --fastx_filter, discard sequences with  an  abundance  lower
                       than the specified value.

              --output filename
                       When  using --fastq_eestats or --fastq_eestats2, write tabulated results to filename. See
                       --fastq_eestats's and --fastq_eestats2's documentation for a complete description of  the
                       table.

              --relabel_keep
                       When using --relabel, keep the old identifier in the header after a space.

              --relabel string
                       Please see the description of the same option under Chimera detection for details.

              --relabel_md5
                       Please see the description of the same option under Chimera detection for details.

              --relabel_sha1
                       Please see the description of the same option under Chimera detection for details.

              --reverse filename
                       When  using  --fastq_mergepairs, specify the FASTQ file containing containing the reverse
                       reads.

              --xsize  Strip abundance information from the headers when writing the output file.

       Masking options:

              An input sequence can be composed of lower- or uppercase letters. When soft masking is  specified,
              lower  case  letters are treated as symbols that should be masked. Otherwise the case of the input
              sequences is ignored.
              Masking is performed by the commands for chimera detection (uchime_denovo, uchime_ref), clustering
              (cluster_fast,   cluster_smallmem,   cluster_size),   masking  (maskfasta,  fastx_mask),  pairwise
              alignment (allpairs_global) and searching (search_exact, usearch_global).
              Masking is usually specified with the --qmask option, while the --dbmask option is  used  for  the
              database  sequences  specified  with the --db option with the --usearch_global, --search_exact and
              --uchime_ref commands.
              The argument to the --qmask and --dbmask option may be none, soft or  dust.  If  the  argument  is
              none,  the  no  masking  is  performed. If the argument is soft the lower case symbols are masked.
              Finally, if the argument is dust, the sequence is masked using the DUST algorithm by  Tatusov  and
              Lipman to mask low-complexity regions.
              If  the  --hardmask option is specified, all masked regions are converted to N's, otherwise masked
              regions are indicated by lower case letters.
              If any sequence is masked, the masked version of the sequence (with lower case letters or N's)  is
              used  in all output files. Otherwise the sequence is unmodified. The exception is the sequences in
              the output file specified with the --uchimealns option, where the input sequences are converted to
              upper case first and lower case letters indicate disagreement between the aligned sequences.
              When  a sequence region is masked, words in the region are not included in the indices used in the
              heuristic search algorithm. In all other aspects, the region is treated as other regions.
              Regions in sequences that are hardmasked (with N's)  have  a  zero  alignment  score  and  do  not
              contribute to an alignment.
              Here  are the results of combined masking options --qmask (or --dbmask for database sequences) and
              --hardmask, assuming each input sequence contains both lower and uppercase nucleotides:

                               qmask   hardmask                      action
                               ────────────────────────────────────────────────────────────────
                               none    off        no masking, all symbols used, no change
                               none    on         no masking, all symbols used, no change
                               dust    off        masked symbols lowercased, rest uppercased
                               dust    on         masked symbols changed to Ns, rest unchanged
                               soft    off        lowercase symbols masked, no case changes
                               soft    on         lowercase symbols masked and changed to Ns

              --fastaout filename
                       Write the masked sequences to filename, in fasta format. Applies only to the --fastx_mask
                       command.

              --fastqout filename
                       Write the masked sequences to filename, in fastq format. Applies only to the --fastx_mask
                       command.

              --fastx_mask filename
                       Mask regions in sequences contained in the specified fasta or fastq file. The default  is
                       to  mask using DUST (use --qmask to modify that behavior). The output files are specified
                       with the --fastaout and  --fastqout  options.  The  minimum  and  maximum  percentage  of
                       unmasked  residues  may  be  specified with the --min_unmasked_pct and --max_unmasked_pct
                       options, respectively.

              --hardmask
                       Symbols in masked regions are replaced by N's. The  default  is  to  replace  the  masked
                       regions by lower case letters.

              --maskfasta filename
                       Mask  regions  in  sequences contained in the fasta file filename. The default is to mask
                       using dust (use --qmask to modify that behavior). The output file is specified  with  the
                       --output option. This command is depreciated, please use --fastx_mask instead.

              --max_unmasked_pct real
                       Discard  sequences  with more than the specified maximum percentage of unmasked residues.
                       Works only with --fastx_mask.

              --min_unmasked_pct real
                       Discard sequences with less than the specified minimum percentage of  unmasked  residues.
                       Works only with --fastx_mask.

              --output filename
                       Write the masked sequences to filename, in fasta format. Applies only to the --mask_fasta
                       command.

              --qmask none|dust|soft
                       If the argument is dust, mask regions in sequences using the DUST algorithm that  detects
                       simple  repeats and low-complexity regions. This is the default. If the argument is soft,
                       mask the lower case letters in the input sequence. If the argument is none, do not mask.

       Pairwise alignment options:

              The results of the n * (n - 1) / 2 pairwise alignments are written to the result  files  specified
              with --alnout, --blast6out, --fastapairs --matched, --notmatched, --samout, --uc or --userout (see
              Searching section below). Specify either the --acceptall option to output all pairwise alignments,
              or  specify  an  identity  level  with  --id  to discard weak alignments. Most other accept/reject
              options (see Searching options below) may also be used. Sequences are aligned on their plus strand
              only. Masking is performed as usual and specified with --qmask and --hardmask.

              --acceptall
                       Write  the  results  of  all  alignments to output files. This option overrides all other
                       accept/reject options (including --id).

              --allpairs_global filename
                       Perform optimal global pairwise alignments of all vs. all fasta  sequences  contained  in
                       filename. This command is multi-threaded.

              --id real
                       Reject the sequence match if the pairwise identity is lower than real (value ranging from
                       0.0 to 1.0 included).

              --threads positive integer
                       Number of computation threads to use (1 to 256). The number of threads should  be  lesser
                       or  equal  to  the  number  of  available  CPU cores. The default is to use all available
                       resources and to launch one thread per logical core.

              --uc filename
                       Output pairwise alignment results in filename using a  tab-separated  uclust-like  format
                       with  10  columns.  Each  sequence  is  compared  to  all  other  sequences, and all hits
                       (--acceptall) or only some hits (--id float) are reported, with one  pairwise  comparison
                       per line:

                              1.  Record type, always set to 'H'.

                              2.  Ordinal  number  of  the  target sequence (based on input order, starting from
                                  zero).

                              3.  Sequence length.

                              4.  Percentage of similarity with the target sequence.

                              5.  Match orientation, always set to '+'.

                              6.  Not used, always set to zero.

                              7.  Not used, always set to zero.

                              8.  Compact representation of  the  pairwise  alignment  using  the  CIGAR  format
                                  (Compact Idiosyncratic Gapped Alignment Report): M (match), D (deletion) and I
                                  (insertion). The equal sign '=' indicates that the query is identical  to  the
                                  centroid sequence.

                              9.  Label of the query sequence.

                              10. Label of the target sequence.

       Searching options:

              --alnout filename
                       Write  pairwise global alignments to filename using a human-readable format. Use --rowlen
                       to modify alignment length. Output order may vary when using multiple threads.

              --biomout filename
                       Write search results to an OTU table in the biom version 1.0 file format. The query  file
                       contains  the  samples,  while  the  database  file  contains  the  OTUs.  Sample and OTU
                       identifiers are extracted from the header of these sequences. See the --biomout option in
                       the Clustering section for further details.

              --blast6out filename
                       Write search results to filename using a blast-like tab-separated format of twelve fields
                       (listed below), with  one  line  per  query-target  matching  (or  lack  of  matching  if
                       --output_no_hits  is used). Warning, vsearch uses global pairwise alignments, not blast's
                       seed-and-extend algorithm. Therefore, some common blast output  values  (alignment  start
                       and  end,  evalue,  bit score) are reported differently. Output order may vary when using
                       multiple threads. A similar output can be obtain with --userout filename and --userfields
                       query+target+id+alnlen+mism+opens+qlo+qhi+tlo+thi+evalue+bits.    A   complete  list  and
                       description is available in the section 'Userfields' of this manual.

                              1.  query: query label.

                              2.  target: target (database sequence) label. The field is set to '*' if there  is
                                  no alignment.

                              3.  id:  percentage  of  identity  (real  value  ranging  from  0.0 to 100.0). The
                                  percentage identity is defined as 100 * (matching columns) / (alignment length
                                  - terminal gaps). See fields id0 to id4 for other definitions.

                              4.  alnlen: length of the query-target alignment (number of columns). The field is
                                  set to 0 if there is no alignment.

                              5.  mism: number of mismatches in the alignment (zero or positive integer value).

                              6.  opens: number of columns containing a gap opening (zero  or  positive  integer
                                  value).

                              7.  qlo:  first nucleotide of the query aligned with the target. Always equal to 1
                                  if there is an alignment, 0 otherwise (see qilo to ignore initial gaps).

                              8.  qhi: last nucleotide of the query aligned with the target. Always equal to the
                                  length  of  the  pairwise  alignment, 0 otherwise (see qihi to ignore terminal
                                  gaps).

                              9.  tlo: first nucleotide of the target aligned with the query. Always equal to  1
                                  if there is an alignment, 0 otherwise (see tilo to ignore initial gaps).

                              10. thi: last nucleotide of the target aligned with the query. Always equal to the
                                  length of the pairwise alignment, 0 otherwise (see  tihi  to  ignore  terminal
                                  gaps).

                              11. evalue:  expectancy-value (not computed for nucleotide alignments). Always set
                                  to -1.

                              12. bits: bit score (not computed for nucleotide alignments). Always set to 0.

              --db filename
                       Compare query sequences (specified with --usearch_global) to the  fasta-formatted  target
                       sequences contained in filename, using global pairwise alignment. Alternatively, the name
                       of a preformatted UDB database created using the makeudb_usearch command (see below)  may
                       be specified.

              --dbmask none|dust|soft
                       Mask  regions  in the target database sequences using the dust method or the soft method,
                       or do not mask (none). Warning, when using  soft  masking  search  commands  become  case
                       sensitive. The default is to mask using dust.

              --dbmatched filename
                       Write  database  target  sequences  matching  at least one query sequence to filename, in
                       fasta format. If the option --sizeout is used, the number of queries  that  matched  each
                       target sequence is indicated using the pattern ";size=integer;".

              --dbnotmatched filename
                       Write  database  target  sequences  not  matching  query  sequences to filename, in fasta
                       format.

              --fastapairs filename
                       Write pairwise alignments of query and target sequences to filename, in fasta format.

              --fulldp Dummy option for compatibility with usearch. To maximize search sensitivity, vsearch uses
                       a  8-way  16-bit  SIMD  vectorized full dynamic programming algorithm (Needleman-Wunsch),
                       whether or not --fulldp is specified.

              --gapext string
                       Set penalties for a gap extension. See  --gapopen  for  a  complete  description  of  the
                       penalty  declaration system. The default is to initialize the six gap extending penalties
                       using a penalty of 2 for extending internal  gaps  and  a  penalty  of  1  for  extending
                       terminal gaps, in both query and target sequences (i.e. 2I/1E).

              --gapopen string
                       Set  penalties  for  a gap opening. A gap opening can occur in six different contexts: in
                       the query (Q) or in the target (T) sequence, at the left (L) or right  (R)  extremity  of
                       the sequence, or inside the sequence (I). Sequence symbols (Q and T) can be combined with
                       location symbols (L, I, and R),  and  numerical  values  to  declare  penalties  for  all
                       possible  contexts:  aQL/bQI/cQR/dTL/eTI/fTR, where abcdef are zero or positive integers,
                       and '/' is used as a separator.
                       To simplify declarations, the location symbols (L, I, and R) can be combined, the  symbol
                       (E)  can be used to treat both extremities (L and R) equally, and the symbols Q and T can
                       be omitted to treat query and target sequences equally. For instance, the default  is  to
                       declare a penalty of 20 for opening internal gaps and a penalty of 2 for opening terminal
                       gaps (left or right), in both query  and  target  sequences  (i.e.  20I/2E).  If  only  a
                       numerical  value  is  given,  without  any  sequence or location symbol, then the penalty
                       applies to all gap openings. To forbid gap-opening, an  infinite  penalty  value  can  be
                       declared with the symbol '*'. To use vsearch as a semi-global aligner, a null-penalty can
                       be applied to the left (L) or right (R) gaps.
                       vsearch always initializes the six gap opening penalties  using  the  default  parameters
                       (20I/2E).  The  user  is then free to declare only the values he/she wants to modify. The
                       string is scanned from left to right,  accepted  symbols  are  (0123456789/LIREQT*),  and
                       later values override previous values.
                       Please  note  that  vsearch,  in  contrast to usearch, only allows integer gap penalties.
                       Because the lowest gap penalties are 0.5 by default in usearch, all  default  scores  and
                       gap  penalties  in  vsearch  have  been  doubled  to maintain equivalent penalties and to
                       produce identical alignments.

              --hardmask
                       Mask sequence regions by replacing them with Ns instead of setting them to lower case  as
                       is the default. For more information, please see the Masking section.

              --id real
                       Reject the sequence match if the pairwise identity is lower than real (value ranging from
                       0.0 to 1.0 included). The search process sorts target sequences by decreasing  number  of
                       k-mers they have in common with the query sequence, using that information as a proxy for
                       sequence similarity. That efficient pre-filtering also prevents pairwise alignments  with
                       weakly  matching  targets,  as  there  needs  to be at least 6 shared k-mers to start the
                       pairwise alignment, and at least one out of every 16 k-mers from the query needs to match
                       the  target. Consequently, using values lower than --id 0.5 is not likely to capture more
                       weakly matching targets. The pairwise identity is by default defined  as  the  number  of
                       (matching  columns) / (alignment length - terminal gaps). That definition can be modified
                       by --iddef.

              --iddef 0|1|2|3|4
                       Change the pairwise identity definition used in --id. Values accepted are:

                              0.  CD-HIT definition: (matching columns) / (shortest sequence length).

                              1.  edit distance: (matching columns) / (alignment length).

                              2.  edit distance excluding terminal gaps (default definition for --id).

                              3.  Marine Biological Lab  definition  counting  each  gap  opening  (internal  or
                                  terminal)  as  a  single  mismatch, whether or not the gap was extended: 1.0 -
                                  [(mismatches + gap openings)/(longest sequence length)]

                              4.  BLAST definition, equivalent to --iddef 1 for global pairwise alignments.

                       The option --userfields accepts the fields id0 to id4, in addition to the  field  id,  to
                       report the pairwise identity values corresponding to the different definitions.

              --idprefix positive integer
                       Reject the sequence match if the first integer nucleotides of the target do not match the
                       query.

              --idsuffix positive integer
                       Reject the sequence match if the last integer nucleotides of the target do not match  the
                       query.

              --leftjust
                       Reject the sequence match if the pairwise alignment begins with gaps.

              --match integer
                       Score  assigned  to  a  match (i.e. identical nucleotides) in the pairwise alignment. The
                       default value is 2.

              --matched filename
                       Write query sequences matching database target sequences to filename, in fasta format.

              --maxaccepts positive integer
                       Maximum number of hits to accept before stopping the search. The default value is 1. This
                       option  works  in  pair  with  --maxrejects. The search process sorts target sequences by
                       decreasing number of k-mers they have in common  with  the  query  sequence,  using  that
                       information  as  a proxy for sequence similarity. After pairwise alignments, if the first
                       target sequence passes the acceptation criteria, it is  accepted  as  best  hit  and  the
                       search  process stops for that query. If --maxaccepts is set to a higher value, more hits
                       are accepted. If --maxaccepts and --maxrejects are both set to 0, the  complete  database
                       is searched.

              --maxdiffs positive integer
                       Reject  the  sequence  match  if  the  alignment contains at least integer substitutions,
                       insertions or deletions.

              --maxgaps positive integer
                       Reject the sequence match if the  alignment  contains  at  least  integer  insertions  or
                       deletions.

              --maxhits positive integer
                       Maximum  number  of  hits  to  show  once  the  search  is terminated (hits are sorted by
                       decreasing identity). Unlimited by default. That option applies to --alnout, --blast6out,
                       --fastapairs, --samout, --uc, or --userout output files.

              --maxid real
                       Reject  the  sequence  match  if  the percentage of identity between the two sequences is
                       greater than real.

              --maxqsize positive integer
                       Reject query sequences with an abundance greater than integer.

              --maxqt real
                       Reject if the query/target sequence length ratio is greater than real.

              --maxrejects positive integer
                       Maximum number of non-matching target sequences to consider before stopping  the  search.
                       The  default value is 32. This option works in pair with --maxaccepts. The search process
                       sorts target sequences by decreasing number of k-mers they have in common with the  query
                       sequence,  using  that  information  as  a  proxy for sequence similarity. After pairwise
                       alignments, if none of the first  32  examined  target  sequences  pass  the  acceptation
                       criteria,  the  search process stops for that query (no hit). If --maxrejects is set to a
                       higher value, more target sequences are considered. If --maxaccepts and --maxrejects  are
                       both set to 0, the complete database is searched.

              --maxsizeratio real
                       Reject if the query/target abundance ratio is greater than real.

              --maxsl real
                       Reject if the shorter/longer sequence length ratio is greater than real.

              --maxsubs positive integer
                       Reject  the  sequence  match  if  the  pairwise  alignment  contains  more  than  integer
                       substitutions.

              --mid real
                       Reject the sequence match if the percentage of identity is lower than real (ignoring  all
                       gaps, internal and terminal).

              --mincols positive integer
                       Reject the sequence match if the alignment length is shorter than integer.

              --minqt real
                       Reject if the query/target sequence length ratio is lower than real.

              --minsizeratio real
                       Reject if the query/target abundance ratio is lower than real.

              --minsl real
                       Reject if the shorter/longer sequence length ratio is lower than real.

              --mintsize positive integer
                       Reject target sequences with an abundance lower than integer.

              --minwordmatches non-negative integer
                       Minimum  number of word matches required for a sequence to be considered further. Default
                       value is 12 for the default word length 8. For word lengths  3-15,  the  default  minimum
                       word  matches  are 18, 17, 16, 15, 14, 12, 11, 10, 9, 8, 7, 5 and 3, respectively. If the
                       query sequence has fewer unique words than the number specified, all words in  the  query
                       must match. If the argument is 0, no word matches are required.

              --mismatch integer
                       Score  assigned to a mismatch (i.e. different nucleotides) in the pairwise alignment. The
                       default value is -4.

              --mothur_shared_out filename
                       Write search results to an OTU table in the mothur 'shared' tab-separated plain text file
                       format.  The  query file contains the samples, while the database file contains the OTUs.
                       Sample and OTU identifiers are extracted from the header  of  these  sequences.  See  the
                       --otutabout option in the Clustering section for further details.

              --notmatched filename
                       Write  query  sequences  not  matching  database  target  sequences to filename, in fasta
                       format.

              --otutabout filename
                       Write search results to an OTU table in the classic tab-separated plain text format.  The
                       query  file  contains  the samples, while the database file contains the OTUs. Sample and
                       OTU  identifiers  are  extracted  from  the  header   of   these   sequences.   See   the
                       --mothur_shared_out option in the Clustering section for further details.

              --output_no_hits
                       Write  both  matching  and  non-matching  queries  to  --alnout, --blast6out, --samout or
                       --userout output files. Non-matching queries are labelled 'No hits' in --alnout files.

              --pattern string
                       This option is ignored. It is provided for compatibility with usearch.

              --qmask none|dust|soft
                       Mask regions in the query sequences using the dust or the soft algorithms, or do not mask
                       (none).  Warning,  when  using  soft  masking  search commands become case sensitive. The
                       default is to mask using dust.

              --query_cov real
                       Reject if the fraction of the query aligned to the target sequence is  lower  than  real.
                       The  query  coverage  is  computed  as  (matches  +  mismatches) / query sequence length.
                       Internal or terminal gaps are not taken into account.

              --rightjust
                       Reject the sequence match if the pairwise alignment ends with gaps.

              --rowlen positive integer
                       Width of alignment lines in --alnout output. The  default  value  is  64.  Set  to  0  to
                       eliminate wrapping.

              --samheader
                       Include  header  lines  to  the  SAM file when --samout is specified. The header includes
                       lines   starting   with   @HD,    @SQ    and    @PG,    but    no    @RG    lines    (see
                       <https://github.com/samtools/hts-specs>). By default no header line is written.

              --samout filename
                       Write  alignment  results  to  filename using the SAM format (a tab-separated text file).
                       When using the --samheader option, the SAM file starts with header lines. Each non-header
                       line  is a SAM record, which represents either a query-target alignment or the absence of
                       match for a query (output order may  vary  when  using  multiple  threads).  Each  record
                       contains  11  mandatory fields and optional fields (see <https://github.com/samtools/hts-
                       specs> for a complete description of the format):

                              1.  query sequence label.

                              2.  combination of bitwise flags. Possible values are: 0 (top hit), 4 (no hit), 16
                                  (reverse-complemented  hit),  256 (secondary hit, i.e. all hits except the top
                                  hit).

                              3.  target sequence label.

                              4.  first position of a target  aligned  with  the  query  (always  1  for  global
                                  pairwise alignments, 0 if there is no match).

                              5.  mapping quality (ignored, always set to '*').

                              6.  CIGAR string (set to '*' if there is no match).

                              7.  name of the target sequence matching with the next read of the query (for mate
                                  reads only, ignored and always set to '*').

                              8.  position of the primary alignment of the next read  of  the  query  (for  mate
                                  reads only, ignored and always set to 0).

                              9.  target  sequence  length (for multi-segment targets, ignored and always set to
                                  0).

                              10. query sequence (complete, not only  the  segment  aligned  to  the  target  as
                                  usearch does).

                              11. quality string (ignored, always set to '*').

                       Optional fields for query-target matches (number and order of fields
                              may vary):

                              12. AS:i:? alignment score (i.e. percentage of identity).

                              13. XN:i:? next best alignment score (always set to 0).

                              14. XM:i:? number of mismatches.

                              15. XO:i:? number of gap openings (excluding terminal gaps).

                              16. XG:i:? number of gap extensions (excluding terminal gaps).

                              17. NM:i:? edit distance to the target (sum of XM and XG).

                              18. MD:Z:? string for mismatching positions.

                              19. YT:Z:UU string representing the alignment type.

              --search_exact filename
                       Search  for exact full-length matches to the query sequences contained in filename in the
                       database of target sequences (--db). Only  100%  exact  matches  are  reported  and  this
                       command  is  much  faster  than --usearch_global. The --id, --maxaccepts and --maxrejects
                       options are ignored, but the rest of the searching options may be specified.

              --self   Reject the sequence match if the query and target labels are identical.

              --selfid Reject the sequence match if the query and target sequences are strictly identical.

              --sizeout
                       Add abundance annotations to the output of the  option  --dbmatched  (using  the  pattern
                       ';size=integer;'), to report the number of queries that matched each target.

              --strand plus|both
                       When  searching for similar sequences, check the plus strand only (default) or check both
                       strands.

              --target_cov real
                       Reject the sequence match if the fraction of the target sequence  aligned  to  the  query
                       sequence  is lower than real. The target coverage is computed as (matches + mismatches) /
                       target sequence length.  Internal or terminal gaps are not taken into account.

              --top_hits_only
                       Only the top hits between the query and database sequence sets are written to the  output
                       specified   with   the   options   --alnout,   --samout,  --userout,  --blast6out,  --uc,
                       --fastapairs, --matched or --notmatched (but not  --dbmatched  and  --dbnotmatched).  For
                       each query, the top hit is the one presenting the highest percentage of identity (see the
                       --iddef option to change the way identity is measured). For a given query, if several top
                       hits  present  exactly  the  same  percentage of identity, the number of hits reported is
                       controlled by the --maxaccepts value (1 by default).

              --uc filename
                       Output searching results in filename using a tab-separated  uclust-like  format  with  10
                       columns.  When  using  the --search_exact command, the table layout is the same than with
                       the --allpairs_global. When using the --usearch_global command,  the  table  present  two
                       different  type of entries: hit (H) or no hit (N). Each query sequence is compared to all
                       other sequences, and the best hit (--maxaccept 1) or several hits (--maxaccept >  1)  are
                       reported  (H).  Output  order may vary when using multiple threads. Column content varies
                       with the type of entry (H or N):

                              1.  Record type: H, or N ('hit' or 'no hit').

                              2.  Ordinal number of the target sequence (based on  input  order,  starting  from
                                  zero). Set to '*' for N.

                              3.  Sequence length. Set to '*' for N.

                              4.  Percentage of similarity with the target sequence. Set to '*' for N.

                              5.  Match orientation + or -. . Set to '.' for N.

                              6.  Not used, always set to zero for H, or '*' for N.

                              7.  Not used, always set to zero for H, or '*' for N.

                              8.  Compact  representation  of  the  pairwise  alignment  using  the CIGAR format
                                  (Compact Idiosyncratic Gapped Alignment Report): M (match), D (deletion) and I
                                  (insertion).  The  equal sign '=' indicates that the query is identical to the
                                  centroid sequence. Set to '*' for N.

                              9.  Label of the query sequence.

                              10. Label of the target centroid sequence. Set to '*' for N.

              --uc_allhits
                       When using the --uc option, show all hits, not just the top hit for each query.

              --usearch_global filename
                       Compare target sequences (--db) to  the  fasta-formatted  query  sequences  contained  in
                       filename, using global pairwise alignment.

              --userfields string
                       When  using --userout, select and order the fields written to the output file. Fields are
                       separated by '+' (e.g. query+target+id). See the 'Userfields' section for a complete list
                       of fields.

              --userout filename
                       Write  user-defined  tab-separated  output to filename. Select the fields with the option
                       --userfields. Output order may vary when using multiple threads. If --userfields is empty
                       or not present, filename is empty.

              --weak_id real
                       Show hits with percentage of identity of at least real, without terminating the search. A
                       normal search stops as soon as  enough  hits  are  found  (as  defined  by  --maxaccepts,
                       --maxrejects,  and  --id).  As  --weak_id  reports  weak  hits  that are not deduced from
                       --maxaccepts, high --id values can be used, hence preserving both speed and  sensitivity.
                       Logically, real must be smaller than the value indicated by --id.

              --wordlength positive integer
                       Length  of  words  (i.e. k-mers) for database indexing. The range of possible values goes
                       from 3 to 15, but values near 8 or 9 are generally recommended. Longer words  may  reduce
                       the  sensitivity/recall  for  weak similarities, but can increase precision. On the other
                       hand, shorter words may  increase  sensitivity  or  recall,  but  may  reduce  precision.
                       Computation  time generally increases with shorter words and decreases with longer words,
                       but it increases again for very long words. Memory requirements for a part of  the  index
                       increase  with  a factor of 4 each time word length increases by one nucleotide, and this
                       generally becomes significant for long words (12 or more). The default value is 8.

       Shuffling options:
              Fasta entries in the input file are outputted in a pseudo-random order.

              --output filename
                       Write the shuffled sequences to filename, in fasta format.

              --randseed positive integer
                       When shuffling sequence order, use integer as seed. A given seed always produces the same
                       output  order  (useful  for replicability). Set to 0 to use a pseudo-random seed (default
                       behavior).

              --relabel string
                       Relabel sequences using the prefix string and a ticker (1, 2, 3, etc.) to  construct  the
                       new headers. Use --sizeout to conserve the abundance annotations.

              --relabel_keep
                       When relabelling, keep the old identifier in the header after a space.

              --relabel_md5
                       Relabel sequences using the MD5 message digest algorithm applied to each sequence. Former
                       sequence headers are discarded. The sequence is converted to upper case and U is replaced
                       by  T  before  the  digest  is  computed. The MD5 digest is a cryptographic hash function
                       designed to minimize the probability that two different inputs  gives  the  same  output,
                       even for very similar, but non-identical inputs. Still, there is always a very small, but
                       non-zero probability that two different inputs give  the  same  result.  The  MD5  digest
                       generates a 128-bit (16-byte) digest that is represented by 16 hexadecimal numbers (using
                       32 symbols among 0123456789abcdef). Use --sizeout to conserve the abundance annotations.

              --relabel_sha1
                       Relabel sequences using the SHA1 message digest algorithm applied to each sequence. It is
                       similar  to  the  --relabel_md5  option  but  uses  the SHA1 algorithm instead of the MD5
                       algorithm. The SHA1 digest generates a 160-bit (20-byte) result that is represented by 20
                       hexadecimal  numbers  (40  symbols).  The  probability  of a collision (two non-identical
                       sequences having the same digest) is smaller for the SHA1 algorithm than it  is  for  the
                       MD5 algorithm. Use --sizeout to conserve the abundance annotations.

              --sizeout
                       When  using  --relabel,  --relabel_md5  or  --relabel_sha1, preserve and report abundance
                       annotations to the output fasta file (using the pattern ';size=integer;').

              --shuffle filename
                       Pseudo-randomly shuffle the order of sequences contained in filename.

              --topn positive integer
                       Output only the first integer sequences after pseudo-random reordering.

              --xsize  Strip abundance information from the headers when writing the output file.

       Sorting options:
              Fasta  entries  are  sorted  by   decreasing   abundance   (--sortbysize)   or   sequence   length
              (--sortbylength).  To  obtain  a stable sorting order, ties are sorted by decreasing abundance and
              label increasing alpha-numerical order  (--sortbylength),  or  just  by  label  increasing  alpha-
              numerical  order  (--sortbysize). Label sorting assumes that all sequences have unique labels. The
              same applies to  the  automatic  sorting  performed  during  chimera  checking  (--uchime_denovo),
              dereplication (--derep_fulllength), and clustering (--cluster_fast and --cluster_size).

              --maxsize positive integer
                       When using --sortbysize, discard sequences with an abundance value greater than integer.

              --minsize positive integer
                       When using --sortbysize, discard sequences with an abundance value smaller than integer.

              --output filename
                       Write the sorted sequences to filename, in fasta format.

              --relabel string
                       Please see the description of the same option under Chimera detection for details.

              --relabel_keep
                       When relabelling, keep the old identifier in the header after a space.

              --relabel_md5
                       Please see the description of the same option under Chimera detection for details.

              --relabel_sha1
                       Please see the description of the same option under Chimera detection for details.

              --sizeout
                       When  using  --relabel,  report abundance annotations to the output fasta file (using the
                       pattern ';size=integer;').

              --sortbylength filename
                       Sort by decreasing length the sequences contained in filename. See  the  general  options
                       --minseqlength and --maxseqlength to eliminate short and long sequences.

              --sortbysize filename
                       Sort  by  decreasing  abundance  the  sequences  contained in filename (missing abundance
                       values are assumed to be ';size=1'). See the options --minsize and --maxsize to eliminate
                       rare and dominant sequences.

              --topn positive integer
                       Output only the top integer sequences (i.e. the longest or the most abundant).

              --xsize  Strip abundance information from the headers when writing the output file.

       Subsampling options:
              Subsampling  randomly  extracts  a  certain number or a certain percentage of the sequences in the
              input file. If the --sizein option is in effect, the abundances of the input  sequences  is  taken
              into account and the sampling is performed as if the input sequences were rereplicated, subsampled
              and dereplicated before being written to the output file. The extraction is performed as a  random
              sampling  with  a  uniform  distribution  among  the  input  sequences  and  is  performed without
              replacement. The input file is specified with  --fastx_subsample  option,  the  output  files  are
              specified  with the --fastaout and --fastqout options and the amount of sequences to be sampled is
              specified with the --sample_pct or --sample_size options. The sequences not sampled may be written
              to  files  specified  with the options --fasta_discarded and --fastq_discarded. The --fastq_ascii,
              --fastq_qmin and --fastq_qmax options are also available.

              --fastaout filename
                       Write the sampled sequences to filename, in fasta format.

              --fastaout_discarded filename
                       Write the sequences not sampled to filename, in fasta format.

              --fastq_ascii positive integer
                       Define the ASCII character number used as the basis for  the  FASTQ  quality  score.  The
                       default  is  33, which is used by the Sanger / Illumina 1.8+ FASTQ format (phred+33). The
                       value 64 is used by the Solexa, Illumina 1.3+ and Illumina 1.5+ formats (phred+64).

              --fastq_qmax positive integer
                       Specify the maximum quality score accepted when reading FASTQ files. The default  is  41,
                       which is usual for recent Sanger/Illumina 1.8+ files.

              --fastq_qmin positive integer
                       Specify  the  minimum  quality score accepted for FASTQ files. The default is 0, which is
                       usual for recent Sanger/Illumina 1.8+ files. Older formats may use scores between -5  and
                       2.

              --fastqout filename
                       Write the sampled sequences to filename, in fastq format. Requires input in fastq format.

              --fastqout_discarded filename
                       Write  the  sequences  not  sampled to filename, in fastq format. Requires input in fastq
                       format.

              --fastx_subsample filename
                       Perform subsampling from the sequences in the specified input file that is  in  FASTA  or
                       FASTQ format.

              --randseed positive integer
                       Use  integer  as a seed for the pseudo-random generator. A given seed always produces the
                       same output, which is useful for replicability. Set to 0  to  use  a  pseudo-random  seed
                       (default behavior).

              --relabel string
                       Relabel  sequences  using the prefix string and a ticker (1, 2, 3, etc.) to construct the
                       new headers. Use --sizeout to conserve the abundance annotations.

              --relabel_keep
                       When relabelling, keep the old identifier in the header after a space.

              --relabel_md5
                       Relabel sequences using the MD5 message digest algorithm applied to each sequence. Former
                       sequence headers are discarded. The sequence is converted to upper case and U is replaced
                       by T before the digest is computed. The MD5  digest  is  a  cryptographic  hash  function
                       designed to minimize the probability that two different inputs give the same output, even
                       for very similar, but non-identical inputs. Still, there is always a very small, but non-
                       zero probability that two different inputs give the same result. The MD5 digest generates
                       a 128-bit (16-byte) digest that is  represented  by  16  hexadecimal  numbers  (using  32
                       symbols among 0123456789abcdef). Use --sizeout to conserve the abundance annotations.

              --relabel_sha1
                       Relabel sequences using the SHA1 message digest algorithm applied to each sequence. It is
                       similar to the --relabel_md5 option but uses  the  SHA1  algorithm  instead  of  the  MD5
                       algorithm. The SHA1 digest generates a 160-bit (20-byte) result that is represented by 20
                       hexadecimal numbers (40 symbols). The  probability  of  a  collision  (two  non-identical
                       sequences  having  the  same digest) is smaller for the SHA1 algorithm than it is for the
                       MD5 algorithm. Use --sizeout to conserve the abundance annotations.

              --sample_pct real
                       Subsample the given percentage of the input sequences. Accepted values range from 0.0  to
                       100.0.

              --sample_size positive integer
                       Extract the given number of sequences.

              --sizein Take the abundance information of the input file into account, otherwise the abundance of
                       each sequence is considered to be 1.

              --sizeout
                       Write abundance information to the output file.

              --xsize  Strip abundance information from the headers when writing the output file.

       UDB options:
              Databases to be used with the --usearch_global command may be prepared from FASTA files and stored
              to  a  binary  UDB  formatted  file  in  order  to speed up searching. This may be worthwhile when
              searching a large database repeatedly. The sequences are indexed and stored in a way that  can  be
              quickly  loaded  into memory. The commands and options below can be used to create and inspect UDB
              files. An UDB file may be specified with the --db option instead of a FASTA  formatted  file  with
              the --usearch_global command.

              --dbmask none|dust|soft
                       Specify the sequence masking method used with the --makeudb_usearch command, either none,
                       dust or soft. No masking is performed when none is specified. When dust is specified, the
                       DUST  algorithm will be used for masking low complexity regions (short repeats and skewed
                       composition). Lower case letters in the input file will be masked when soft is  specified
                       (soft masking).

              --hardmask
                       Mask sequences by replacing letters with N for the --makeudb_usearch command. The default
                       is to use lower case letters (soft masking).

              --makeudb_usearch filename
                       Create an UDB database file from the FASTA-formatted sequences in the file with the given
                       filename. The UDB database is written to the file specified with the --output option.

              --output filename
                       Specify  the  filename  of  a  FASTA  or UDB output file for the --makeudb_usearch or the
                       --udb2fasta command, respectively.

              --udb2fasta filename
                       Read the UDB database in the file with the given filename and  output  the  sequences  in
                       FASTA format in the file specified by the --output option.

              --udbinfo filename
                       Show information about the UDB database in the file with the given filename.

              --udbstats filename
                       Report  statistics about the indexed words in the UDB database in the file with the given
                       filename.

              --wordlength positive integer
                       Specify the length of the words to be used when creating the UDB database index using the
                       --makeudb_usearch command. Valid numbers range from 3 to 15. The default is 8.

       Userfields (fields accepted by the --userfields option):

              aln      Print  a  string  of M (match), D (delete, i.e. a gap in the query) and I (insert, i.e. a
                       gap in the target) representing the pairwise  alignment.  Empty  field  if  there  is  no
                       alignment.

              alnlen   Print the length of the query-target alignment (number of columns). The field is set to 0
                       if there is no alignment.

              bits     Bit score (not computed for nucleotide alignments). Always set to 0.

              caln     Compact representation  of  the  pairwise  alignment  using  the  CIGAR  format  (Compact
                       Idiosyncratic  Gapped Alignment Report): M (match), D (deletion) and I (insertion). Empty
                       field if there is no alignment.

              evalue   E-value (not computed for nucleotide alignments). Always set to -1.

              exts     Number of columns containing a gap extension (zero or positive integer value).

              gaps     Number of columns containing a gap (zero or positive integer value).

              id       Percentage of identity (real value ranging from 0.0 to 100.0). The percentage identity is
                       defined as 100 * (matching columns) / (alignment length - terminal gaps).

              id0      CD-HIT  definition  of  the percentage of identity (real value ranging from 0.0 to 100.0)
                       using the length of the shortest sequence in the pairwise alignment as denominator: 100 *
                       (matching columns) / (shortest sequence length).

              id1      The  percentage of identity (real value ranging from 0.0 to 100.0) is defined as the edit
                       distance: 100 * (matching columns) / (alignment length).

              id2      The percentage of identity (real value ranging from 0.0 to 100.0) is defined as the  edit
                       distance, excluding terminal gaps. The field id2 is an alias for the field id.

              id3      Marine  Biological  Lab definition of the percentage of identity (real value ranging from
                       0.0 to 100.0), counting each gap opening (internal or terminal)  as  a  single  mismatch,
                       whether  or not the gap was extended, and using the length of the longest sequence in the
                       pairwise alignment as denominator: 100 * (1.0 - [(mismatches + gaps) / (longest  sequence
                       length)]).

              id4      BLAST  definition  of  the percentage of identity (real value ranging from 0.0 to 100.0),
                       equivalent to --iddef 1 in a context of global  pairwise  alignment.  The  field  id4  is
                       always equal to the field id1.

              ids      Number of matches in the alignment (zero or positive integer value).

              mism     Number of mismatches in the alignment (zero or positive integer value).

              opens    Number of columns containing a gap opening (zero or positive integer value).

              pairs    Number  of  columns  containing only nucleotides. That value corresponds to the length of
                       the alignment minus the gap-containing columns (zero or positive integer value).

              pctgaps  Number of columns containing gaps expressed as a percentage of the alignment length (real
                       value ranging from 0.0 to 100.0).

              pctpv    Percentage  of  positive  columns.  When  working  with  nucleotide  sequences,  this  is
                       equivalent to the percentage of matches (real value ranging from 0.0 to 100.0).

              pv       Number of positive columns. When working with nucleotide sequences, this is equivalent to
                       the number of matches (zero or positive integer value).

              qcov     Fraction  of  the  query  sequence  that  is aligned with the target sequence (real value
                       ranging from 0.0 to 100.0). The  query  coverage  is  computed  as  100.0  *  (matches  +
                       mismatches)  /  query  sequence  length.   Internal  or  terminal gaps are not taken into
                       account. The field is set to 0.0 if there is no alignment.

              qframe   Query frame (-3 to +3). That field only concerns coding sequences and is not computed  by
                       vsearch. Always set to +0.

              qhi      Last  nucleotide  of the query aligned with the target. Always equal to the length of the
                       pairwise alignment, 0 otherwise (see qihi to ignore terminal gaps).

              qihi     Last nucleotide of the query aligned with the target (ignoring terminal gaps). Nucleotide
                       numbering starts from 1. The field is set to 0 if there is no alignment.

              qilo     First nucleotide of the query aligned with the target (ignoring initial gaps). Nucleotide
                       numbering starts from 1. The field is set to 0 if there is no alignment.

              ql       Query sequence length (positive integer value). The field is set to  0  if  there  is  no
                       alignment.

              qlo      First  nucleotide  of the query aligned with the target. Always equal to 1 if there is an
                       alignment, 0 otherwise (see qilo to ignore initial gaps).

              qrow     Print the sequence of the query segment as seen in the pairwise alignment (i.e. with  gap
                       insertions if need be). Empty field if there is no alignment.

              qs       Query segment length. Always equal to query sequence length.

              qstrand  Query  strand  orientation  (+ or - for nucleotide sequences). Empty field if there is no
                       alignment.

              query    Query label.

              raw      Raw alignment score (negative, null or positive integer value). The score is the  sum  of
                       match rewards minus mismatch penalties, gap openings and gap extensions. The field is set
                       to 0 if there is no alignment.

              target   Target label. The field is set to '*' if there is no alignment.

              tcov     Fraction of the target sequence that is aligned  with  the  query  sequence  (real  value
                       ranging  from  0.0  to  100.0).  The  target  coverage  is computed as 100.0 * (matches +
                       mismatches) / target sequence length.  Internal or  terminal  gaps  are  not  taken  into
                       account.  The field is set to 0.0 if there is no alignment.

              tframe   Target frame (-3 to +3). That field only concerns coding sequences and is not computed by
                       vsearch. Always set to +0.

              thi      Last nucleotide of the target aligned with the query. Always equal to the length  of  the
                       pairwise alignment, 0 otherwise (see tihi to ignore terminal gaps).

              tihi     Last nucleotide of the target aligned with the query (ignoring terminal gaps). Nucleotide
                       numbering starts from 1. The field is set to 0 if there is no alignment.

              tilo     First nucleotide of the target aligned with the query (ignoring initial gaps). Nucleotide
                       numbering starts from 1. The field is set to 0 if there is no alignment.

              tl       Target  sequence  length  (positive  integer value). The field is set to 0 if there is no
                       alignment.

              tlo      First nucleotide of the target aligned with the query. Always equal to 1 if there  is  an
                       alignment, 0 otherwise (see tilo to ignore initial gaps).

              trow     Print the sequence of the target segment as seen in the pairwise alignment (i.e. with gap
                       insertions if need be). Empty field if there is no alignment.

              ts       Target segment length. Always equal to target sequence length. The field is set to  0  if
                       there is no alignment.

              tstrand  Target  strand  orientation  (+  or  -  for  nucleotide sequences). Always set to '+', so
                       reverse strand matches have tstrand '+' and qstrand

DELIBERATE CHANGES

       If you are a usearch user, our objective is to make you feel at home. That's why vsearch was designed  to
       behave  like  usearch,  to  some  extent.  Like any complex software, usearch is not free from quirks and
       inconsistencies. We decided not to reproduce some of them, and for  complete  transparency,  to  document
       here the deliberate changes we made.

       During  a  search with usearch, when using the options --blast6out and --output_no_hits, for queries with
       no match the number of fields reported is 13, where it should be 12. This is corrected in vsearch.

       The field raw of the --userfields option is not informative in usearch. This is corrected in vsearch.

       The fields qlo, qhi, tlo, thi  now  have  counterparts  (qilo,  qihi,  tilo,  tihi)  reporting  alignment
       coordinates ignoring terminal gaps.

       In  usearch,  when  using  the  option  --output_no_hits,  queries  that receive no match are reported in
       --blast6out file, but not in the alignment output file. This is corrected in vsearch.

       vsearch introduces a new --cluster_size command that  sorts  sequences  by  decreasing  abundance  before
       clustering.

       vsearch reintroduces --iddef alternative pairwise identity definitions that were removed from usearch.

       vsearch extends the --topn option to sorting commands.

       vsearch   extends   the   --sizein   option   to   dereplication   (--derep_fulllength)   and  clustering
       (--cluster_fast).

       vsearch treats T and U as identical nucleotides during dereplication.

       vsearch sorting is stabilized by using sequence abundances or sequences labels as secondary  or  tertiary
       keys.

       vsearch  by  default uses the DUST algorithm for masking low-complexity regions. Masking behavior is also
       slightly changed to be more consistent.

NOVELTIES

       vsearch introduces new commands and new options not present in usearch  7.  They  are  described  in  the
       'Options' section of this manual. Here is a short list:

              - uchime2_denovo, uchime3_denovo, alignwidth, borderline, fasta_score (chimera checking)

              - cluster_size, cluster_unoise, clusterout_id, clusterout_sort, profile (clustering)

              - fasta_width, gzip_decompress, bzip2_decompress (general option)

              - iddef (clustering, pairwise alignment, searching)

              - maxuniquesize (dereplication)

              - relabel_md5  and  relabel_sha1  (chimera  detection, dereplication, FASTQ processing, shuffling,
                sorting)

              - shuffle (shuffling)

              - fastq_eestats, fastq_eestats2, fastq_maxlen, fastq_truncee (FASTQ processing)

              - fastaout_discarded, fastqout_discarded (subsampling)

              - rereplicate (dereplication/rereplication)

EXAMPLES

       Align all sequences in a database with each other and output all pairwise alignments:

              vsearch --allpairs_global database.fas --alnout results.aln --acceptall

       Check for the presence of chimeras (de novo); parents should be at least 1.5  times  more  abundant  than
       chimeras. Output non-chimeric sequences in fasta format (no wrapping):

              vsearch --uchime_denovo queries.fas --abskew 1.5 --nonchimeras results.fas --fasta_width 0

       Cluster  with a 97% similarity threshold, collect cluster centroids, and write cluster descriptions using
       a uclust-like format:

              vsearch --cluster_fast queries.fas --id 0.97 --centroids centroids.fas --uc clusters.uc

       Dereplicate the sequences contained in queries.fas, take into account the abundance  information  already
       present,  write  unwrapped  fasta  sequences  to  queries_unique.fas  with the new abundance information,
       discard all sequences with an abundance of 1:

              vsearch   --derep_fulllength   queries.fas   --sizein   --fasta_width   0    --sizeout    --output
              queries_unique.fas --minuniquesize 2

       Mask  simple  repeats  and low complexity regions in the input fasta file with the DUST algorithm (masked
       regions are lowercased), and write the results to the output file:

              vsearch --maskfasta queries.fas --qmask dust --output queries_masked.fas

       Search queries in a reference database, with a 80%-similarity threshold, take terminal gaps into  account
       when calculating pairwise similarities, output pairwise alignments:

              vsearch --usearch_global queries.fas --db references.fas --id 0.8 --iddef 1 --alnout results.aln

       Search  a  sequence  dataset  against  itself  (ignore  self  hits),  get  all  matches with at least 60%
       similarity, and collect results in a blast-like tab-separated format. Accept an unlimited number of  hits
       (--maxaccepts  0),  and  compare  each  query  to  all  other  sequences,  including  unlikely candidates
       (--maxrejects 0):

              vsearch --usearch_global queries.fas --db queries.fas --self --id 0.6  --blast6out  results.blast6
              --maxaccepts 0 --maxrejects 0

       Shuffle  the  input  fasta file (change the order of sequences) in a repeatable fashion (fixed seed), and
       write unwrapped fasta sequences to the output file:

              vsearch --shuffle queries.fas --output queries_shuffled.fas --randseed 13 --fasta_width 0

       Sort  by  decreasing  abundance  the  sequences  contained  in  queries.fas  (using  the   'size=integer'
       information),  relabel  the  sequences  while preserving the abundance information (with --sizeout), keep
       only sequences with an abundance equal to or greater than 2:

              vsearch  --sortbysize  queries.fas  --output  queries_sorted.fas  --relabel   sampleA_   --sizeout
              --minsize 2

AUTHORS

       Implementation by Torbjørn Rognes and Tomás Flouri, documentation by Frédéric Mahé.

CITATION

       Rognes  T,  Flouri  T,  Nichols  B,  Quince  C,  Mahé F. (2016) VSEARCH: a versatile open source tool for
       metagenomics.  PeerJ 4:e2584 doi: 10.7717/peerj.2584 <https://doi.org/10.7717/peerj.2584>

REPORTING BUGS

       Submit suggestions and bug-reports at <https://github.com/torognes/vsearch/issues>, send a  pull  request
       on  <https://github.com/torognes/vsearch>, or compose a friendly or curmudgeont e-mail to Torbjørn Rognes
       <torognes@ifi.uio.no>.

AVAILABILITY

       Source code and binaries are available at <https://github.com/torognes/vsearch>.

       Copyright (C) 2014-2017, Torbjørn Rognes, Frédéric Mahé and Tomás Flouri

       All rights reserved.

       Contact: Torbjørn Rognes <torognes@ifi.uio.no>, Department of Informatics, University  of  Oslo,  PO  Box
       1080 Blindern, NO-0316 Oslo, Norway

       This  software  is  dual-licensed  and  available under a choice of one of two licenses, either under the
       terms of the GNU General Public License version 3 or the BSD 2-Clause License.

       GNU General Public License version 3

       This program is free software: you can redistribute it and/or modify  it  under  the  terms  of  the  GNU
       General  Public License as published by the Free Software Foundation, either version 3 of the License, or
       (at your option) any later version.

       This program is distributed in the hope that it will be useful, but WITHOUT ANY  WARRANTY;  without  even
       the  implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
       License for more details.

       You should have received a copy of the GNU General Public License along with this program.  If  not,  see
       <http://www.gnu.org/licenses/>.

       The BSD 2-Clause License

       Redistribution  and  use in source and binary forms, with or without modification, are permitted provided
       that the following conditions are met:

       1. Redistributions of source code must retain the above copyright notice, this list of conditions and the
       following disclaimer.

       2.  Redistributions in binary form must reproduce the above copyright notice, this list of conditions and
       the following disclaimer in the documentation and/or other materials provided with the distribution.

       THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY  EXPRESS  OR  IMPLIED
       WARRANTIES,  INCLUDING,  BUT  NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
       PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE  LIABLE  FOR
       ANY  DIRECT,  INDIRECT,  INCIDENTAL,  SPECIAL,  EXEMPLARY,  OR  CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
       LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF  USE,  DATA,  OR  PROFITS;  OR  BUSINESS
       INTERRUPTION)  HOWEVER  CAUSED  AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR
       TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE  OF  THIS  SOFTWARE,  EVEN  IF
       ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

       We would like to thank the authors of the following projects for making their source code available:

              - vsearch  includes  code  from  Google's  CityHash  project  by  Geoff Pike and Jyrki Alakuijala,
                providing some excellent hash functions available under a MIT license.

              - vsearch includes code derived from Tatusov and Lipman's DUST  program  that  is  in  the  public
                domain.

              - vsearch  includes  public  domain  code  written by Alexander Peslyak for the MD5 message digest
                algorithm.

              - vsearch includes public domain code written by Steve Reid and others for the SHA1 message digest
                algorithm.

              - vsearch  binaries  may  include  code from the zlib library, copyright Jean-Loup Gailly and Mark
                Adler.

              - vsearch binaries may include code from the bzip2 library, copyright Julian R. Seward.

SEE ALSO

       swipe, an extremely fast pairwise  local  (Smith-Waterman)  database  search  tool  by  Torbjørn  Rognes,
       available at <https://github.com/torognes/swipe>.

       swarm,  a fast and accurate amplicon clustering method by Frédéric Mahé and Torbjørn Rognes, available at
       <https://github.com/torognes/swarm>.

VERSION HISTORY

       New features and important modifications of vsearch (short  lived  or  minor  bug  releases  may  not  be
       mentioned):

              v1.0.0 released November 28th, 2014
                     First public release.

              v1.0.1 released December 1st, 2014
                     Bug  fixes  (sortbysize,  semicolon  after  size  annotation  in headers) and minor changes
                     (labels as secondary sort key for most sorts, treat T and U as identical for dereplication,
                     only output size in --dbmatched file if --sizeout specified).

              v1.0.2 released December 6th, 2014
                     Bug fixes (ssse3/sse4.1 requirement, memory leak).

              v1.0.3 released December 6th, 2014
                     Bug fix (now writes help to stdout instead of stderr).

              v1.0.4 released December 8th, 2014
                     Added  --allpairs_global  option.  Reduce memory requirements slightly and eliminate memory
                     leaks.

              v1.0.5 released December 9th, 2014
                     Fixes a minor bug with --allpairs_global and --acceptall options.

              v1.0.6 released December 14th, 2014
                     Fixes a memory allocation bug in chimera detection (--uchime_ref option).

              v1.0.7 released December 19th, 2014
                     Fixes a bug in the output from chimera detection with the --uchimeout option.

              v1.0.8 released January 22nd, 2015
                     Introduces several changes and bug fixes:

                     - a new linear memory aligner for alignment of sequences longer than 5,000 nucleotides,

                     - a new  --cluster_size  command  that  sorts  sequences  by  decreasing  abundance  before
                       clustering,

                     - meaning of userfields qlo, qhi, tlo, thi changed for compatibility with usearch,

                     - new userfields qilo, qihi, tilo, tihi give alignment coordinates ignoring terminal gaps,

                     - in --uc output files, a perfect alignment is indicated with a '=' sign,

                     - the  option  --cluster_fast  now sorts sequences by decreasing length, then by decreasing
                       abundance and finally by sequence identifier,

                     - default --maxseqlength value set to 50,000 nucleotides,

                     - fix for bug in alignment in rare cases,

                     - fix for lack of detection of under- or overflow in SIMD aligner.

              v1.0.9 released January 22nd, 2015
                     Fixes a bug in the function sorting sequences by decreasing abundance (--sortbysize).

              v1.0.10 released January 23rd, 2015
                     Fixes a bug where the --sizein option was ignored  and  always  treated  as  on,  affecting
                     clustering and dereplication commands.

              v1.0.11 released February 5th, 2015
                     Introduces  the  possibility  to  output  results  in  SAM format (for clustering, pairwise
                     alignment and searching).

              v1.0.12 released February 6th, 2015
                     Temporarily fixes a problem with long headers in FASTA files.

              v1.0.13 released February 17th, 2015
                     Fix a memory allocation problem  when  computing  multiple  sequence  alignments  with  the
                     --msaout  and  --consout options, as well as a memory leak.  Also increased line buffer for
                     reading FASTA files to 4MB.

              v1.0.14 released February 17th, 2015
                     Fix a bug where the multiple alignment and consensus  sequence  computed  after  clustering
                     ignored  the  strand of the sequences. Also decreased size of line buffer for reading FASTA
                     files to 1MB again due to excessive stack memory usage.

              v1.0.15 released February 18th, 2015
                     Fix bug in calculation of identity metric between sequences when using the  MBL  definition
                     (--iddef 3).

              v1.0.16 released February 19th, 2015
                     Integrated patches from Debian for increased compatibility with various architectures.

              v1.1.0 released February 20th, 2015
                     Added  the  --quiet  option to suppress all output to stdout and stderr except for warnings
                     and fatal errors. Added the --log option to write messages to a log file.

              v1.1.1 released February 20th, 2015
                     Added info about --log and --quiet options to help text.

              v1.1.2 released March 18th, 2015
                     Fix bug with large datasets. Fix format of help info.

              v1.1.3 released March 18th, 2015
                     Fix more bugs with large datasets.

              v1.2.0-1.2.19 released July 6th to September 8th, 2015
                     Several new commands and options added. Bugs fixed. Documentation updated.

              v1.3.0 released September 9th, 2015
                     Changed to autotools build system.

              v1.3.1 released September 14th, 2015
                     Several new commands and options. Bug fixes.

              v1.3.2 released September 15th, 2015
                     Fixed memory leaks. Added '-h' shortcut for help. Removed extra 'v' in version number.

              v1.3.3 released September 15th, 2015
                     Fixed bug in hexadecimal digits of MD5 and SHA1 digests. Added --samheader option.

              v1.3.4 released September 16th, 2015
                     Fixed compilation problems with zlib and bzip2lib.

              v1.3.5 released September 17th, 2015
                     Minor configuration/makefile changes to compile to native CPU and simplify makefile.

              v1.4.0 released September 25th, 2015
                     Added --sizeorder option.

              v1.4.1 released September 29th, 2015
                     Inserted public domain MD5 and SHA1 code to eliminate  dependency  on  crypto  and  openssl
                     libraries and their licensing issues.

              v1.4.2 released October 2nd, 2015
                     Dynamic  loading  of  libraries  for  reading gzip and bzip2 compressed files if available.
                     Circumvention of missing gzoffset function in zlib 1.2.3 and earlier.

              v1.4.3 released October 3rd, 2015
                     Fix a bug with determining amount of memory on some versions of Apple OS X.

              v1.4.4 released October 3rd, 2015
                     Remove debug message.

              v1.4.5 released October 6th, 2015
                     Fix memory allocation bug when reading long FASTA sequences.

              v1.4.6 released October 6th, 2015
                     Fix subtle bug in SIMD alignment code that reduced accuracy.

              v1.4.7 released October 7th, 2015
                     Fixes a problem with searching for or  clustering  sequences  with  repeats.  In  this  new
                     version, vsearch looks at all words occurring at least once in the sequences in the initial
                     step. Previously only words occurring exactly once were considered.  In  addition,  vsearch
                     now  requires  at  least  10  words  to  be shared by the sequences, previously only 6 were
                     required. If the query contains less than 10 words, all words must be present for a  match.
                     This  change  seems  to  lead to slightly reduced recall, but somewhat increased precision,
                     ending up with slightly improved overall accuracy.

              v1.5.0 released October 7th, 2015
                     This version introduces the new option --minwordmatches that allows the user to specify the
                     minimum  number  of  matching  unique  words  before  a sequence is considered further. New
                     default values for different word  lengths  are  also  set.  The  minimum  word  length  is
                     increased to 7.

              v1.6.0 released October 9th, 2015
                     This  version  adds the relabeling options (--relabel, --relabel_md5 and --relabel_sha1) to
                     the shuffle command. It also adds the --xsize  option  to  the  clustering,  dereplication,
                     shuffling and sorting commands.

              v1.6.1 released October 14th, 2015
                     Fix bugs and update manual and help text regarding relabelling. Add all relabelling options
                     to the subsampling command. Add the --xsize option to chimera detection, dereplication  and
                     fastq filtering commands. Refactoring of code.

              v1.7.0 released October 14th, 2015
                     Add --relabel_keep option.

              v1.8.0 released October 19th, 2015
                     Added  --search_exact, --fastx_mask and --fastq_convert commands.  Changed most commands to
                     read  FASTQ  input  files  as  well  as  FASTA   files.    Modified   --fastx_revcomp   and
                     --fastx_subsample to write FASTQ files.

              v1.8.1 released November 2nd, 2015
                     Fixes for compatibility with QIIME and older OS X versions.

              v1.9.0 released November 12th, 2015
                     Added  the  --fastq_mergepairs  command  and  associated options. This command has not been
                     tested well yet. Included additional files to avoid dependency of autoconf for compilation.
                     Fixed an error where identifiers in fasta headers where not truncated at tabs, just spaces.
                     Fixed a bug in detection of the file format (FASTA/FASTQ) of a gzip compressed input file.

              v1.9.1 released November 13th, 2015
                     Fixed memory leak and a bug in score computation in --fastq_mergepairs, and improved speed.

              v1.9.2 released November 17th, 2015
                     Fixed a bug in the computation of some values with --fastq_stats.

              v1.9.3 released November 19th, 2015
                     Workaround for missing x86intrin.h with old compilers.

              v1.9.4 released December 3rd, 2015
                     Fixed incrementation of counter when relabeling dereplicated sequences.

              v1.9.5 released December 3rd, 2015
                     Fixed bug resulting in inferior chimera detection performance.

              v1.9.6 released January 8th, 2016
                     Fixed bug in aligned sequences  produced  with  --fastapairs  and  --userout  (qrow,  trow)
                     options.

              v1.9.7 released January 12th, 2016
                     Masking  behavior  is  changed  somewhat  to  keep  the  letter case of the input sequences
                     unchanged when no masking is performed.  Masking  is  now  performed  also  during  chimera
                     detection. Documentation updated.

              v1.9.8 released January 22nd, 2016
                     Fixed  bug  causing  segfault  when  chimera  detection  is  performed  on  extremely short
                     sequences.

              v1.9.9 released January 22nd, 2016
                     Adjusted default minimum number of word matches during searches for improved performance.

              v1.9.10 released January 25th, 2016
                     Fixed bug related to masking and lower case database sequences.

              v1.10.0 released February 11th, 2016
                     Parallelized and improved merging of paired-end reads and adjusted some  defaults.  Removed
                     progress  indicator  when  stderr  is  not a terminal. Added --fasta_score option to report
                     chimera scores in FASTA files. Added  --rereplicate  and  --fastq_eestats  commands.  Fixed
                     typos. Added relabelling to files produced with --consout and --profile options.

              v1.10.1 released February 23rd, 2016
                     Fixed  a bug affecting the --fastq_mergepairs command causing FASTQ headers to be truncated
                     at first space (despite the bug fix release 1.9.0 of November 12th, 2015). Full headers are
                     now included in the output (no matter if --notrunclabels is in effect or not).

              v1.10.2 released March 18th, 2016
                     Fixed  a bug causing a segmentation fault when running --usearch_global with an empty query
                     sequence. Also fixed a bug causing imperfect alignments to be reported  with  an  alignment
                     string  of  '='  in  uc output files. Fixed typos in man file. Fixed fasta/fastq processing
                     code regarding presence or absence of compression library header files.

              v1.11.1 released April 13th, 2016
                     Added strand information in  UC  file  for  --derep_fulllength  and  --derep_prefix.  Added
                     expected   errors   (ee)   to   header   of  FASTA  files  specified  with  --fastaout  and
                     --fastaout_discarded when --eeout or --fastq_eeout option is in effect for fastq_filter and
                     fastq_mergepairs. The options --eeout and --fastq_eeout are now equivalent.

              v1.11.2 released June 21st, 2016
                     Two  bugs  were  fixed.  The  first issue was related to the --query_cov option that used a
                     different coverage definition than the qcov userfield. The coverage is now defined  as  the
                     fraction  of  the  whole query sequence length that is aligned with matching or mismatching
                     residues in the target. All gaps are ignored. The other issue was related to the  consensus
                     sequences  produced  during  clustering  when  only  N's  were  present  in some positions.
                     Previously these would be converted to A's in the consensus. The behaviour  is  changed  so
                     that N's are produced in the consensus, and it should now be more compatible with usearch.

              v2.0.0 released June 24th, 2016
                     This   major  new  version  supports  reading  from  pipes.  Two  new  options  are  added:
                     --gzip_decompress and --bzip2_decompress. One of these options must be specified if reading
                     compressed  input  from  a pipe, but are not required when reading from ordinary files. The
                     vsearch header that was previously written to stdout is now written to stderr. This enables
                     piping  of  results  for further processing. The file name '-' now represent standard input
                     (/dev/stdin) or standard output (/dev/stdout) when reading or writing files,  respectively.
                     Code for reading FASTA and FASTQ files has been refactored.

              v2.0.1 released June 30th, 2016
                     Avoid segmentation fault when masking very long sequences.

              v2.0.2 released July 5th, 2016
                     Avoid warnings when compiling with GCC 6.

              v2.0.3 released August 2nd, 2016
                     Fixed bad compiler options resulting in Illegal instruction errors when running precompiled
                     binaries.

              v2.0.4 released September 1st, 2016
                     Improved error message for bad FASTQ quality values. Improved manual.

              v2.0.5 released September 9th, 2016
                     Add options --fastaout_discarded and --fastqout_discarded  to  output  discarded  sequences
                     from subsampling to separate files. Updated manual.

              v2.1.0 released September 16th, 2016
                     New   command:   --fastx_filter.   New   options:  --fastq_maxlen,  --fastq_truncee.  Allow
                     --minwordmatches down to 3.

              v2.1.1 released September 23rd, 2016
                     Fixed bugs in output to UC-files. Improved help text and manual.

              v2.1.2 released September 28th, 2016
                     Fixed incorrect abundance output from fastx_filter and fastq_filter when relabelling.

              v2.2.0 released October 7th, 2016
                     Added OTU table generation options --biomout, --mothur_shared_out and  --otutabout  to  the
                     clustering and searching commands.

              v2.3.0 released October 10th, 2016
                     Allowed zero-length sequences in FASTA and FASTQ files. Added --fastq_trunclen_keep option.
                     Fixed bug with output of OTU tables to pipes.

              v2.3.1 released November 16th, 2016
                     Fixed bug where --minwordmatches 0 was interpreted as the default minimum word matches  for
                     the  given  word  length  instead of zero. When used in combination with --maxaccepts 0 and
                     --maxrejects 0 it will allow complete bypass of kmer-based heuristics.

              v2.3.2 released November 18th, 2016
                     Fixed bug where vsearch reported the ordinal number of the target sequence instead  of  the
                     cluster  number  in  column 2 on H-lines in the uc output file after clustering. For search
                     and alignment commands both usearch and vsearch reports the target sequence number here.

              v2.3.3 released December 5th, 2016
                     A minor speed improvement.

              v2.3.4 released December 9th, 2016
                     Fixed bug in output of sequence profiles and updated documentation.

              v2.4.0 released February 8th, 2017
                     Added support for Linux on  Power8  systems  (ppc64le)  and  Windows  on  x86_64.  Improved
                     detection  of  pipes  when  reading FASTA and FASTQ files. Corrected option for specifiying
                     output from fastq_eestats command in help text.

              v2.4.1 released March 1st, 2017
                     Fixed an overflow bug in fastq_stats and fastq_eestats affecting  analysis  of  very  large
                     FASTQ files. Fixed maximum memory usage reporting on Windows.

              v2.4.2 released March 10th, 2017
                     Default  value  for  fastq_minovlen  increased  to  16 in accordance with help text and for
                     compatibility with usearch. Minor changes for improved accuracy of paired-end read merging.

              v2.4.3 released April 6th, 2017
                     Fixed bug with progress  bar  for  shuffling.  Fixed  missing  N-lines  in  UC  files  with
                     usearch_global,  search_exact  and  allpairs_global  when the output_no_hits option was not
                     specified.

              v2.4.4 released August 28th, 2017
                     Fixed a few minor bugs, improved error messages and updated documentation.

              v2.5.0 released October 5th, 2017
                     Support  for  UDB  database  files.   New   commands:   fastq_stripright,   fastq_eestats2,
                     makeudb_usearch,  udb2fasta,  udbinfo,  and  udbstats. New general option: no_progress. New
                     options minsize and maxsize to fastx_filter. Minor bug fixes,  error  message  improvements
                     and documentation updates.

              v2.5.1 released October 25th, 2017
                     Fixed  bug  with  bad  default  value  of  1  instead of 32 for minseqlength when using the
                     makeudb_usearch command.

              v2.5.2 released October 30th, 2017
                     Fixed bug with where '-' as an argument to the fastq_eestats2 option was treated  literally
                     instead of equivalent to stdin.

              v2.6.0 released November 10th, 2017
                     Rewritten  paired-end  reads  merger  with  improved  accuracy. Decreased default value for
                     fastq_minovlen option from 16 to 10. The default value for  the  fastq_maxdiffs  option  is
                     increased  from  5  to  10. There are now other more important restrictions that will avoid
                     merging reads that cannot be reliably aligned.

              v2.6.1 released December 8th, 2017
                     Improved parallelisation of paired end reads merging.

              v2.6.2 released December 18th, 2017
                     Fixed option xsize that was partially inactive for commands uchime_denovo, uchime_ref,  and
                     fastx_filter.

              v2.7.0 released February 13th, 2018
                     Added  commands  cluster_unoise,  uchime2_denovo  and  uchime3_denovo contributed by Davide
                     Albanese based on Robert Edgar's papers. Refactored fasta and fastq print functions as well
                     as code for extraction of abundance and other attributes from the headers.

              v2.7.1 released February 16th, 2018
                     Fix several bugs on Windows related to large files, use of "-" as a file name to mean stdin
                     or stdout, alignment errors, missed kmers and corrupted UDB files. Added  documentation  of
                     UDB-related commands.