NAME

       vsearch  —  chimera  detection,  clustering,  dereplication,  masking, pairwise alignment,
       searching, shuffling and sorting of amplicons from metagenomic projects.

SYNOPSIS

       Chimera detection:
              vsearch --uchime_denovo fastafile (--chimeras | --nonchimeras | --uchimealns |
              --uchimeout) outputfile [options]

              vsearch --uchime_ref fastafile (--chimeras | --nonchimeras | --uchimealns |
              --uchimeout) outputfile --db fastafile [options]

       Clustering:
              vsearch (--cluster_fast | --cluster_size | --cluster_smallmem) fastafile (--alnout
              | --blast6out | --centroids | --clusters | --msaout | --samout | --uc | --userout)
              outputfile --id real [options]

       Dereplication:
              vsearch --derep_fulllength fastafile (--output | --uc) outputfile [options]

       Masking:
              vsearch --maskfasta fastafile --output outputfile [options]

       Pairwise alignment:
              vsearch --allpairs_global fastafile (--alnout | --blast6out | --matched |
              --notmatched | --samout | --uc | --userout) outputfile (--acceptall | --id real)
              [options]

       Searching:
              vsearch --usearch_global fastafile --db fastafile (--alnout | --blast6out |
              --samout | --uc | --userout) outputfile --id real [options]

       Shuffling:
              vsearch --shuffle fastafile --output outputfile [options]

       Sorting:
              vsearch (--sortbylength | --sortbysize) fastafile --output outputfile [options]

DESCRIPTION

Environmental or clinical molecular diversity studies generate large volumes of amplicons
(e.g. SSU-rRNA sequences) that need to be checked for chimeras, dereplicated, masked,
sorted, searched, clustered or compared to reference sequences. The aim of vsearch is to
offer an all-in-one open source tool to perform these tasks, using optimized algorithm
implementations and harvesting the full potential of modern computers, thus providing fast
and accurate data processing.

Comparing nucleotide sequences is at the core of vsearch. To speed up comparisons, vsearch
uses an extremely fast implementation of the Needleman-Wunsch algorithm, making use
of the Streaming SIMD Extensions (SSE2) of modern x86-64 CPUs. If SSE2 instructions are
not available, vsearch exits with an error message. For comparisons involving sequences
longer than 5,000 nucleotides, vsearch uses a slower alignment method with smaller memory
requirements.

Input
vsearch input is a fasta file containing one or several nucleotide sequences. For each
sequence, the sequence identifier is defined as the string comprised between the ">"
symbol and the first space, or the end of the line, whichever comes first. Additionally,
if the line starts with ">[;]size=integer;label", contains ">label;size=integer;label" or
ends with ">label;size=integer[;]", vsearch will remove the pattern [;]size=integer[;]
from the header and interpret integer as the number of occurrences (or abundance) of the
sequence in the study. That abundance information is used or created during chimera
detection, clustering, dereplication, sorting and searching.
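
The header rule above can be sketched in Python. This is only an illustration of the "[;]size=integer[;]" pattern, not the parser used by vsearch; the function name split_abundance is hypothetical, and keeping a single ";" between two surviving label parts is an assumption.

```python
import re

# Sketch of the "[;]size=integer[;]" rule; not vsearch's actual parser.
_SIZE = re.compile(r"(^|;)size=(\d+)(;|$)")


def split_abundance(header):
    """Return (label, abundance) for one fasta header line.

    The identifier stops at the first space. Abundance is None when the
    header carries no size annotation.
    """
    label = header.lstrip(">").split(" ", 1)[0]
    match = _SIZE.search(label)
    if match is None:
        return label, None
    left, right = label[:match.start()], label[match.end():]
    # Assumption: keep one ";" when the annotation sat between two label parts.
    joined = left + (";" if left and right else "") + right
    return joined, int(match.group(2))
```

For instance, split_abundance(">seq1;size=42;") separates the label "seq1" from the abundance 42.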

The nucleotide sequence is defined as a string of IUPAC symbols (ACGTURYSWKMDBHVN),
starting after the end of the identifier line and ending before the next identifier line,
or the file end. vsearch silently ignores ascii characters 9 to 13, and exits with an
error message if ascii characters 0 to 8, 14 to 31, "." or "-" are present. All other
ascii or non-ascii characters are stripped and complained about in a non-blocking warning
message.
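
As a rough illustration of the character rules above, the following Python sketch sorts each character into one of four outcomes. The names classify and clean are hypothetical, and raising ValueError merely stands in for vsearch exiting with an error message.

```python
IUPAC = set("ACGTURYSWKMDBHVN")


def classify(char):
    """Return 'keep', 'ignore', 'fatal' or 'strip' for one sequence character."""
    code = ord(char)
    if char.upper() in IUPAC:
        return "keep"
    if 9 <= code <= 13:
        return "ignore"   # tabs and line breaks are silently skipped
    if code <= 8 or 14 <= code <= 31 or char in ".-":
        return "fatal"    # vsearch exits with an error message
    return "strip"        # removed and reported in a non-blocking warning


def clean(sequence):
    """Return (kept symbols, stripped characters) for a raw sequence string."""
    kept, stripped = [], []
    for char in sequence:
        verdict = classify(char)
        if verdict == "fatal":
            raise ValueError(f"illegal character {char!r}")
        if verdict == "keep":
            kept.append(char)
        elif verdict == "strip":
            stripped.append(char)
    return "".join(kept), stripped
```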

vsearch operations are case insensitive, except when soft masking is activated. When using
clustering, masking or searching commands, the case is important if soft masking is used.
Soft masking is specified with the options "--dbmask soft" (for searching) or "--qmask
soft" (for searching, clustering and masking). When using soft masking, lower case letters
indicate masked symbols, while upper case letters indicate regular symbols. Masked symbols
are never included in the unique k-mers used in searching. When soft masking is not
activated, all letters are converted to upper case internally and used in result files.

When comparing sequences during chimera detection, dereplication, searching and
clustering, T and U are considered identical, regardless of their case. If two symbols are
not identical, their alignment will result in the negative mismatch score (default -4),
except if one or both of the symbols are ambiguous (RYSWKMDBHVN) in which case the score
is zero. Alignment of two identical ambiguous symbols (e.g. R vs R) also receives a score
of zero.
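
The scoring rule for a single aligned pair of symbols can be written as a small Python function. This is a sketch of the rule as described here, with the default match score (2) taken from the --match option; the function name pair_score is hypothetical.

```python
AMBIGUOUS = set("RYSWKMDBHVN")


def pair_score(a, b, match=2, mismatch=-4):
    """Score one aligned pair of nucleotide symbols."""
    # Case is ignored and U is treated as T.
    a = a.upper().replace("U", "T")
    b = b.upper().replace("U", "T")
    # Any ambiguous symbol, even against itself, scores zero.
    if a in AMBIGUOUS or b in AMBIGUOUS:
        return 0
    return match if a == b else mismatch
```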

vsearch can be compiled to accept compressed fasta files as input (gz and bzip2
formats). On the other hand, special files like pipes, named pipes, or sockets cannot be
used as input. To present a progress indicator, vsearch needs to seek to the end of
filename to find its length. Consequently, filename must be a regular file, not a stream.

Options
vsearch recognizes a large number of command-line options. For easier navigation, options
are grouped below by theme (chimera detection, clustering, dereplication, masking,
shuffling, sorting, and searching). We start with general options that apply to all
themes.

General options:

--fasta_width positive integer
Fasta files produced by vsearch are wrapped (sequences are written on
lines of integer nucleotides, 80 by default). Set that value to 0 to
eliminate the wrapping.

--help Display a short help and exit.

--log filename
Write messages to the specified log file. Information written includes
program version, amount of memory available, number of cores and command
line options. The start and finish times are also recorded as well as the
elapsed time. The maximum amount of memory consumed is included. The
different commands will usually also write some information about their
results. Fatal, warning and informational messages are all written.

--maxseqlength positive integer
All vsearch operations will discard sequences of length equal to or
greater than integer (50,000 nucleotides by default).

--minseqlength positive integer
All vsearch operations will discard sequences of length smaller than
integer (1 nucleotide by default for sorting or shuffling, 32 nucleotides
for clustering, dereplication or searching).

--notrunclabels
Do not truncate sequence labels at the first space; use the full header
in output files.

--quiet Suppress all output to stdout and stderr except for warnings and fatal
error messages.

--version
Output version information and exit.

Chimera detection options:

Chimera detection is based on a scoring function controlled by five options (--dn,
--mindiffs, --mindiv, --minh, --xn). Sequences are first sorted by decreasing
abundance (if available), and compared on their plus strand only (case
insensitive).

In de novo mode, the input fasta file should carry abundance annotations (pattern
[;]size=integer[;] in the fasta header). The input order influences the chimera
detection, so we recommend sorting sequences by decreasing abundance (default of
the --derep_fulllength command). If your sequence set needs to be sorted, please see
the --sortbysize command in the sorting section.

--abskew real
When using --uchime_denovo, the abundance skew is used to distinguish in a
3-way alignment which sequence is the chimera and which are the parents.
The assumption is that chimeras appear later in the PCR amplification
process and are therefore less abundant than their parents. The default
value is 2.0, which means that the parents should be at least 2 times more
abundant than their chimera. Any positive value greater than 1.0 can be
used.

--alignwidth positive integer
Width of the 3-way alignments in --uchimealns output. The default value is
80. Set to 0 to eliminate wrapping.

--chimeras filename
Output chimeric sequences to filename, in fasta format. Output order may
vary when using multiple threads.

--db filename
When using --uchime_ref, detect chimeras using the fasta-formatted
reference sequences contained in filename. Reference sequences are assumed
to be chimera-free. Chimeras will not be detected if their parents (or
sufficiently close relatives) are not present in the database.

--dn real
No vote pseudo-count (parameter n in the chimera scoring function)
(default value is 1.4).

--mindiffs positive integer
Minimum number of differences per segment (default value is 3).

--mindiv real
Minimum divergence from closest parent (default value is 0.8).

--minh real
Minimum score (h). Increasing this value tends to reduce the number of
false positives and to decrease sensitivity. Default value is 0.28, and
values ranging from 0.0 to 1.0 included are accepted.

--nonchimeras filename
Output non-chimeric sequences to filename, in fasta format. Output order
may vary when using multiple threads.

--self When using --uchime_ref, ignore a reference sequence when its label
matches the label of the query sequence (useful to estimate false-positive
rate in reference sequences).

--selfid When using --uchime_ref, ignore a reference sequence when its nucleotide
sequence is strictly identical with the query sequence.

--threads positive integer
Number of computation threads to use (1 to 256) with --uchime_ref. The
number of threads should be less than or equal to the number of available
CPU cores. The default is to use all available resources and to launch one
thread per logical core.

--uchime_denovo filename
Detect chimeras present in the fasta-formatted filename, without external
references (i.e. de novo). Automatically sort the sequences in filename by
decreasing abundance beforehand (see the sorting section for details).
Multithreading is not supported.

--uchime_ref filename
Detect chimeras present in the fasta-formatted filename by comparing them
with reference sequences (option --db). Multithreading is supported.

--uchimealns filename
Write the 3-way global alignments (parentA, parentB, chimera) to filename
using a human-readable format. Use --alignwidth to modify alignment
length. Output order may vary when using multiple threads.

--uchimeout filename
Write chimera detection results to filename using the uchime tab-separated
format of 18 fields (see the list below). Use --uchimeout5 to use a format
compatible with usearch v5 and earlier versions. Row output order may
vary when using multiple threads.

1. score: higher score means a more likely chimeric alignment.

2. Q: query sequence label.

3. A: parent A sequence label.

4. B: parent B sequence label.

5. T: top parent sequence label (i.e. parent most similar to the
query). That field is removed when using --uchimeout5.

6. idQM: percentage of similarity of query (Q) and model (M)
constructed as a part of parent A and a part of parent B.

7. idQA: percentage of similarity of query (Q) and parent A.

8. idQB: percentage of similarity of query (Q) and parent B.

9. idAB: percentage of similarity of parent A and parent B.

10. idQT: percentage of similarity of query (Q) and top parent (T).

11. LY: yes votes in the left part of the model.

12. LN: no votes in the left part of the model.

13. LA: abstain votes in the left part of the model.

14. RY: yes votes in the right part of the model.

15. RN: no votes in the right part of the model.

16. RA: abstain votes in the right part of the model.

17. div: divergence, defined as (idQM - idQT).

18. YN: query is chimeric (Y), or not (N), or is a borderline case
(?).
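
For downstream scripting, the 18 fields listed above can be mapped to names with a few lines of Python. This is a minimal sketch: parse_uchimeout is a hypothetical helper, values are kept as strings because the exact numeric formatting is not specified here, and the sample line in the usage note is made up.

```python
UCHIME_FIELDS = ["score", "Q", "A", "B", "T", "idQM", "idQA", "idQB", "idAB",
                 "idQT", "LY", "LN", "LA", "RY", "RN", "RA", "div", "YN"]


def parse_uchimeout(line, uchimeout5=False):
    """Map one tab-separated --uchimeout row to a dict of field names."""
    # With --uchimeout5 the fifth field (T) is absent, leaving 17 fields.
    names = [n for n in UCHIME_FIELDS if not (uchimeout5 and n == "T")]
    values = line.rstrip("\n").split("\t")
    if len(values) != len(names):
        raise ValueError(f"expected {len(names)} fields, got {len(values)}")
    return dict(zip(names, values))
```

A row such as "\t".join(["0.5", "q", "a", "b", "a"] + ["0"] * 12 + ["Y"]) yields a dict whose "YN" entry is "Y".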

--uchimeout5
When using --uchimeout, write chimera detection results using a tab-
separated format of 17 fields (drop the 5th field of --uchimeout),
compatible with usearch version 5 and earlier versions.

--xn real
No vote weight (parameter beta in the scoring function) (default value is
8.0).

Clustering options:

vsearch implements a single-pass, greedy star-clustering algorithm, similar to the
algorithms implemented in usearch, DNAclust and sumaclust for example. Important
parameters are the global clustering threshold (--id) and the pairwise identity
definition (--iddef).

--centroids filename
Output cluster centroid sequences to filename, in fasta format. The
centroid is the sequence that seeded the cluster (i.e. the first sequence
of the cluster).

--cluster_fast filename
Cluster the fasta sequences in filename, automatically sorting them by
decreasing sequence length beforehand.

--cluster_size filename
Cluster the fasta sequences in filename, automatically sorting them by
decreasing sequence abundance beforehand.

--cluster_smallmem filename
Cluster the fasta sequences in filename without automatically modifying
their order beforehand. Sequences are expected to be sorted by decreasing
sequence length, unless --usersort is used.

--clusters string
Output each cluster to a separate fasta file using the prefix string and a
ticker (0, 1, 2, etc.) to construct the path and filenames.

--consout filename
Output cluster consensus sequences to filename. For each cluster, a
multiple alignment is computed, and a consensus sequence is constructed by
taking the majority symbol (nucleotide or gap) from each column of the
alignment. Columns containing a majority of gaps are skipped, except for
terminal gaps.

--id real
Do not add the target to the cluster if the pairwise identity with the
centroid is lower than real (value ranging from 0.0 to 1.0 included). The
pairwise identity is defined as the number of (matching columns) /
(alignment length - terminal gaps). That definition can be modified by
--iddef.

--iddef 0|1|2|3|4
Change the pairwise identity definition used in --id. Values accepted are:

0. CD-HIT definition: (matching columns) / (shortest sequence
length).

1. edit distance: (matching columns) / (alignment length).

2. edit distance excluding terminal gaps (same as --id).

3. Marine Biological Lab definition counting each extended gap
(internal or terminal) as a single difference: 1.0 -
[(mismatches + gaps)/(longest sequence length)]

4. BLAST definition, equivalent to --iddef 2 in a context of
global pairwise alignment.
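
To make the five definitions concrete, here is a Python sketch that computes each of them from a gapped pairwise alignment. It follows the formulas as worded above and is not vsearch's code; the function names, the use of "-" as gap symbol, and the way terminal gap columns are counted are assumptions.

```python
import re


def _terminal_gap_columns(columns):
    """Count leading and trailing alignment columns that contain a gap."""
    left = 0
    while left < len(columns) and "-" in columns[left]:
        left += 1
    right = 0
    while right < len(columns) - left and "-" in columns[-1 - right]:
        right += 1
    return left + right


def identity(aligned_q, aligned_t, iddef=2):
    """Pairwise identity (0.0 to 1.0) of two equal-length gapped strings."""
    q = aligned_q.upper().replace("U", "T")
    t = aligned_t.upper().replace("U", "T")
    if len(q) != len(t):
        raise ValueError("aligned strings must have the same length")
    columns = list(zip(q, t))
    matches = sum(1 for a, b in columns if a == b and a != "-")
    mismatches = sum(1 for a, b in columns if a != b and "-" not in (a, b))
    len_q, len_t = len(q.replace("-", "")), len(t.replace("-", ""))
    if iddef == 0:   # CD-HIT: shortest sequence length as denominator
        return matches / min(len_q, len_t)
    if iddef == 1:   # edit distance: full alignment length
        return matches / len(columns)
    if iddef in (2, 4):   # terminal gaps excluded
        return matches / (len(columns) - _terminal_gap_columns(columns))
    if iddef == 3:   # each gap run counts as a single difference
        gap_runs = len(re.findall(r"-+", q)) + len(re.findall(r"-+", t))
        return 1.0 - (mismatches + gap_runs) / max(len_q, len_t)
    raise ValueError("iddef must be 0, 1, 2, 3 or 4")
```

With the made-up alignment "ACGTACGT--" against "ACGAAC-TGG", definitions 0, 2 and 4 give 0.75 and definition 1 gives 0.6.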

--msaout filename
Output a multiple sequence alignment and a consensus sequence for each
cluster to filename, in fasta format. The consensus sequence is
constructed by taking the majority symbol (nucleotide or gap) from each
column of the alignment. Columns containing a majority of gaps are
skipped, except for terminal gaps.

--qmask none|dust|soft
Mask simple repeats and low-complexity regions in sequences using the dust
or the soft algorithms, or do not mask (none). Warning, when using soft
masking, clustering becomes case sensitive. The default is to mask using
dust.

--sizein Take into account the abundance annotations present in the input fasta
file (search for the pattern "[>;]size=integer[;]" in sequence headers).

--sizeout
Add abundance annotations to the output fasta files (add the pattern
";size=integer;" to sequence headers). If --sizein is specified, abundance
annotations are reported to output files, and each cluster centroid
receives a new abundance value corresponding to the total abundance of the
amplicons included in the cluster (--centroids option). If --sizein is not
specified, input abundances are set to 1 for amplicons, and to the number
of amplicons per cluster for centroids.

--strand plus|both
When comparing sequences with the cluster seed, check the plus strand only
(default) or check both strands.

--threads positive integer
Number of computation threads to use (1 to 256). The number of threads
should be less than or equal to the number of available CPU cores. The
default is to use all available resources and to launch one thread per
logical core.

--uc filename
Output clustering results in filename using a uclust-like format. For a
description of the format, see
<http://www.drive5.com/usearch/manual/ucout.html>.

--usersort
When using --cluster_smallmem, allow any sequence input order, not just a
decreasing length ordering.

Most searching options also apply to clustering:
--alnout, --blast6out, --fastapairs, --matched, --notmatched, --maxaccepts,
--maxrejects, --samout, --userout, --userfields, score filtering, gap
penalties, masking (see the Searching section).

Dereplication options:

--derep_fulllength filename
Merge strictly identical sequences contained in filename. Identical
sequences are defined as having the same length and the same string of
nucleotides (case insensitive, T and U are considered the same).
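
The merging rule can be illustrated with a short Python sketch. It is not vsearch's implementation: the function name dereplicate is hypothetical, sequences are reported in their normalized form (uppercase, U replaced by T), and ties in abundance keep input order, all of which are assumptions.

```python
def dereplicate(records):
    """Merge strictly identical sequences.

    records is an iterable of (label, sequence, abundance) tuples. The
    result is a list of (label, sequence, total abundance) sorted by
    decreasing abundance, each group keeping the label of its first member.
    """
    groups = {}
    for label, sequence, abundance in records:
        key = sequence.upper().replace("U", "T")   # case-insensitive, T == U
        if key in groups:
            groups[key][1] += abundance
        else:
            groups[key] = [label, abundance]
    # sorted() is stable, so equally abundant groups keep input order.
    ranked = sorted(groups.items(), key=lambda item: -item[1][1])
    return [(label, seq, total) for seq, (label, total) in ranked]
```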

--maxuniquesize positive integer
Discard sequences with an abundance value greater than integer.

--minuniquesize positive integer
Discard sequences with an abundance value smaller than integer.

--output filename
Write the dereplicated sequences to filename, in fasta format and sorted
by decreasing abundance. Identical sequences receive the header of the
first sequence of their group. If --sizeout is used, the number of
occurrences (i.e. abundance) of each sequence is indicated at the end of
their fasta header using the pattern ";size=integer;".

--sizein Take into account the abundance annotations present in the input fasta
file (search for the pattern "[>;]size=integer[;]" in sequence headers).

--sizeout
Add abundance annotations to the output fasta file (add the pattern
";size=integer;" to sequence headers). If --sizein is specified, each
unique sequence receives a new abundance value corresponding to its total
abundance (sum of the abundances of its occurrences). If --sizein is not
specified, input abundances are set to 1, and each unique sequence
receives a new abundance value corresponding to its number of occurrences
in the input file.

--strand plus|both
When searching for strictly identical sequences, check the plus strand
only (default) or check both strands.

--topn positive integer
Output only the top integer sequences (i.e. the most abundant).

--uc filename
Output dereplication results in filename using a uclust-like format. For a
description of the format, see
<http://www.drive5.com/usearch/manual/ucout.html>. In the context of
dereplication, the option --uc_allhits has no effect on the --uc output.

Masking options:

An input sequence can be composed of lower- or uppercase nucleotides. Lowercase
nucleotides are silently set to uppercase before masking, unless the --qmask soft
option is used. Here are the results of combined masking options --qmask (or
--dbmask for database sequences) and --hardmask, assuming each input sequence
contains both lower and uppercase nucleotides:

qmask   hardmask   action
───────────────────────────────────────────────────────────────────
none    off        no masking, all symbols uppercased
none    on         no masking, all symbols uppercased
dust    off        masked symbols lowercased, others uppercased
dust    on         masked symbols changed to Ns, others uppercased
soft    off        lowercase symbols masked, no case changes
soft    on         lowercase symbols masked and changed to Ns
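
The six rows of this table can be expressed as one Python function. This sketch does not implement the dust algorithm: the positions flagged by dust are passed in by the caller through the hypothetical dust_hits argument, and apply_mask is likewise a hypothetical name.

```python
def apply_mask(sequence, qmask="dust", hardmask=False, dust_hits=frozenset()):
    """Render a sequence according to the qmask/hardmask table.

    dust_hits holds the 0-based positions a dust run would flag; it is
    supplied by the caller because dust itself is not implemented here.
    """
    if qmask == "none":
        return sequence.upper()
    if qmask == "dust":
        masked, base = dust_hits, sequence.upper()
    elif qmask == "soft":
        masked = {i for i, c in enumerate(sequence) if c.islower()}
        base = sequence   # soft masking never changes case
    else:
        raise ValueError("qmask must be none, dust or soft")
    out = []
    for i, symbol in enumerate(base):
        if i in masked:
            out.append("N" if hardmask else symbol.lower())
        else:
            out.append(symbol)
    return "".join(out)
```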

--hardmask
Mask low-complexity regions by replacing them with Ns instead of setting
them to lower case.

--maskfasta filename
Mask simple repeats and low-complexity regions in sequences contained in
filename. The default is to mask using dust (use --qmask to modify that
behavior).

--output filename
Write the masked sequences to filename, in fasta format.

--qmask none|dust|soft
Mask simple repeats and low-complexity regions in sequences using the dust
or the soft algorithms, or do not mask (none). The default is to mask
using dust.

Pairwise alignment options:

The results of the n * (n - 1) / 2 pairwise alignments are written to the result
files specified with --alnout, --blast6out, --fastapairs, --matched, --notmatched,
--samout, --uc or --userout (see Searching section below). Specify either the
--acceptall option to output all pairwise alignments, or specify an identity level
with --id to discard weak alignments. Most other accept/reject options (see
Searching options below) may also be used. Sequences are aligned on their plus
strand only.
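
The n * (n - 1) / 2 figure is simply the number of unordered pairs of distinct sequences. A short Python sketch (all_pairs is a hypothetical helper, not part of vsearch) enumerates them:

```python
from itertools import combinations


def all_pairs(labels):
    """List every unordered pair of distinct sequences, each pair once."""
    return list(combinations(labels, 2))
```

Four sequences therefore produce 4 * 3 / 2 = 6 alignments.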

--acceptall
Write the results of all alignments to output files. This option overrides
all other accept/reject options (including --id).

--allpairs_global filename
Perform optimal global pairwise alignments of all vs. all fasta sequences
contained in filename. This command is multi-threaded.

--id real
Reject the sequence match if the pairwise identity is lower than real
(value ranging from 0.0 to 1.0 included).

Searching options:

--alnout filename
Write pairwise global alignments to filename using a human-readable
format. Use --rowlen to modify alignment length. Output order may vary
when using multiple threads.

--blast6out filename
Write search results to filename using a blast-like tab-separated format
of twelve fields (listed below), with one line per query-target matching
(or lack of matching if --output_no_hits is used). Output order may vary
when using multiple threads. A similar output can be obtained with --userout
filename and --userfields
query+target+id+alnlen+mism+opens+qlo+qhi+tlo+thi+evalue+bits. A complete
list and description is available in the section "Userfields" of this
manual.

1. query: query label.

2. target: target (database sequence) label. The field is set to
"*" if there is no alignment.

3. id: percentage of identity (real value ranging from 0.0 to
100.0). The percentage identity is defined as 100 * (matching
columns) / (alignment length - terminal gaps). See fields id0
to id4 for other definitions.

4. alnlen: length of the query-target alignment (number of
columns). The field is set to 0 if there is no alignment.

5. mism: number of mismatches in the alignment (zero or positive
integer value).

6. opens: number of columns containing a gap opening (zero or
positive integer value).

7. qlo: first nucleotide of the query aligned with the target.
Always equal to 1 if there is an alignment, 0 otherwise.

8. qhi: last nucleotide of the query aligned with the target.
Always equal to the length of the pairwise alignment. The field
is set to 0 if there is no alignment.

9. tlo: first nucleotide of the target aligned with the query.
Always equal to 1 if there is an alignment, 0 otherwise.

10. thi: last nucleotide of the target aligned with the query.
Always equal to the length of the pairwise alignment. The field
is set to 0 if there is no alignment.

11. evalue: expectancy-value (not computed for nucleotide
alignments). Always set to -1.

12. bits: bit score (not computed for nucleotide alignments).
Always set to 0.
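
A row in this format can be read into typed fields with a few lines of Python. This is a sketch only: parse_blast6 is a hypothetical helper, the numeric types are chosen from the field descriptions above, and the sample rows in the usage note are made up.

```python
BLAST6_FIELDS = [("query", str), ("target", str), ("id", float),
                 ("alnlen", int), ("mism", int), ("opens", int),
                 ("qlo", int), ("qhi", int), ("tlo", int), ("thi", int),
                 ("evalue", float), ("bits", float)]


def parse_blast6(line):
    """Map one tab-separated --blast6out row to a dict of typed fields."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(BLAST6_FIELDS):
        raise ValueError(f"expected 12 fields, got {len(values)}")
    return {name: cast(value)
            for (name, cast), value in zip(BLAST6_FIELDS, values)}
```

For example, "q1\tt1\t98.5\t200\t3\t0\t1\t200\t1\t200\t-1\t0" parses to an id of 98.5 and an alnlen of 200, while a no-hit row carries "*" as target and 0 as alnlen.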

--db filename
Compare query sequences (specified with --usearch_global) to the fasta-
formatted target sequences contained in filename, using global pairwise
alignment.

--dbmask none|dust|soft
Mask simple repeats and low-complexity regions in target database
sequences using the dust or the soft algorithms, or do not mask (none).
Warning: when using soft masking, search commands become case sensitive.
The default is to mask using dust.

--dbmatched filename
Write database target sequences matching at least one query sequence to
filename, in fasta format. If the option --sizeout is used, the number of
queries that matched each target sequence is indicated using the pattern
";size=integer;".

--dbnotmatched filename
Write database target sequences not matching query sequences to filename,
in fasta format.

--fastapairs filename
Write pairwise alignments of query and target sequences to filename, in
fasta format.

--fulldp Dummy option for compatibility with usearch. To maximize search
sensitivity, vsearch uses an 8-way 16-bit SIMD vectorized full dynamic
programming algorithm (Needleman-Wunsch), whether or not --fulldp is
specified.

--gapext string
Set penalties for a gap extension. See --gapopen for a complete
description of the penalty declaration system. The default is to
initialize the six gap extending penalties using a penalty of 2 for
extending internal gaps and a penalty of 1 for extending terminal gaps, in
both query and target sequences (i.e. 2I/1E).

--gapopen string
Set penalties for a gap opening. A gap opening can occur in six different
contexts: in the query (Q) or in the target (T) sequence, at the left (L)
or right (R) extremity of the sequence, or inside the sequence (I).
Sequence symbols (Q and T) can be combined with location symbols (L, I,
and R), and numerical values to declare penalties for all possible
contexts: aQL/bQI/cQR/dTL/eTI/fTR, where abcdef are zero or positive
integers, and "/" is used as a separator.
To simplify declarations, the location symbols (L, I, and R) can be
combined, the symbol (E) can be used to treat both extremities (L and R)
equally, and the symbols Q and T can be omitted to treat query and target
sequences equally. For instance, the default is to declare a penalty of 20
for opening internal gaps and a penalty of 2 for opening terminal gaps
(left or right), in both query and target sequences (i.e. 20I/2E). If only
a numerical value is given, without any sequence or location symbol, then
the penalty applies to all gap openings. To forbid gap-opening, an
infinite penalty value can be declared with the symbol "*". To use vsearch
as a semi-global aligner, a null-penalty can be applied to the left (L) or
right (R) gaps.
vsearch always initializes the six gap opening penalties using the default
parameters (20I/2E). Users are then free to declare only the values they
want to modify. The string is scanned from left to right, accepted
symbols are (0123456789/LIREQT*), and later values override previous
values.
Please note that vsearch, in contrast to usearch, only allows integer gap
penalties. Because the lowest gap penalties are 0.5 by default in usearch,
all default scores and gap penalties in vsearch have been doubled to
maintain equivalent penalties and to produce identical alignments.
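
The declaration system can be sketched as a small Python parser. It is an illustration of the rules described above, not vsearch's own parser: the names parse_gap_penalties and DEFAULT_OPEN are hypothetical, letters are accepted in any order after the value, and the infinite penalty "*" is represented by float("inf").

```python
import re

# Defaults for gap opening (20I/2E); keys combine sequence (Q, T)
# and location (L, I, R).
DEFAULT_OPEN = {"QL": 2, "QI": 20, "QR": 2, "TL": 2, "TI": 20, "TR": 2}


def parse_gap_penalties(spec, defaults=None):
    """Apply a declaration such as '20I/2E' on top of the defaults."""
    penalties = dict(DEFAULT_OPEN if defaults is None else defaults)
    for token in spec.split("/"):
        found = re.fullmatch(r"(\d+|\*)([LIREQT]*)", token)
        if found is None:
            raise ValueError(f"cannot parse {token!r}")
        value = float("inf") if found.group(1) == "*" else int(found.group(1))
        symbols = found.group(2)
        # Omitting Q and T means both sequences.
        sequences = [s for s in "QT" if s in symbols] or ["Q", "T"]
        locations = {loc for loc in "LIR" if loc in symbols}
        if "E" in symbols:            # E stands for both extremities
            locations |= {"L", "R"}
        if not locations:             # a bare value applies everywhere
            locations = {"L", "I", "R"}
        for sequence in sequences:
            for location in locations:
                penalties[sequence + location] = value
    return penalties
```

Passing defaults={"QL": 1, "QI": 2, "QR": 1, "TL": 1, "TI": 2, "TR": 1} would mirror the --gapext defaults (2I/1E). Tokens are applied left to right, so in "5/1TR" the later 1TR overrides the 5 for the right end of the target only.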

--hardmask
Mask low-complexity regions by replacing them with Ns instead of setting
them to lower case. For more information, please see the Masking section.

--id real
Reject the sequence match if the pairwise identity is lower than real
(value ranging from 0.0 to 1.0 included). The search process sorts target
sequences by decreasing number of k-mers they have in common with the
query sequence, using that information as a proxy for sequence similarity.
That efficient pre-filtering will also prevent pairwise alignments with
weakly matching targets, as there needs to be at least 6 shared k-mers to
start the pairwise alignment, and at least one out of every 16 k-mers from
the query needs to match the target. Consequently, using values lower than
--id 0.5 is not likely to capture more weakly matching targets. The
pairwise identity is by default defined as the number of (matching
columns) / (alignment length - terminal gaps). That definition can be
modified by --iddef.
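
The pre-filtering step can be pictured with the following Python sketch. It is a loose illustration, not vsearch's implementation: the k-mer length of 8 is an assumption (it is not stated in this section), k-mers are simply counted as distinct substrings, masking is ignored, and the helper names are hypothetical.

```python
def distinct_kmers(sequence, k):
    """Set of distinct k-mers in a sequence (case-insensitive)."""
    seq = sequence.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def worth_aligning(query, target, k=8, min_shared=6):
    """Apply the two thresholds described above.

    At least min_shared k-mers must be shared, and at least one out of
    every 16 query k-mers must be found in the target.
    """
    query_kmers = distinct_kmers(query, k)
    shared = len(query_kmers & distinct_kmers(target, k))
    return shared >= min_shared and shared * 16 >= len(query_kmers)


def rank_targets(query, targets, k=8):
    """Sort targets by decreasing number of k-mers shared with the query."""
    query_kmers = distinct_kmers(query, k)
    return sorted(targets,
                  key=lambda t: -len(query_kmers & distinct_kmers(t, k)))
```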

--iddef 0|1|2|3|4
Change the pairwise identity definition used in --id. Values accepted are:

0. CD-HIT definition: (matching columns) / (shortest sequence
length).

1. edit distance: (matching columns) / (alignment length).

2. edit distance excluding terminal gaps (same as --id).

3. Marine Biological Lab definition counting each extended gap
(internal or terminal) as a single difference: 1.0 -
[(mismatches + gaps)/(longest sequence length)]

4. BLAST definition, equivalent to --iddef 2 in a context of
global pairwise alignment.

The option --userfields accepts the fields id0 to id4, in addition to the
field id, to report the pairwise identity values corresponding to the
different definitions.

--idprefix positive integer
Reject the sequence match if the first integer nucleotides of the target
do not match the query.

--idsuffix positive integer
Reject the sequence match if the last integer nucleotides of the target do
not match the query.

--leftjust
Reject the sequence match if the pairwise alignment begins with gaps.

--match integer
Score assigned to a match (i.e. identical nucleotides) in the pairwise
alignment. The default value is 2.

--matched filename
Write query sequences matching database target sequences to filename, in
fasta format.

--maxaccepts positive integer
Maximum number of hits to accept before stopping the search. The default
value is 1. This option works together with --maxrejects. The search
process sorts target sequences by decreasing number of k-mers they have in
common with the query sequence, using that information as a proxy for
sequence similarity. After pairwise alignments, if the first target
sequence passes the acceptance criteria, it is accepted as best hit and
the search process stops for that query. If --maxaccepts is set to a
higher value, more hits are accepted. If --maxaccepts and --maxrejects are
both set to 0, the complete database is searched.

--maxdiffs positive integer
Reject the sequence match if the alignment contains at least integer
substitutions, insertions or deletions.

--maxgaps positive integer
Reject the sequence match if the alignment contains at least integer
insertions or deletions.

--maxhits positive integer
Maximum number of hits to show once the search is terminated (hits are
sorted by decreasing identity). Unlimited by default. That option applies
to --alnout, --blast6out, --fastapairs, --samout, --uc, or --userout
output files.

--maxid real
Reject the sequence match if the percentage of identity between the two
sequences is greater than real.

--maxqsize positive integer
Reject query sequences with an abundance greater than integer.

--maxqt real
Reject if the query/target sequence length ratio is greater than real.

--maxrejects positive integer
Maximum number of non-matching target sequences to consider before
stopping the search. The default value is 32. This option works together
with --maxaccepts. The search process sorts target sequences by decreasing
number of k-mers they have in common with the query sequence, using that
information as a proxy for sequence similarity. After pairwise alignments,
if none of the first 32 examined target sequences pass the acceptance
criteria, the search process stops for that query (no hit). If
--maxrejects is set to a higher value, more target sequences are
considered. If --maxaccepts and --maxrejects are both set to 0, the
complete database is searched.
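
The interplay of --maxaccepts and --maxrejects can be sketched as a simple loop in Python. This is an illustration only: search is a hypothetical function, the accepts callable stands in for the pairwise alignment plus accept/reject criteria, and treating a single 0 as "no limit" for that option alone is an assumption (only the case where both are 0 is described above).

```python
def search(candidates, accepts, maxaccepts=1, maxrejects=32):
    """Return accepted targets for one query.

    candidates must already be sorted by decreasing number of shared
    k-mers; accepts(target) returns True when the target passes.
    """
    hits, rejects = [], 0
    for target in candidates:
        if accepts(target):
            hits.append(target)
            if maxaccepts and len(hits) >= maxaccepts:
                break    # enough hits: stop searching for this query
        else:
            rejects += 1
            if maxrejects and rejects >= maxrejects:
                break    # too many failures: give up on this query
    return hits
```

With both limits set to 0 the loop never breaks early, so every candidate is examined.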

--maxsizeratio real
Reject if the query/target abundance ratio is greater than real.

--maxsl real
Reject if the shorter/longer sequence length ratio is greater than real.

--maxsubs positive integer
Reject the sequence match if the pairwise alignment contains more than
integer substitutions.

--mid real
Reject the sequence match if the percentage of identity is lower than real
(ignoring all gaps, internal and terminal).

--mincols positive integer
Reject the sequence match if the alignment length is shorter than integer.

--minqt real
Reject if the query/target sequence length ratio is lower than real.

--minsizeratio real
Reject if the query/target abundance ratio is lower than real.

--minsl real
Reject if the shorter/longer sequence length ratio is lower than real.

--mintsize positive integer
Reject target sequences with an abundance lower than integer.

--mismatch integer
Score assigned to a mismatch (i.e. different nucleotides) in the pairwise
alignment. The default value is -4.

--notmatched filename
Write query sequences not matching database target sequences to filename,
in fasta format.

--output_no_hits
Write both matching and non-matching queries to --alnout, --blast6out,
--samout or --userout output files (--uc and --uc_allhits output files
always feature non-matching queries). Non-matching queries are labelled
"No hits" in --alnout files.

--qmask none|dust|soft
Mask simple repeats and low-complexity regions in query sequences using
the dust or the soft algorithms, or do not mask (none). Warning: when
using soft masking, search commands become case sensitive. The default is
to mask using dust.

--query_cov real
Reject if the fraction of the query aligned to the target sequence is
lower than real. The query coverage is computed as (matches + mismatches)
/ query sequence length. Internal or terminal gaps are not taken into
account.
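
As an illustration of the coverage formula, the following Python sketch computes the query and target coverage from the two rows of a pairwise alignment (gaps written as "-"). The function name coverage and the example rows are made up for this sketch; the target coverage used by --target_cov follows the same formula with the target length as denominator.

```python
def coverage(qrow, trow):
    """Return (query coverage, target coverage) as fractions between 0 and 1."""
    # Columns where both rows hold a nucleotide are matches or mismatches.
    aligned = sum(1 for q, t in zip(qrow, trow) if q != "-" and t != "-")
    query_len = len(qrow) - qrow.count("-")
    target_len = len(trow) - trow.count("-")
    return aligned / query_len, aligned / target_len


# 6 matches + 1 mismatch; query length 8, target length 9
query_cov, target_cov = coverage("ACGTACGT--", "ACGAAC-TGG")
```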

--rightjust
Reject the sequence match if the pairwise alignment ends with gaps.

--rowlen positive integer
Width of alignment lines in --alnout output. The default value is 64. Set
to 0 to eliminate wrapping.

--samout filename
Write alignment results to filename in the SAM format. For a description
of the format, see <https://github.com/samtools/hts-specs>. Output order
may vary when using multiple threads.

--self Reject the sequence match if the query and target labels are identical.

--selfid Reject the sequence match if the query and target sequences are strictly
identical.

--sizeout
Add abundance annotations to the output of the option --dbmatched (using
the pattern ";size=integer;"), to report the number of queries that
matched each target.

--strand plus|both
When searching for similar sequences, check the plus strand only (default)
or check both strands.
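
Since tstrand is always "+" and a reverse-strand match is reported with qstrand "-", checking both strands can be pictured as also comparing the reverse complement of the query. A minimal Python sketch, with hypothetical function names, leaving ambiguity codes other than A, C, G, T and U untouched:

```python
# Complement table for upper- and lower-case nucleotides (U is read as T).
COMPLEMENT = str.maketrans("ACGTUacgtu", "TGCAAtgcaa")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def query_orientations(seq, strand="plus"):
    """List the (qstrand, sequence) pairs to compare against the targets."""
    orientations = [("+", seq)]
    if strand == "both":
        orientations.append(("-", reverse_complement(seq)))
    return orientations
```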

--target_cov real
Reject the sequence match if the fraction of the target sequence aligned
to the query sequence is lower than real. The target coverage is computed
as (matches + mismatches) / target sequence length. Internal or terminal
gaps are not taken into account.

--top_hits_only
Output only the hits with the highest percentage of identity with the
query.
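
The effect of --top_hits_only can be pictured with the short Python sketch below. It assumes hits are available as (target label, percentage of identity) pairs and that hits tied for the best identity are all kept; both the data layout and the tie handling are assumptions of this example.

```python
def top_hits_only(hits):
    """Keep only the hits whose identity equals the best identity observed."""
    if not hits:
        return []
    best = max(identity for _, identity in hits)
    return [hit for hit in hits if hit[1] == best]
```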

--uc filename
Output searching results in filename using a uclust-like format. For a
description of the format, see
<http://www.drive5.com/usearch/manual/ucout.html>. Output order may vary
when using multiple threads.

--uc_allhits
When using the --uc option, show all hits, not just the top hit for each
query.

--usearch_global filename
Compare target sequences (--db) to the fasta-formatted query sequences
contained in filename, using global pairwise alignment.

--userfields string
When using --userout, select and order the fields written to the output
file. Fields are separated by "+" (e.g. query+target+id). See the
"Userfields" section for a complete list of fields.

--userout filename
Write user-defined tab-separated output to filename. Select the fields
with the option --userfields. Output order may vary when using multiple
threads. If --userfields is empty or not present, filename is empty.
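
A --userout file can be read back with a few lines of Python, reusing the same "+"-separated string that was passed to --userfields as column names. The helper name read_userout and the sample rows are hypothetical.

```python
import csv
import io


def read_userout(handle, userfields):
    """Parse tab-separated --userout lines into dictionaries keyed by field name."""
    names = userfields.split("+")
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(names, row)) for row in reader]


# Made-up sample corresponding to: --userfields query+target+id
sample = io.StringIO("q1\tt1\t98.5\nq2\tt7\t97.0\n")
records = read_userout(sample, "query+target+id")
```

All values are returned as strings; convert numeric fields such as id with float() as needed.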

--weak_id real
Show hits with percentage of identity of at least real, without
terminating the search. A normal search stops as soon as enough hits are
found (as defined by --maxaccepts, --maxrejects, and --id). As --weak_id
reports weak hits that are not deducted from --maxaccepts, high --id values
can be used, hence preserving both speed and sensitivity. Logically, real
must be smaller than the value indicated by --id.

--wordlength positive integer
Length of words (i.e. k-mers) for database indexing. The range of possible
values goes from 3 to 15, but values near 8 are generally recommended.
Longer words may reduce the sensitivity for weak similarities, but can
increase accuracy. On the other hand, shorter words may increase
sensitivity, but can reduce accuracy. Computation time will generally
increase with shorter words and decrease with longer words. Memory
requirements for a part of the index increase with a factor of 4 each time
word length increases by one nucleotide, and this generally becomes
significant for long words (12 or more). The default value is 8.
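
The factor of 4 comes from the number of distinct words of length k over a four-letter alphabet, which is 4^k. The Python sketch below illustrates this and shows how overlapping words are extracted from a sequence; the function names are hypothetical and the code says nothing about how vsearch actually stores its index.

```python
def word_space(k):
    """Number of distinct nucleotide words of length k (A, C, G, T only)."""
    return 4 ** k


def words(seq, k):
    """List the overlapping words (k-mers) of a sequence, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# Each extra nucleotide multiplies the number of possible words by 4.
for k in (8, 12, 15):
    print(k, word_space(k))
```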

Shuffling options:

--output filename
Write the shuffled sequences to filename, in fasta format.

--seed positive integer
When shuffling sequence order, use integer as seed. A given seed will
always produce the same output order (useful for replicability). Set to 0
to use a pseudo-random seed (default behavior).

--shuffle filename
Pseudo-randomly shuffle the order of sequences contained in filename.

--topn positive integer
Output only the top integer sequences.
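
The interplay of --seed and --topn can be mimicked in Python as sketched below. This only illustrates the idea of a seeded, repeatable shuffle: Python's generator differs from the one used by vsearch, so the resulting order will not match vsearch output for the same seed. The function name shuffle_records is hypothetical.

```python
import random


def shuffle_records(records, seed=0, topn=None):
    """Return a shuffled copy of records; a non-zero seed makes it repeatable."""
    # Seed 0 means "pick a pseudo-random seed", as described for --seed.
    rng = random.Random(seed) if seed else random.Random()
    shuffled = list(records)
    rng.shuffle(shuffled)
    return shuffled if topn is None else shuffled[:topn]
```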

Sorting options:
Fasta entries are sorted by decreasing abundance (--sortbysize) or sequence length
(--sortbylength). To obtain a stable sorting order, ties are broken by decreasing
abundance and then by label in increasing alphanumerical order (--sortbylength), or by
label alone in increasing alphanumerical order (--sortbysize). Label sorting assumes that
all sequences have unique labels. The same applies to the automatic sorting
performed during chimera checking (--uchime_denovo), dereplication
(--derep_fulllength), and clustering (--cluster_fast and --cluster_size).
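
The sort keys described above can be expressed as tuples in Python. In this sketch each record is a hypothetical (label, sequence, abundance) triple, and labels are compared with Python's default string ordering, which may differ in detail from the alphanumerical order used by vsearch.

```python
def sort_by_length(records):
    """Decreasing length, then decreasing abundance, then increasing label."""
    return sorted(records, key=lambda r: (-len(r[1]), -r[2], r[0]))


def sort_by_size(records):
    """Decreasing abundance, then increasing label."""
    return sorted(records, key=lambda r: (-r[2], r[0]))


records = [("b", "ACGT", 3), ("a", "ACGT", 3), ("c", "ACGTA", 1), ("d", "ACGT", 5)]
```

Because every key ends with the label, the result does not depend on the input order as long as labels are unique.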

--maxsize positive integer
When using --sortbysize, discard sequences with an abundance value greater
than integer.

--minsize positive integer
When using --sortbysize, discard sequences with an abundance value smaller
than integer.

--output filename
Write the sorted sequences to filename, in fasta format.

--relabel string
Relabel sequence using the prefix string and a ticker (1, 2, 3, etc.) to
construct the new headers. Use --sizeout to conserve the abundance
annotations.

--sizeout
When using --relabel, write abundance annotations to the output fasta
file (using the pattern ";size=integer;").

--sortbylength filename
Sort by decreasing length the sequences contained in filename. See the
general options --minseqlength and --maxseqlength to eliminate short and
long sequences.

--sortbysize filename
Sort by decreasing abundance the sequences contained in filename (the
pattern "[>;]size=integer[;]" has to be present). See the options
--minsize and --maxsize to eliminate rare and dominant sequences.
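
The abundance annotation can be extracted from a fasta label with a regular expression. The Python sketch below works on the label without its leading ">" and accepts the annotation at the start of the label or after a semicolon, followed by a semicolon or the end of the label; accepting a missing trailing semicolon is an assumption of this example, as is the function name abundance.

```python
import re

# "size=" at the start of the label or after ";", followed by ";" or end of label
SIZE_PATTERN = re.compile(r"(?:^|;)size=(\d+)(?:;|$)")


def abundance(label):
    """Return the abundance found in a fasta label, or None if absent."""
    match = SIZE_PATTERN.search(label)
    return int(match.group(1)) if match else None
```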

--topn positive integer
Output only the top integer sequences (i.e. the longest or the most
abundant).

Userfields (fields accepted by the --userfields option):

aln Print a string of M (match), D (delete, i.e. a gap in the query) and I
(insert, i.e. a gap in the target) representing the pairwise alignment.
Empty field if there is no alignment.

alnlen Print the length of the query-target alignment (number of columns). The
field is set to 0 if there is no alignment.

bits Bit score (not computed for nucleotide alignments). Always set to 0.

caln Compact representation of the pairwise alignment using the CIGAR format
(Compact Idiosyncratic Gapped Alignment Report): M (match), D (deletion)
and I (insertion). Empty field if there is no alignment.
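
The relation between the aln and caln fields is a run-length encoding. The Python sketch below compresses a string of M, D and I operations into count/operation pairs; whether a count of 1 is written or omitted in actual vsearch output is not stated here, so the omit_ones switch is an assumption of this example, and the function name is hypothetical.

```python
from itertools import groupby


def compress_alignment(aln, omit_ones=False):
    """Run-length encode an aln string (e.g. "MMMD") into a CIGAR-like string."""
    parts = []
    for operation, run in groupby(aln):
        count = len(list(run))
        if count == 1 and omit_ones:
            parts.append(operation)
        else:
            parts.append(f"{count}{operation}")
    return "".join(parts)
```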

evalue E-value (not computed for nucleotide alignments). Always set to -1.

exts Number of columns containing a gap extension (zero or positive integer
value).

gaps Number of columns containing a gap (zero or positive integer value).

id Percentage of identity (real value ranging from 0.0 to 100.0). The
percentage identity is defined as 100 * (matching columns) / (alignment
length - terminal gaps).

id0 CD-HIT definition of the percentage of identity (real value ranging from
0.0 to 100.0) using the length of the shortest sequence in the pairwise
alignment as denominator: 100 * (matching columns) / (shortest sequence
length).

id1 The percentage of identity (real value ranging from 0.0 to 100.0) is
defined as the edit distance: 100 * (matching columns) / (alignment
length).

id2 The percentage of identity (real value ranging from 0.0 to 100.0) is
defined as the edit distance, excluding terminal gaps. The field id2 is an
alias for the field id.

id3 Marine Biological Lab definition of the percentage of identity (real value
ranging from 0.0 to 100.0), counting each extended gap (internal or
terminal) as a single difference and using the length of the longest
sequence in the pairwise alignment as denominator: 100 * (1.0 -
[(mismatches + gaps) / (longest sequence length)]).

id4 BLAST definition of the percentage of identity (real value ranging from
0.0 to 100.0), equivalent to --iddef 2 in a context of global pairwise
alignment.
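
To make the differences between these definitions concrete, the Python sketch below computes id0, id1, id2 and id3 from the two rows of a pairwise alignment, following the formulas given above. The function name, the example rows and the way gap runs are counted (one difference per uninterrupted run of gaps in the same row) are assumptions of this illustration, not a description of vsearch internals.

```python
def identity_definitions(qrow, trow):
    """Return id0..id3 (percentages) for two aligned rows using '-' for gaps."""
    columns = list(zip(qrow.upper(), trow.upper()))
    is_gap = [q == "-" or t == "-" for q, t in columns]
    matches = sum(1 for q, t in columns if q == t and q != "-")
    mismatches = sum(1 for (q, t), g in zip(columns, is_gap) if not g and q != t)

    # Terminal gaps: gap columns at the very start or very end of the alignment.
    lead = next((i for i, g in enumerate(is_gap) if not g), len(is_gap))
    trail = next((i for i, g in enumerate(reversed(is_gap)) if not g), 0)

    # Count each uninterrupted run of gaps in one row as a single difference.
    runs, previous = 0, None
    for q, t in columns:
        current = "q" if q == "-" else "t" if t == "-" else None
        if current is not None and current != previous:
            runs += 1
        previous = current

    query_len = len(qrow) - qrow.count("-")
    target_len = len(trow) - trow.count("-")
    return {
        "id0": 100 * matches / min(query_len, target_len),
        "id1": 100 * matches / len(columns),
        "id2": 100 * matches / (len(columns) - lead - trail),
        "id3": 100 * (1 - (mismatches + runs) / max(query_len, target_len)),
    }


# 6 matches, 1 mismatch, 1 internal gap, 2 terminal gap columns
values = identity_definitions("ACGTACGT--", "ACGAAC-TGG")
```

For this example the alignment has 10 columns, the query is 8 nucleotides long and the target 9, so id0 and id2 are both 75.0, id1 is 60.0 and id3 is about 66.7.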

ids Number of matches in the alignment (zero or positive integer value).

mism Number of mismatches in the alignment (zero or positive integer value).

opens Number of columns containing a gap opening (zero or positive integer
value).

pairs Number of columns containing only nucleotides. That value corresponds to
the length of the alignment minus the gap-containing columns (zero or
positive integer value).

pctgaps Number of columns containing gaps expressed as a percentage of the
alignment length (real value ranging from 0.0 to 100.0).

pctpv Percentage of positive columns. When working with nucleotide sequences,
this is equivalent to the percentage of matches (real value ranging from
0.0 to 100.0).

pv Number of positive columns. When working with nucleotide sequences, this
is equivalent to the number of matches (zero or positive integer value).

qcov Percentage of the query sequence that is aligned with the target sequence
(real value ranging from 0.0 to 100.0). The query coverage is computed as
100.0 * (matches + mismatches) / query sequence length. Internal or
terminal gaps are not taken into account. The field is set to 0.0 if there
is no alignment.

qframe Query frame (-3 to +3). That field only concerns coding sequences and is
not computed by vsearch. Always set to +0.

qhi Last nucleotide of the query aligned with the target. Always equal to the
length of the pairwise alignment. The field is set to 0 if there is no
alignment.

qihi Last nucleotide of the query aligned with the target (ignoring terminal
gaps). Nucleotide numbering starts from 1. The field is set to 0 if there
is no alignment.

qilo First nucleotide of the query aligned with the target (ignoring initial
gaps). Nucleotide numbering starts from 1. The field is set to 0 if there
is no alignment.

ql Query sequence length (positive integer value). The field is set to 0 if
there is no alignment.

qlo First nucleotide of the query aligned with the target. Always equal to 1
if there is an alignment, 0 otherwise.
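
The coordinates that ignore terminal gaps (qilo, qihi, and their target counterparts tilo and tihi) can be derived from the two alignment rows, as in the Python sketch below. It reads "aligned" as "placed in a column where the other sequence also has a nucleotide", which is an interpretation made for this example; the function name is hypothetical.

```python
def internal_coordinates(qrow, trow):
    """Return (qilo, qihi, tilo, tihi), 1-based; all 0 if nothing is aligned."""
    qpos = tpos = 0
    qilo = qihi = tilo = tihi = 0
    for q, t in zip(qrow, trow):
        if q != "-":
            qpos += 1
        if t != "-":
            tpos += 1
        if q != "-" and t != "-":
            if qilo == 0:
                # First column holding a nucleotide in both rows
                qilo, tilo = qpos, tpos
            qihi, tihi = qpos, tpos
    return qilo, qihi, tilo, tihi


# The query overhangs the target by two nucleotides on the left and one on the right.
coords = internal_coordinates("ACGTACGT", "--GTACG-")
```

Here the first aligned pair is query position 3 with target position 1, and the last is query position 7 with target position 5.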

qrow Print the sequence of the query segment as seen in the pairwise alignment
(i.e. with gap insertions if need be). Empty field if there is no
alignment.

qs Query segment length. Always equal to query sequence length.

qstrand Query strand orientation (+ or - for nucleotide sequences). Empty field if
there is no alignment.

query Query label.

raw Raw alignment score (negative, null or positive integer value). The score
is the sum of match rewards minus mismatch penalties, gap openings and gap
extensions. The field is set to 0 if there is no alignment.
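
As a rough illustration of how such a score adds up, the Python sketch below combines the counts reported by the ids, mism, opens and exts fields. Only the mismatch score of -4 is taken from this manual; the match reward of 2 and the single gap-opening and gap-extension penalties are placeholder values chosen for the example, and a real scoring scheme may distinguish internal from terminal gaps, so the result should not be expected to reproduce the raw field exactly.

```python
def raw_score(ids, mism, opens, exts,
              match=2, mismatch=-4, gap_open=20, gap_extension=2):
    """Sum match rewards and subtract mismatch and gap penalties.

    match, gap_open and gap_extension are placeholder values (assumptions).
    """
    # mismatch is a negative score, so adding it subtracts the penalty.
    return ids * match + mism * mismatch - opens * gap_open - exts * gap_extension
```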

target Target label. The field is set to "*" if there is no alignment.

tcov Percentage of the target sequence that is aligned with the query sequence
(real value ranging from 0.0 to 100.0). The target coverage is computed as
100.0 * (matches + mismatches) / target sequence length. Internal or
terminal gaps are not taken into account. The field is set to 0.0 if
there is no alignment.

tframe Target frame (-3 to +3). That field only concerns coding sequences and is
not computed by vsearch. Always set to +0.

thi Last nucleotide of the target aligned with the query. Always equal to the
length of the pairwise alignment. The field is set to 0 if there is no
alignment.

tihi Last nucleotide of the target aligned with the query (ignoring terminal
gaps). Nucleotide numbering starts from 1. The field is set to 0 if there
is no alignment.

tilo First nucleotide of the target aligned with the query (ignoring initial
gaps). Nucleotide numbering starts from 1. The field is set to 0 if there
is no alignment.

tl Target sequence length (positive integer value). The field is set to 0 if
there is no alignment.

tlo First nucleotide of the target aligned with the query. Always equal to 1
if there is an alignment, 0 otherwise.

trow Print the sequence of the target segment as seen in the pairwise alignment
(i.e. with gap insertions if need be). Empty field if there is no
alignment.

ts Target segment length. Always equal to target sequence length. The field
is set to 0 if there is no alignment.

tstrand Target strand orientation (+ or - for nucleotide sequences). Always set to
"+", so reverse strand matches have tstrand "+" and qstrand "-". Empty
field if there is no alignment.

DELIBERATE CHANGES

If you are a usearch user, our objective is to make you feel at home. That's why vsearch
was designed to behave like usearch, to some extent. Like any complex software, usearch is
not free from quirks and inconsistencies. We decided not to reproduce some of them, and
for complete transparency, to document here the deliberate changes we made.

During a search with usearch, when using the options --blast6out and --output_no_hits, for
queries with no match the number of fields reported is 13, where it should be 12. This is
corrected in vsearch.

The field raw of the --userfields option is not informative in usearch. This is corrected
in vsearch.

The fields qlo, qhi, tlo, thi now have counterparts (qilo, qihi, tilo, tihi) reporting
alignment coordinates ignoring terminal gaps.

In usearch, when using the option --output_no_hits, queries that receive no match are
reported in the blast6out file, but not in the alignment output file. This is corrected in
vsearch.

vsearch introduces a new --cluster_size command that sorts sequences by decreasing
abundance before clustering.

vsearch reintroduces --iddef alternative pairwise identity definitions that were removed
from usearch.

vsearch extends the --topn option to sorting commands.

vsearch extends the --sizein option to dereplication (--derep_fulllength) and clustering
(--cluster_fast).

vsearch treats T and U as identical nucleotides during dereplication.

vsearch sorting is stabilized by using sequence abundances or sequence labels as
secondary or tertiary keys.
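
Treating T and U as identical during dereplication amounts to normalising sequences before comparing them. A minimal Python sketch of full-length dereplication follows; the function name is hypothetical, and ignoring letter case is an additional assumption of this example.

```python
from collections import Counter


def dereplicate(sequences):
    """Count strictly identical sequences after mapping U to T (case ignored)."""
    return Counter(seq.upper().replace("U", "T") for seq in sequences)


# The first three sequences collapse into a single entry with abundance 3.
counts = dereplicate(["ACGU", "acgt", "ACGT", "AAAA"])
```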

NOVELTIES

       vsearch introduces new options not present in usearch 7. They are described in the
       "Options" section of this manual. Here is a short list:

       - alignwidth (chimera checking)

       - cluster_size (clustering)

       - fasta_width (general option)

       - iddef (clustering, pairwise alignment, searching)

       - maxuniquesize (dereplication)

       - shuffle (shuffling)

EXAMPLES

       Align all sequences in a database with each other and output all pairwise alignments:

              vsearch --allpairs_global database.fas --alnout results.aln --acceptall

       Check for the presence of chimeras (de novo); parents should be at least 1.5 times more
       abundant than chimeras. Output non-chimeric sequences in fasta format (no wrapping):

              vsearch --uchime_denovo queries.fas --nonchimeras results.fas --fasta_width 0 \
                      --abskew 1.5

       Cluster with a 97% similarity threshold, collect cluster centroids, and write cluster
       descriptions using a uclust-like format:

              vsearch --cluster_fast queries.fas --id 0.97 --centroids centroids.fas \
                      --uc clusters.uc

       Dereplicate the sequences contained in queries.fas, take into account the abundance
       information already present, write unwrapped sequences to output with the new abundance
       information, discard all sequences with an abundance of 1:

              vsearch --derep_fulllength queries.fas --output queries_unique.fas --sizein \
                      --sizeout --fasta_width 0 --minuniquesize 2

       Mask simple repeats and low complexity regions in the input fasta file (masked regions are
       lowercased), and write the results to the output file:

              vsearch --maskfasta queries.fas --output queries_masked.fas --qmask dust

       Search queries in a reference database, with an 80% similarity threshold, take terminal
       gaps into account when calculating pairwise similarities:

              vsearch --usearch_global queries.fas --db references.fas --alnout results.aln \
                      --id 0.8 --iddef 1

       Search a sequence dataset against itself (ignore self hits), get all matches with at least
       60% identity, and collect results in a blast-like tab-separated format:

              vsearch --usearch_global queries.fas --db queries.fas --id 0.6 --self \
                      --blast6out results.blast6 --maxaccepts 0 --maxrejects 0

       Shuffle the input fasta file (change the order of sequences) in a repeatable fashion
       (fixed seed), and write unwrapped fasta sequences to the output file:

              vsearch --shuffle queries.fas --output queries_shuffled.fas --seed 13 \
                      --fasta_width 0

       Sort by decreasing abundance the sequences contained in queries.fas (using the
       "size=integer" information), relabel the sequences while preserving the abundance
       information (with --sizeout), keep only sequences with an abundance equal to or greater
       than 2:

              vsearch --sortbysize queries.fas --output queries_sorted.fas --relabel sampleA_ \
                      --sizeout --minsize 2

AUTHORS

       Implementation by Torbjørn Rognes and Tomás Flouri, documentation by Frédéric Mahé.

REPORTING BUGS

       Submit suggestions and bug reports at <https://github.com/torognes/vsearch/issues>, send a
       pull request on <https://github.com/torognes/vsearch>, or compose a friendly or
       curmudgeonly e-mail to Torbjørn Rognes <torognes@ifi.uio.no>.

AVAILABILITY

       Source code and binaries are available at <https://github.com/torognes/vsearch>.

COPYRIGHT

       Copyright (C) 2014, 2015 Torbjørn Rognes, Frédéric Mahé and Tomás Flouri.

       This program is free software: you can redistribute it and/or modify it under the terms of
       the GNU Affero General Public License as published by the Free Software Foundation, either
       version 3 of the License, or any later version.

       This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
       without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
       See the GNU Affero General Public License for more details.

       You should have received a copy of the GNU Affero General Public License along with this
       program. If not, see <http://www.gnu.org/licenses/>.

       vsearch includes code from Google's CityHash project by Geoff Pike and Jyrki Alakuijala,
       providing some excellent hash functions available under an MIT license.

       vsearch includes code derived from Tatusov and Lipman's DUST program that is in the public
       domain.

       vsearch binaries may include code from the zlib library, copyright Jean-Loup Gailly and
       Mark Adler.

       vsearch binaries may include code from the bzip2 library, copyright Julian R. Seward.

VERSION HISTORY

New features and important modifications of vsearch (short lived or minor bug releases may
not be mentioned):

v1.0.0 released November 28th, 2014
First public release.

v1.0.1 released December 1st, 2014
Bug fixes (sortbysize, semicolon after size annotation in headers) and minor
changes (labels as secondary sort key for most sorts, treat T and U as
identical for dereplication, only output size in dbmatched file if sizeout
specified).

v1.0.2 released December 6th, 2014
Bug fixes (ssse3/sse4.1 requirement, memory leak).

v1.0.3 released December 6th, 2014
Bug fix (now writes help to stdout instead of stderr).

v1.0.4 released December 8th, 2014
Added --allpairs_global option. Reduced memory requirements slightly.
Removed memory leaks.

v1.0.5 released December 9th, 2014
Fixes a minor bug with --allpairs_global and --acceptall options.

v1.0.6 released December 14th, 2014
Fixes a memory allocation bug in chimera detection (--uchime_ref option).

v1.0.7 released December 19th, 2014
Fixes a bug in the output from chimera detection with the --uchimeout
option.

v1.0.8 released January 22nd, 2015
Introduces several changes and bug fixes:

- a new linear memory aligner for alignment of sequences longer than 5,000
nucleotides,

- a new --cluster_size command that sorts sequences by decreasing abundance
before clustering,

- meaning of userfields qlo, qhi, tlo, thi changed for compatibility with
usearch,

- new userfields qilo, qihi, tilo, tihi give alignment coordinates ignoring
terminal gaps,

- in --uc output files, a perfect alignment is indicated with a "=" sign,

- the option --cluster_fast will now sort sequences by decreasing length,
then by decreasing abundance and finally by sequence identifier,

- default --maxseqlength value set to 50,000 nucleotides,

- fix for bug in alignment in rare cases,

- fix for lack of detection of under- or overflow in SIMD aligner.

v1.0.9 released January 22nd, 2015
Fixes a bug in the function sorting sequences by decreasing abundance
(--sortbysize).

v1.0.10 released January 23rd, 2015
Fixes a bug where the sizein option was ignored and always treated as on,
affecting clustering and dereplication commands.

v1.0.11 released February 5th, 2015
Introduces the possibility to output results in SAM format (for clustering,
pairwise alignment and searching).

v1.0.12 released February 6th, 2015
Temporarily fixes a problem with long headers in FASTA files.

v1.0.13 released February 17th, 2015
Fix a memory allocation problem when computing multiple sequence alignments
with the --msaout and --consout options, as well as a memory leak. Also
increased line buffer for reading FASTA files to 4MB.

v1.0.14 released February 17th, 2015
Fix a bug where the multiple alignment and consensus sequence computed after
clustering ignored the strand of the sequences. Also decreased size of line
buffer for reading FASTA files to 1MB again due to excessive stack memory
usage.

v1.0.15 released February 18th, 2015
Fix bug in calculation of identity metric between sequences when using the
MBL definition (--iddef 3).

v1.0.16 released February 19th, 2015
Integrated patches from Debian for increased compatibility with various
architectures.

v1.1.0 released February 20th, 2015
Added the --quiet option to suppress all output to stdout and stderr except
for warnings and fatal errors. Added the --log option to write messages to
a log file.

v1.1.1 released February 20th, 2015
Added info about --log and --quiet options to help text.

v1.1.2 released March 18th, 2015
Fix bug with large datasets. Fix format of help info.

v1.1.3 released March 18th, 2015
Fix more bugs with large datasets.