lunar (1) dnaclust.1.gz

Provided by: dnaclust_3-7build2_amd64 bug

NAME

       dnaclust - program to cluster large number of short DNA sequences

SYNOPSIS

       dnaclust {-i | --input} infile [{-s | --similarity} threshold]
                [{-m | --multiple-alignment}] [{-d | --header}] [{-l | --left-gaps-allowed}]
                [{-k | --k-mer-length} length] [{-a | --approximate-filter}] [--no-k-mer-filter]

       dnaclust [{-h | --help} | {-v | --version}]

DESCRIPTION

       This manual page documents briefly the dnaclust program.

       dnaclust is a tool for clustering large number of short DNA sequences. The clusters are
       created in such a way that the "radius" of each clusters is no more than the specified
       threshold.

       The input sequences to be clustered should be in Fasta format. The id of each sequence is
       based on the first word of the seqeunce in the Fasta format. The first word is the prefix
       of the header up to the first occurrence of white space characters in the header. The
       output is written to STDOUT. If you want the output to be written to a file, just redirect
       the output (See Examples).

       The output has two modes: the default clustering mode, and clustering with multiple
       sequence alignment. In the clustering mode (without multiple alignment), each cluster will
       be printed on a separate line. The line will contain the ids of the sequences in the
       cluster. The first id in each line is the cluster center sequence id. Because of the way
       our clusters are constructed, the length of the cluster center sequence is always greater
       than or equal to the length of any of the sequences in the cluster. Please note that since
       usually some clusters contain many sequences, the lines of the output may be very long. If
       you want to visually inspect the output, please use the 'less -S', or an editor that does
       not wrap long lines. The number of clusters can be found using 'wc -l'.

       For more information about the multiple sequence alignment mode see the description of
       --multiple-alignment option.

OPTIONS

       The program follows the usual GNU command line syntax, with long options starting with two
       dashes ('-'). A summary of options is included below.

       --similarity threshold, -s threshold
           The similarity threshold specifies the radius of the clusters created. This parameter
           is a floating point number between 0 and 1. It is calculated based on semi-global
           alignment of a sequence to the cluster center sequence. Namely similarity = 1 - (edit
           distance) / (length of the shorter sequence). The edit distance is the minimum number
           of insertions, deletions or substitutions necessary to aling a sequence to the cluster
           center sequence. Our algorithms are faster when the similarity is higher.

       --k-mer-length length, -k length
           When you use the k-mer filter (which is enabled by default) you can specify the
           maximum length of the k-mers used for filtering.

           The longer k-mer lengths require more memory to store k-mer counts and the filtering
           will be slower. However with the longer k-mer length, the filter will be more specific
           and therefore the sequence alignment search may be faster.

           There is a tradeoff between filtering and searching time. If you do not specify the
           k-mer length a value of log4(median of the lengths of the input sequences) is picked
           automatically. By using this option you can override the default value.

           Keep in mind, however, that longer k-mer lengths would require more memory to store
           the filtering data structures.

       --approximate-filter , -a
           By default the k-mer filter is 100 percent sensitive. This means that in the output
           clustering, no two cluster centers are within the threshold distance from each other.
           The exact filter, however, is somewhat slow. This option speeds up the filter by using
           a heuristic. The use of the approximate filter may result in cluster centers that are
           close, and a larger number clusters overall. However the approximate filter is usually
           several times faster than the exact sensitive filter. Use this option if you are
           clustering primarily to reduce the redundancy in the data, and do not care about the
           quality of the clustering.

       --allow-left-gaps , -l
           With this option the distances are measured based on semi-global alignment. The
           semi-global alignment allows for gaps without penalty on both ends of the shorter
           sequence.

           The default alignment is a one sided semi-global alignment. i.e. gaps are only allowed
           at the right end of the shorter sequence without penalty. This behavior corresponds to
           the data from targeted sequening of a region (e.g. of 16S ribosomal RNA gene).

       --multiple-alignment, -m
           Set the output format to show the multiple sequence alignment of each cluster. The
           gaps in the alignments are represented with the dash '-' character.

           The format of the MSA output is as follows: The MSA of each cluster spans several
           lines. The MSA starts with a line containing character '#' followed by the number of
           sequences in that cluster. The aligned sequences (which may contain gaps) follow in
           the Fasta format. Each Fasta record will be composed of two lines. The header line and
           the sequence line. Since each aligned sequence is output on a single line, the output
           may contain very long lines. Please use 'less -S', or an editor that does not wrap
           long lines for inspecting the MSA also.

       --no-k-mer-filter
           Disables the k-mer filter. Suitable for clustering very short sequences at a high
           similarity threshold.

       -d, --header
           Write program options to output.

       -h, --help
           Show summary of options.

       -v, --version
           Show version of program.

EXAMPLES

           ./dnaclust file.fasta -l -s 0.98 -k 3 > clusters

BUGS

       The program is currently limited to only work with the boost library.

SEE ALSO

       awk(1), less(1), wc(1)

AUTHOR

       Mohammadreza Ghodsi <ghodsi@cs.umd.edu>
           Designed and developed and maintains this package. Wrote this manpage.

       Copyright © 2010 Mohammadreza Ghodsi

       This manual page was written for the Debian system (but may be used by others).

       Permission is granted to copy, distribute and/or modify this document under the terms of
       the GNU General Public License, Version 2 or (at your option) any later version published
       by the Free Software Foundation.

       On Debian systems, the complete text of the GNU General Public License can be found in
       /usr/share/common-licenses/GPL.