Provided by: vmatch_2.3.1+dfsg-7_amd64

NAME

       mkvtree - construct index for sequence

SYNOPSIS

       mkvtree [options]

DESCRIPTION

       The program mkvtree constructs an index for a given set of sequences. These are given as a
       list of input files. The sequences are referred to as database sequences. They can be over
       any given alphabet. The alphabet can be the DNA alphabet, or the protein alphabet, or any
       other alphabet consisting of printable characters. An alphabet is specified by a file
       storing a symbol mapping. The index consists of several files, the index files. Each such
       file stores a different table. The user specifies which tables (i.e. which parts of the
       index) are written to file, using one or more of the eight output options, or a single
       option specifying that all tables are written to file.
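
       To make the notion of index tables more concrete, the following minimal Python sketch
       computes toy versions of three of the tables named under OPTIONS (suffix array,
       longest common prefix lengths, Burrows-Wheeler transformation) for a short string. It
       uses the textbook definitions only. The function name toy_tables and the "$" marker are
       assumptions made for this illustration; the encoding and file layout that mkvtree
       actually uses are not shown here and may differ.

```python
def toy_tables(text: str, marker: str = "$"):
    """Build toy versions of three index tables for a short string.

    suftab: start positions of all suffixes in lexicographic order.
    lcptab: length of the longest common prefix of each sorted suffix and
            its predecessor in that order (0 for the first entry).
    bwttab: the character preceding each sorted suffix; `marker` stands in
            for the suffix that starts at position 0 (an assumption of
            this sketch, not mkvtree's convention).
    """
    # Naive construction: sort suffix start positions by the suffix itself.
    suftab = sorted(range(len(text)), key=lambda i: text[i:])

    def common_prefix_len(a: str, b: str) -> int:
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    lcptab = [
        0 if r == 0 else common_prefix_len(text[suftab[r - 1]:], text[suftab[r]:])
        for r in range(len(suftab))
    ]
    bwttab = "".join(text[i - 1] if i > 0 else marker for i in suftab)
    return suftab, lcptab, bwttab


if __name__ == "__main__":
    print(toy_tables("banana"))
```

       The naive sort is only suitable for tiny inputs; it is meant to show what the tables
       contain, not how they are constructed efficiently.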

       We support the following formats for the input files. They are recognized according to the
       first non-whitespace symbol in the file.

       •   multiple FASTA format: If the file begins with the symbol ">", then this file is
           considered to be a file in multiple FASTA format (i.e. it contains one or more
           sequences). Each line starting with the symbol ">" contains the description of the
           sequence following it. Each line not starting with the symbol ">" contains the
           sequence. Empty lines are allowed and ignored when reading the input.

       •   multiple EMBL/SWISSPROT format: If the file begins with the string "ID", then this
           file is considered to be a file in multiple EMBL format (i.e. containing one or more
           sequences, each in EMBL format). The information contained in the "ID" and "DE" lines
           is taken as the description of the corresponding sequence. The EMBL format is
           identical to the SWISSPROT format (w.r.t. the information we need to extract from such
           entries). So one can also use files in multiple SWISSPROT format as input.

       •   multiple GENBANK format: If the file begins with the string "LOCUS", then this file is
           considered to be a file in multiple GENBANK format (i.e. containing one or more
           entries in GENBANK format). The information contained in the "LOCUS" and the
           "DEFINITION" lines is taken as the description of the corresponding sequence.

       •   plain format: If the file does not begin with the symbol ">" or the strings "ID" or
           "LOCUS", then the file is taken verbatim. That is, the entire file is considered to be
           the input sequence (whitespace is not ignored).
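
       The detection rules above can be sketched in a few lines of Python. This is an
       illustration of the rules as stated, not the code mkvtree uses; the function names
       detect_format and parse_fasta and the returned labels are assumptions made for this
       example.

```python
def detect_format(text: str) -> str:
    """Classify input by what follows its leading whitespace."""
    head = text.lstrip()
    if head.startswith(">"):
        return "fasta"
    if head.startswith("LOCUS"):
        return "genbank"
    if head.startswith("ID"):
        return "embl"  # the same rule covers SWISSPROT files
    return "plain"


def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Return (description, sequence) pairs from multiple FASTA text."""
    records: list[tuple[str, list[str]]] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # empty lines are allowed and ignored
        if line.startswith(">"):
            # A ">" line carries the description of the sequence after it.
            records.append((line[1:].strip(), []))
        elif records:
            # Any other line contributes to the current sequence.
            records[-1][1].append(line.strip())
    return [(desc, "".join(parts)) for desc, parts in records]


if __name__ == "__main__":
    sample = ">seq1 first example\nACGT\n\nTTGA\n>seq2\nGGCC\n"
    print(detect_format(sample), parse_fasta(sample))
```

       Because the plain format acts as a catch-all in this sketch, a file that was meant to be
       FASTA but lacks the leading ">" is silently treated as one verbatim sequence.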

       There is no special option necessary to tell the program the sequence format. It
       automatically detects the appropriate format, according to the rules given above. If none
       of the above rules apply, then the program cannot recognize the input format and exits
       with error code 1. In such a case, please check whether your input files conform to the
       input formats above. Another good solution is to use a more versatile sequence format
       conversion program (e.g. readseq) to first generate multiple FASTA files and then feed
       these into mkvtree.

       Today many sequence files are distributed compressed with the program gzip. To simplify
       the use of these files, mkvtree also accepts gzipped input files. These files must have
       the ending ".gz". Gzipped files are decompressed internally and then processed like any
       other file.
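
       The same convention (decide by the ".gz" ending, then decompress transparently) is easy
       to mirror in scripts that prepare or inspect input files. The following Python sketch
       uses the standard gzip module; the function name read_sequence_file is an assumption
       made for this example and is not part of mkvtree.

```python
import gzip


def read_sequence_file(path: str) -> bytes:
    """Return the file contents, decompressing when the name ends in '.gz'."""
    if path.endswith(".gz"):
        # gzip.open decompresses transparently while reading.
        with gzip.open(path, "rb") as handle:
            return handle.read()
    with open(path, "rb") as handle:
        return handle.read()
```

       Like the rule described above, this sketch looks only at the file name, so a gzipped
       file without the ".gz" ending would be read as raw compressed bytes.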

OPTIONS

       -db <file>
           Specify database files (mandatory).

       -smap <file>
           Specify file containing a symbol mapping. This describes the grouping of symbols. It
           is possible to set the environment variable MKVTREESMAPDIR to the path where these
           files can be found.

       -dna
           Input is DNA sequence.

       -protein
           Input is protein sequence.

       -indexname <string>
           Specify name for index to be generated.

       -pl <length>
           Specify prefix length for bucket sort. Recommendation: use without argument; then a
           reasonable prefix length is automatically determined.

       -tis
           Output transformed input sequences (tistab) to file.

       -ois
           Output original input sequences (oistab) to file.

       -suf
           Output suffix array (suftab) to file.

       -sti1
           Output reduced inverse suffix array (sti1tab) to file.

       -bwt
           Output Burrows-Wheeler Transformation (bwttab) to file.

       -bck
           Output bucket boundaries (bcktab) to file.

       -skp
           Output skip values (skptab) to file.

       -lcp
           Output longest common prefix lengths (lcptab) to file.

       -allout
           Output all index tables to files.

       -maxdepth <len>
           Restrict the sorting to prefixes of the given length.

       -v
           Verbose mode.

       -version
           Show the version of the Vmatch package.

       -help
           Show help.
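
       As an illustration of how the options combine, the following Python sketch assembles an
       argument list from the flags documented above. The helper build_mkvtree_command, its
       defaults, and the file names are assumptions made for this example. It also assumes
       that several files may follow a single -db and that the order of options does not
       matter, neither of which is stated on this page.

```python
OUTPUT_TABLES = {"tis", "ois", "suf", "sti1", "bwt", "bck", "skp", "lcp", "allout"}


def build_mkvtree_command(db_files, indexname, alphabet="dna",
                          tables=("tis", "suf", "lcp")):
    """Return an argument list for mkvtree (hypothetical helper)."""
    if not db_files:
        raise ValueError("-db is mandatory: give at least one database file")
    if alphabet not in ("dna", "protein"):
        raise ValueError("alphabet must be 'dna' or 'protein'")
    unknown = [t for t in tables if t not in OUTPUT_TABLES]
    if unknown:
        raise ValueError(f"unknown output option(s): {unknown}")

    cmd = ["mkvtree", "-db", *db_files, f"-{alphabet}", "-indexname", indexname]
    cmd.append("-pl")  # no argument: prefix length is determined automatically
    cmd.extend(f"-{t}" for t in tables)
    return cmd


if __name__ == "__main__":
    # Placeholder file and index names, chosen only for the example.
    print(" ".join(build_mkvtree_command(["seqs.fna"], "myindex")))
```

       Passing tables=("allout",) requests all index tables instead of a selection.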

RETURNS

       If an error occurs, the program exits with error code 1. Otherwise, the exit code is 0.
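
       A calling script can rely on this convention to detect failures. The following Python
       sketch wraps any command line and raises an exception on a non-zero exit code. The
       function name run_checked is an assumption made for this example, and so is the idea
       that diagnostic messages arrive on standard error.

```python
import subprocess


def run_checked(cmd):
    """Run a command and raise RuntimeError if its exit code is not 0."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{cmd[0]} failed with exit code {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    return result.stdout
```

       For example, run_checked(["mkvtree", "-help"]) would return the help text if mkvtree is
       installed and on the search path.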

SEE ALSO

       mkdna6idx(1)

                                                                                       MKVTREE(1)