Ubuntu Manpage: mash-dist - estimate the distance of query sequences to references

Provided by: mash_2.3+dfsg-1build2_amd64

NAME

       mash-dist - estimate the distance of query sequences to references

SYNOPSIS

       mash dist [options] <reference> <query> [<query>] ...

DESCRIPTION

       Estimate the distance of each query sequence to the reference. Both the reference and queries can be
       fasta or fastq, gzipped or not, or Mash sketch files (.msh) with matching k-mer sizes. Query files can
       also be files of file names (see -l). Whole files are compared by default (see -i). The output fields are
       [reference-ID, query-ID, distance, p-value, shared-hashes].
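
       The five output fields can be read back into a program with a few lines of code. The following is a
       minimal sketch, assuming the default (non-table) output is tab-delimited with one comparison per line
       and that shared-hashes is written as "matching/compared". The names DistRecord and parse_dist_line are
       hypothetical, and the sample line is made up for illustration.

```python
from typing import NamedTuple


class DistRecord(NamedTuple):
    """One line of default `mash dist` output (hypothetical container)."""
    reference_id: str
    query_id: str
    distance: float
    p_value: float
    shared: int  # number of matching hashes
    total: int   # number of hashes compared


def parse_dist_line(line: str) -> DistRecord:
    # Assumption: fields are tab-separated and shared-hashes looks like "456/1000".
    ref, query, dist, pval, shared = line.rstrip("\n").split("\t")
    matching, compared = shared.split("/")
    return DistRecord(ref, query, float(dist), float(pval),
                      int(matching), int(compared))


# Made-up example line, not real program output.
example = "ref.fna\tquery.fna\t0.0222766\t0\t456/1000"
rec = parse_dist_line(example)
print(rec.reference_id, rec.query_id, rec.distance, rec.shared / rec.total)
```

       Keeping the numerator and denominator of shared-hashes separate makes it easy to recompute the fraction
       of shared hashes later.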

OPTIONS

       -h
           Help

       -p <int>
           Parallelism. This many threads will be spawned for processing. [1]

   Input
       -l
           List input. Each query file contains a list of sequence files, one per line. The reference file is
           not affected.

   Output
       -t
           Table output. P-values are not reported, but fields are left blank if they do not meet the p-value
           threshold (see -v).

       -v <num>
           Maximum p-value to report. (0-1) [1.0]

       -d <num>
           Maximum distance to report. (0-1) [1.0]
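
       The effect of -v and -d can be pictured as a filter over result records. The following is a rough
       sketch, assuming both limits are inclusive (that detail is an assumption, not something stated above).
       The function name passes_thresholds and the sample values are hypothetical.

```python
def passes_thresholds(distance: float, p_value: float,
                      max_distance: float = 1.0,
                      max_p_value: float = 1.0) -> bool:
    """Return True if a comparison would be reported.

    Defaults mirror the documented defaults of -d and -v (both 1.0).
    Assumption: the comparison is inclusive (<=).
    """
    return distance <= max_distance and p_value <= max_p_value


# Made-up (distance, p-value) pairs for illustration.
results = [(0.02, 1e-30), (0.30, 0.20), (0.08, 0.001)]

# Equivalent in spirit to running with: -d 0.1 -v 0.01
kept = [r for r in results if passes_thresholds(*r, max_distance=0.1,
                                                max_p_value=0.01)]
print(kept)
```

       With the default limits of 1.0 every comparison passes, which matches the behavior of reporting all
       reference/query pairs.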

   Sketching
       -k <int>
           K-mer size. Hashes will be based on strings of this many nucleotides. Canonical nucleotides are used
           by default (see Sketching (alphabet) below). (1-32) [21]

       -s <int>
           Sketch size. Each sketch will have at most this many non-redundant min-hashes. [1000]

       -i
           Sketch individual sequences, rather than whole files.

       -w <num>
           Probability threshold for warning about low k-mer size. (0-1) [0.01]

       -r
           Input is a read set. See Sketching (reads) below. Incompatible with -i.
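
       To make the roles of -k and -s concrete, the following is a simplified sketch of bottom-s min-hash
       sketching and of turning a Jaccard estimate into a distance. It is an illustration under stated
       assumptions, not the program's implementation: BLAKE2b from the Python standard library stands in for
       the hash function Mash actually uses, strand is preserved (as with -n), and all function names are
       hypothetical. The distance formula D = -(1/k) ln(2j / (1 + j)) is the Mash distance as published.

```python
import hashlib
import math


def kmers(seq: str, k: int):
    """Yield every substring of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def hash64(kmer: str) -> int:
    # Stand-in 64-bit hash; the real tool uses a different hash function.
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq: str, k: int = 21, s: int = 1000) -> list:
    """Keep at most the s smallest distinct hash values."""
    return sorted({hash64(km) for km in kmers(seq.upper(), k)})[:s]


def shared_hashes(a: list, b: list, s: int):
    """Count hashes shared among the s smallest of the merged sketches."""
    merged = sorted(set(a) | set(b))[:s]
    shared = len(set(a) & set(b) & set(merged))
    return shared, len(merged)


def mash_distance(shared: int, total: int, k: int) -> float:
    if shared == 0:
        return 1.0
    j = shared / total
    d = -math.log(2 * j / (1 + j)) / k
    return min(1.0, max(0.0, d))


seq_a = "ACGTTGCATGTCGCATGATGCATGAGAGCT"
seq_b = "A" * 30  # shares no 5-mer with seq_a
sk_a = sketch(seq_a, k=5, s=10)
sk_b = sketch(seq_b, k=5, s=10)
print(mash_distance(*shared_hashes(sk_a, sk_a, 10), k=5))
print(mash_distance(*shared_hashes(sk_a, sk_b, 10), k=5))
```

       A larger sketch size gives a finer-grained Jaccard estimate at the cost of larger sketches, while a
       larger k makes chance k-mer matches between unrelated sequences less likely.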

   Sketching (reads)
       -b <size>
           Use a Bloom filter of this size (raw bytes or with K/M/G/T) to filter out unique k-mers. This is
           useful if exact filtering with -m uses too much memory. However, some unique k-mers may pass
           erroneously, and copies cannot be counted beyond 2. Implies -r.

       -m <int>
           Minimum copies of each k-mer required to pass noise filter for reads. Implies -r. [1]

       -c <num>
           Target coverage. Sketching will conclude if this coverage (estimated by average k-mer multiplicity)
           is reached before the end of the input file. Implies -r.

       -g <size>
           Genome size. If specified, will be used for p-value calculation instead of an estimated size from
           k-mer content. Implies -r.
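
       The exact noise filter selected with -m can be thought of as counting k-mers across all reads and
       keeping only those seen often enough. The following is a minimal sketch of that idea, assuming plain
       string k-mers with no canonicalization or hashing; the function name filter_kmers and the sample reads
       are hypothetical.

```python
from collections import Counter


def filter_kmers(reads, k: int, min_copies: int = 1) -> set:
    """Return k-mers that occur at least min_copies times across all reads."""
    counts = Counter(
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, n in counts.items() if n >= min_copies}


# Made-up reads: the final base differs, as a sequencing error might cause.
reads = ["ACGTA", "ACGTC"]
print(sorted(filter_kmers(reads, k=4, min_copies=1)))
print(sorted(filter_kmers(reads, k=4, min_copies=2)))
```

       Exact counting keeps one counter per distinct k-mer, which is why memory use can become a concern for
       large read sets and why the Bloom filter option (-b) is offered as an approximate alternative.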

   Sketching (alphabet)
       -n
           Preserve strand (by default, strand is ignored by using canonical DNA k-mers, which are alphabetical
           minima of forward-reverse pairs). Implied if an alphabet is specified with -a or -z.

       -a
           Use amino acid alphabet (A-Z, except BJOUXZ). Implies -n, -k 9.

       -z <text>
           Alphabet to base hashes on (case ignored by default; see -Z). K-mers with other characters will be
           ignored. Implies -n.

       -Z
           Preserve case in k-mers and alphabet (case is ignored by default). Sequence letters whose case is not
           in the current alphabet will be skipped when sketching.
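
       The canonical k-mer described under -n is the alphabetically smaller of a k-mer and its reverse
       complement. The following is a minimal sketch for upper-case DNA only; the function names are
       hypothetical and characters outside A, C, G and T are not handled.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Complement every base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Alphabetical minimum of the forward and reverse-complement k-mer."""
    return min(kmer, reverse_complement(kmer))


# Both strands of the same site map to one canonical form.
print(canonical("TTGA"), canonical("TCAA"))
```

       Because both strands map to the same string, a sequence and its reverse complement yield the same set
       of canonical k-mers. With -n the forward k-mer would be used as-is.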
