Ubuntu Manpage: mhap - probabilistic sequence overlapping

NAME

       mhap - probabilistic sequence overlapping

DESCRIPTION

       MHAP (MinHash Alignment Protocol) is a tool for finding overlaps of long-read sequences
       (such as PacBio or Nanopore) in bioinformatics. Either the -s or the -p option must be
       set; see the options below.

              Version: 1.6, Build time: 09/12/2015 11:46 PM

              Usage 1 (direct execution):
              java -server -Xmx<memory> -jar <MHAP jar> -s<fasta/dat from/self file>
              [-q<fasta/dat to file>] [-f<kmer filter list, must be sorted>]

              Usage 2 (generate precomputed binaries):
              java -server -Xmx<memory> -jar <MHAP jar> -p<directory of fasta files>
              -q <output directory> [-f<kmer filter list, must be sorted>]
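
       As a rough illustration of Usage 1, the sketch below assembles a command line and prints
       it rather than running it. The jar path mhap.jar, the input name reads.fasta and the 32g
       heap size are placeholders, not values taken from this page. Writing the option and its
       value as separate words is also an assumption; the usage line above prints them run
       together.

```shell
#!/bin/sh
# Placeholders (not from this page): adjust to your installation and data.
MHAP_JAR="mhap.jar"      # hypothetical path to the MHAP jar
READS="reads.fasta"      # hypothetical FASTA file of long reads
HEAP="32g"               # hypothetical JVM heap size passed to -Xmx

# Usage 1: the reads given with -s are stored in the box; with no -q,
# only -s is supplied. --num-threads is shown with its documented default.
cmd="java -server -Xmx$HEAP -jar $MHAP_JAR -s $READS --num-threads 12"

# Dry run: print the command line instead of executing it.
echo "$cmd"
```

       Once the placeholders point at real files, the printed line can be run directly.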

       --alignment, default = false
              Experimental option.

       --alignment-offset, default = -0.535
              The offset to account for the variance in the alignment match score.

       --alignment-score, default = 1.0E-6
              The cutoff score for alignment matches.

       --filter-threshold, default = 1.0E-5
              [double],  the  cutoff  at  which  the k-mer in the k-mer filter file is considered
              repetitive. This value for a specific k-mer is specified in the  second  column  in
              the filter file. If no filter file is provided, this option is ignored.

       --help, default = false
              Displays the help menu.

       --max-shift, default = 0.2
              [double], region size to the left and right of the estimated overlap, as derived
              from the median shift and sequence length, where k-mer matches are still
              considered valid. Second stage filter only.

       --min-store-length, default = 0
              [int], the minimum length of a read that is stored in the box. Used to filter out
              short reads from the FASTA file.

       --nanopore-fast, default = false
              Set all the parameters for the Nanopore fast settings. This  is  the  current  best
              guidance, and could change at any time without warning.

       --no-self, default = false
              Do not compute the overlaps between sequences inside a box. Use this when the
              to and from sequences come from different files.

       --num-hashes, default = 512
              [int], number of min-mers to be used in MinHashing.

       --num-min-matches, default = 3
              [int], minimum number of min-mers that must be shared before computing the second
              stage filter. Any sequences below that value are considered non-overlapping.

       --num-threads, default = 12
              [int], number of threads to use for computation. Typically set to 2 x #cores.
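
              A minimal sketch of the "2 x #cores" guidance, assuming the GNU coreutils nproc
              command is available. Note that nproc counts the processing units available to
              the current process, which on some machines are hardware threads rather than
              physical cores, so the result may need adjusting.

```shell
#!/bin/sh
# nproc (GNU coreutils) prints the number of available processing units.
cores=$(nproc)

# Follow the "2 x #cores" rule of thumb from the option description.
threads=$((cores * 2))

# Print the option as it would be appended to an mhap command line.
echo "--num-threads $threads"
```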

       --pacbio-fast, default = false
              Set all the parameters for the PacBio fast settings. This is the current best
              guidance, and could change at any time without warning.

       --pacbio-sensitive, default = false
              Set all the parameters for the PacBio sensitive settings. This is the current  best
              guidance, and could change at any time without warning.

       --store-full-id, default = false
              Store full IDs as seen in the FASTA file, rather than storing just the sequence
              position in the file. Some FASTA files have long IDs, slowing output of results.
              This option is ignored when using the compressed file format.

       --threshold, default = 0.04
              [double],  the  threshold  similarity  score cutoff for the second stage sort-merge
              filter. This is based on the average number of k-mers matching in  the  overlapping
              region.

       --version, default = false
              Displays the version and build time.

       --weighted, default = false
              Perform weighted MinHashing.

       -f, default = ""
              k-mer filter file used for filtering out highly repetitive k-mers. Must be sorted
              in descending order of frequency (second column).
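
              The sorting requirement can be met with a standard sort, sketched below under
              assumptions: the sample file, its tab-separated layout and its values are
              invented for illustration, and the -g flag is specific to GNU sort.

```shell
#!/bin/sh
# Hypothetical two-column filter list: k-mer, then its frequency.
tmpdir=$(mktemp -d)
printf 'ACGTACGTACGTACGT\t2.0E-6\nTTTTTTTTTTTTTTTT\t4.5E-4\nGGGGCCCCGGGGCCCC\t1.0E-5\n' \
    > "$tmpdir/kmers.txt"

# Sort on the second column, descending. -g (GNU sort) compares general
# numeric values, so scientific notation such as 4.5E-4 is ordered correctly.
# LC_ALL=C keeps "." as the decimal separator regardless of locale.
LC_ALL=C sort -k2,2gr "$tmpdir/kmers.txt" > "$tmpdir/kmers.sorted.txt"

# The first line should now hold the most frequent k-mer.
top=$(head -n 1 "$tmpdir/kmers.sorted.txt" | cut -f1)
bottom=$(tail -n 1 "$tmpdir/kmers.sorted.txt" | cut -f1)
echo "most frequent: $top"
```

              The sorted file is the one that would then be passed with -f.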

       -h, default = false
              Displays the help menu.

       -k, default = 16
              [int], k-mer size used for MinHashing. The k-mer size for second  stage  filter  is
              separate, and cannot be modified.

       -p, default = ""
              Usage  2  only.  The  directory  containing FASTA files that should be converted to
              binary format for storage.

       -q, default = ""
              Usage 1: The FASTA file of reads, or a directory of files, that will be compared to
              the  set of reads in the box (see -s). Usage 2: The output directory for the binary
              formatted dat files.

       -s, default = ""
              Usage 1 only. The FASTA or binary dat file (see Usage 2)  of  reads  that  will  be
              stored in a box, and that all subsequent reads will be compared to.
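
       The two usages can be chained: first convert FASTA files to binary dat files (Usage 2),
       then compute overlaps from the dat files (Usage 1). The sketch below only prints the two
       command lines. All paths are placeholders, and in particular the name reads.dat for a
       generated file is an assumption, since this page does not say how the output files are
       named.

```shell
#!/bin/sh
# Placeholders (not from this page).
MHAP_JAR="mhap.jar"     # hypothetical path to the MHAP jar
FASTA_DIR="fasta_in"    # hypothetical directory of FASTA files
DAT_DIR="dat_out"       # hypothetical output directory for dat files

# Step 1 (Usage 2): -p names the FASTA directory, -q the output directory.
step1="java -server -Xmx32g -jar $MHAP_JAR -p $FASTA_DIR -q $DAT_DIR"

# Step 2 (Usage 1): -s names one dat file to store in the box, and -q a
# directory of files to compare against it. reads.dat is a made-up name.
step2="java -server -Xmx32g -jar $MHAP_JAR -s $DAT_DIR/reads.dat -q $DAT_DIR"

# Dry run: print both command lines instead of executing them.
echo "$step1"
echo "$step2"
```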