Ubuntu Manpage: wtdbg2 - de novo sequence assembler for long noisy reads

NAME

       wtdbg2 - de novo sequence assembler for long noisy reads

SYNOPSIS

       wtdbg2 [options] -i <reads.fa> -o <prefix> [reads.fa ...]

DESCRIPTION

       WTDBG: De novo assembler for long noisy sequences

       Author: Jue Ruan <ruanjue@gmail.com>

       Version: 2.5 (20190621)

OPTIONS


       -i <string>
              Long reads sequence file (REQUIRED; can be given multiple times), []

       -o <string>
              Prefix of output files (REQUIRED), []

       -t <int>
              Number of threads, 0 for all cores, [4]

       -f     Force overwriting of output files

       -x <string>
              Presets, comma delimited, []

              preset1/rsII/rs: -p 21 -S 4 -s 0.05 -L 5000
              preset2: -p 0 -k 15 -AS 2 -s 0.05 -L 5000
              preset3: -p 19 -AS 2 -s 0.05 -L 5000
              sequel/sq
              nanopore/ont:
                     (genome size < 1G: preset2) -p 0 -k 15 -AS 2 -s 0.05 -L 5000
                     (genome size >= 1G: preset3) -p 19 -AS 2 -s 0.05 -L 5000
              preset4/corrected/ccs: -p 21 -k 0 -AS 4 -K 0.05 -s 0.5
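
              A minimal sketch of a typical invocation, built only from options documented on this page.
              The file name reads.fa, the prefix asm, the genome size 4.6m and the thread count are
              hypothetical placeholders, and the command is only run when wtdbg2 and the input file are
              actually present.

```shell
# Hypothetical inputs; substitute your own values.
reads="reads.fa"      # long reads (placeholder file name)
prefix="asm"          # output prefix (placeholder)
genome_size="4.6m"    # approximate genome size, passed to -g
threads=16            # passed to -t

# Assemble the command line from documented options:
# -x rs selects preset1/rsII/rs, -f overwrites existing output files.
cmd="wtdbg2 -x rs -g $genome_size -t $threads -i $reads -f -o $prefix"
echo "$cmd"

# Run it only when the binary and the input are actually available.
if command -v wtdbg2 >/dev/null 2>&1 && [ -f "$reads" ]; then
    $cmd
fi
```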

       -g <number>
              Approximate genome size (k/m/g suffix allowed), [0]

       -X <float>
              Choose the best <float>-fold depth from the input reads (effective with -g), [50.0]

       -L <int>
              Choose the longest subread and drop reads shorter than <int> (5000 recommended for PacBio), [0]
              A negative integer also tidies read names, e.g. -5000.

       -k <int>
              Kmer fsize, 0 <= k <= 23, [0]

       -p <int>
              Kmer psize, 0 <= p <= 23, [21]
              k + p <= 25, seed is <k-mer>+<p-homopolymer-compressed>
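
              The limits on -k and -p stated above can be checked before launching a run. The sketch
              below is illustrative only; check_seed_sizes is a hypothetical helper, not part of wtdbg2.

```shell
# Return success when k (-k) and p (-p) satisfy the documented limits:
# 0 <= k <= 23, 0 <= p <= 23, and k + p <= 25.
check_seed_sizes() {
    k=$1
    p=$2
    [ "$k" -ge 0 ] && [ "$k" -le 23 ] &&
    [ "$p" -ge 0 ] && [ "$p" -le 23 ] &&
    [ $((k + p)) -le 25 ]
}

if check_seed_sizes 0 21; then echo "k=0 p=21: ok"; fi      # the defaults
if check_seed_sizes 15 0; then echo "k=15 p=0: ok"; fi      # as in preset2
if ! check_seed_sizes 10 19; then echo "k=10 p=19: k + p exceeds 25"; fi
```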

       -K <float>
              Filter high frequency kmers, maybe repetitive, [1000.05]
              >= 1000 and indexing >= (1 - 0.05) * total_kmers_count
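
              One way to read the default 1000.05, assuming (as the description suggests) that the
              integer part is the frequency cutoff and the fractional part is the share of kmers that may
              be left out of the index. The variable names are illustrative only.

```shell
K="1000.05"              # value as it would be passed to -K

max_freq=${K%.*}         # part before the dot
drop_frac="0.${K#*.}"    # part after the dot, written back as a fraction

echo "frequency cutoff: $max_freq"
echo "fraction that may be filtered: $drop_frac"
```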

       -S <float>
              Subsample kmers: 1/(<-S>) of the kmers are indexed, [4.00]
              -S is very useful for saving memory and speeding up the run. Please note that subsampled
              kmers yield a shorter matched length

       -l <float>
              Min length of alignment, [2048]

       -m <float>
              Min matched length by kmer matching, [200]

       -R     Enable realignment mode

       -A     Keep contained reads during alignment

       -s <float>
              Min similarity, calculated by kmer matched length / aligned length, [0.05]

       -e <int>
              Min read depth of a valid edge, [3]

       -q     Quiet

       -v     Verbose (can be multiple)

       -V     Print version information and then exit

       --help Show more options

       ** more options **

       --cpu <int>

              See -t 0, default: all cores

       --input <string> +

              See -i

       --force

              See -f

       --prefix <string>

              See -o

       --preset <string>

              See -x

       --kmer-fsize <int>

              See -k 0

       --kmer-psize <int>

              See -p 21

       --kmer-depth-max <float>

              See -K 1000.05

       -E, --kmer-depth-min <int>

              Min kmer frequency, [2]

       --kmer-subsampling <float>

              See -S 4.0

       --kbm-parts <int>

              Split total reads into multiple parts, index one part by one to save memory, [1]

       --aln-kmer-sampling <int>

              Select no more than n seeds in a query bin, default: 256

       --dp-max-gap <int>

              Max number of bins (256 bp) in one gap, [4]

       --dp-max-var <int>

              Max number of bins (256 bp) in one deviation, [4]

       --dp-penalty-gap <int>

              Penalty for BIN gap, [-7]

       --dp-penalty-var <int>

              Penalty for BIN deviation, [-21]

       --aln-min-length <int>

              See -l 2048

       --aln-min-match <int>

              See -m 200. Here the number of matches counts the base pairs in the matched kmer regions

       --aln-min-similarity <float>

              See -s 0.05

       --aln-max-var <float>

              Max length variation of two aligned fragments, default: 0.25

       --aln-dovetail <int>

              Retain dovetail overlaps only. The max overhang size is <--aln-dovetail>; the value should be
              a multiple of 256, -1 to disable filtering, default: 256

       --aln-strand <int>

              1: forward, 2: reverse, 3: both. Please don't change the default value 3 unless you know
              exactly what you are doing

       --aln-maxhit <int>

              Max n hits for each read when building the graph, default: 1000

       --aln-bestn <int>

              Use the best n hits for each read when building the graph, 0: keep all, default: 500.
              <prefix>.alignments always stores all alignments

       -R, --realign

              Enable re-alignment, see --realn-kmer-psize=15, --realn-kmer-subsampling=1,
              --realn-min-length=2048, --realn-min-match=200, --realn-min-similarity=0.1, --realn-max-var=0.25

       --realn-kmer-psize <int>

              Set kmer-psize in realignment (kmer-ksize always equals 0), default: 15

       --realn-kmer-subsampling <int>

              Set kmer-subsampling in realignment, default: 1

       --realn-min-length <int>

              Set aln-min-length in realignment, default: 2048

       --realn-min-match <int>

              Set aln-min-match in realignment, default: 200

       --realn-min-similarity <float>

              Set aln-min-similarity in realignment, default: 0.1

       --realn-max-var <float>

              Set aln-max-var in realignment, default: 0.25

       -A, --aln-noskip

              Even if a read was contained in a previous alignment, still align it against other reads

       --keep-multiple-alignment-parts

              By default, wtdbg keeps only the best alignment between two reads after chaining. This option
              disables that behaviour and keeps multiple alignment parts

       --verbose +

              See -v. -vvvv will display the most detailed information

       --quiet

              See -q

       --limit-input <int>

              Limit the input sequences to at most <int> M bp. Usually for testing

       -L <int>, --tidy-reads <int>

              Default: 0. Pick the longest subreads if possible. Filter out reads shorter than <--tidy-reads>.
              Please add --tidy-name or set --tidy-reads to a negative value if you want to rename reads. Set
              to 0 bp to disable tidying. Suggested value is 5000 for PacBio RSII reads

       --tidy-name

              Rename reads into 'S%010d' format. The first read is named as S0000000001
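
              The 'S%010d' pattern is a printf-style format, so the resulting names can be previewed in
              the shell. tidy_name below is a hypothetical helper used only for illustration.

```shell
# Print the tidy name for the n-th read (1-based), using the documented 'S%010d' format.
tidy_name() {
    printf 'S%010d\n' "$1"
}

tidy_name 1     # S0000000001
tidy_name 42    # S0000000042
```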

       --rdname-filter <string>

              A file containing the names of reads, one per line, to be discarded during loading. If you
              want to filter reads yourself, please also set -X 0

       --rdname-includeonly <string>

              The reverse of --rdname-filter: only the listed reads are kept

       -g <number>, --genome-size <number>

              Provide genome size, e.g. 100.4m, 2.3g. In this version, it is used with -X/--rdcov-cutoff to
              select reads just after all of them have been read.

       -X <float>, --rdcov-cutoff <float>

              Default: 50.0. Retain 50.0-fold genome coverage, combined with -g and --rdcov-filter.
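
              To get a feel for how -g and -X combine, the retained read volume can be estimated as
              genome size times depth. This is a rough sketch, not the program's own code: the helper
              names are hypothetical, and the k/m/g suffixes are assumed to be decimal multipliers.

```shell
# Convert a genome size such as 4.6m or 2.3g to base pairs.
# Assumption: the suffixes are decimal multipliers (k=1e3, m=1e6, g=1e9).
genome_size_bp() {
    awk -v s="$1" 'BEGIN {
        n = s + 0                               # leading numeric part
        u = tolower(substr(s, length(s), 1))    # last character
        if (u == "k") n *= 1e3
        else if (u == "m") n *= 1e6
        else if (u == "g") n *= 1e9
        printf "%.0f\n", n
    }'
}

# Approximate number of bases retained for a genome size and an -X depth.
retained_bases() {
    awk -v g="$(genome_size_bp "$1")" -v x="$2" 'BEGIN { printf "%.0f\n", g * x }'
}

genome_size_bp 4.6m        # 4600000
retained_bases 4.6m 50.0   # 230000000
```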

       --rdcov-filter [0|1]

              Default: 0. Strategy 0: retain the longest reads. Strategy 1: retain median-length reads.

       --err-free-nodes

              Select nodes from error-free sequences only. E.g. you have contigs assembled from NGS-WGS reads,
              and long noisy reads. You can type '--err-free-seq your_ctg.fa --input your_long_reads.fa
              --err-free-nodes' to perform an assembly that acts somewhat like long-read scaffolding

       --node-len <int>

              The default value is 1024, which is a multiple of KBM_BIN_SIZE (always equal to 256 bp). It
              specifies the length of intervals (called nodes after selection). kbm indexes sequences into
              BINs of 256 bp in size, so several parameters should be multiples of 256 bp. These are:
              --node-len, --node-ovl, --aln-min-length, --aln-dovetail. Other parameters are counted in BINs:
              --dp-max-gap, --dp-max-var.
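
              Because these lengths are expected to be whole numbers of 256 bp BINs, a candidate value can
              be checked with plain shell arithmetic. The helper names below are hypothetical.

```shell
BIN_SIZE=256   # KBM_BIN_SIZE as stated above

# Succeed when a length in bp is a whole number of BINs.
is_bin_multiple() {
    [ $(( $1 % BIN_SIZE )) -eq 0 ]
}

# Convert a length in bp to a count of BINs (integer division).
bp_to_bins() {
    echo $(( $1 / BIN_SIZE ))
}

is_bin_multiple 1024 && echo "1024 bp = $(bp_to_bins 1024) BINs"
is_bin_multiple 1000 || echo "1000 bp is not a multiple of $BIN_SIZE"
```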

       --node-matched-bins <int>

              Min matched bins in a node, default: 1

       --node-ovl <int>

              Default: 256. Max overlap size between two adjacent intervals in any read. It is used in selecting
              best nodes representing reads in graph

       --node-drop <float>

              Default: 0.25. Discard a node when more than this ratio of its intervals conflict with
              previously generated nodes

       -e <int>, --edge-min=<int>

              Default: 3. The minimal depth of a valid edge is set to 3. In other words, valid edges must be
              supported by at least 3 reads. When the sequencing depth is low, try --edge-min 2; when it is
              very high, try --edge-min 4

       --edge-max-span <int>

              Default: 1024 BINs. The program builds edges no longer than 1024 BINs

       --drop-low-cov-edges

              Don't attempt to rescue low coverage edges

       --node-min <int>

              Min depth of an interval to be selected as a valid node. By default, this value is the same as
              --edge-min.

       --node-max <int>

              Nodes with too high a depth are regarded as repetitive and are masked. Default: 200, i.e. more
              than 200 reads contain the node

       --ttr-cutoff-depth <int>, 0

       --ttr-cutoff-ratio <float>, 0.5

              Tiny Tandem Repeat (TTR). A node located inside a TTR brings noise into the graph and should be
              masked. The pattern of such nodes is: depth >= <--ttr-cutoff-depth>, and none of their edges has
              a depth greater than depth * <--ttr-cutoff-ratio 0.5>. Set --ttr-cutoff-depth 0 to disable TTR
              masking
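
              The masking rule described above can be written as a small predicate. This is a sketch of
              the stated pattern only, not wtdbg's implementation; is_ttr_node and its argument order are
              hypothetical.

```shell
# Succeed when a node matches the TTR pattern described above.
# Arguments: node depth, cutoff depth, cutoff ratio, then the depth of each edge.
is_ttr_node() {
    depth=$1; cutoff_depth=$2; ratio=$3
    shift 3
    # A cutoff depth of 0 disables TTR masking.
    [ "$cutoff_depth" -gt 0 ] || return 1
    [ "$depth" -ge "$cutoff_depth" ] || return 1
    for edge in "$@"; do
        # Any edge deeper than depth * ratio means the node is not a TTR node.
        if awk -v e="$edge" -v d="$depth" -v r="$ratio" 'BEGIN { exit !(e > d * r) }'; then
            return 1
        fi
    done
    return 0
}

is_ttr_node 100 50 0.5 20 30 && echo "masked as TTR"
is_ttr_node 100 50 0.5 20 60 || echo "kept: one edge exceeds depth * ratio"
```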

       --dump-kbm <string>

              Dump the kbm index into a file to be loaded by `kbm` or `wtdbg`

       --dump-seqs <string>

              Dump the kbm index (sequences only, no k-mer index) into a file to be loaded by `kbm` or `wtdbg`.
              Please note: normally load it with --load-kbm, not with --load-seqs

       --load-kbm <string>

              Instead of reading sequences and building the kbm index, which is time-consuming, load the
              kbm-index from an already dumped file. Please note that once the kbm-index has been mmapped by
              `kbm -R <kbm-index> start`, obtaining the shared memory takes very little time. See `kbm` -R
              <your_seqs.kbmidx> [start | stop]

       --load-seqs <string>

              Similar to --load-kbm, but uses only the sequences in the kbmidx and rebuilds the index in the
              process's RAM.

       --load-alignments <string> +

              `wtdbg` writes read alignments to <--prefix>.alignments; the program can load them to quickly
              build the assembly graph. You can also supply alignments from another source to `wtdbg`. With
              --load-alignments, only the long sequences are read and building the kbm index is skipped. You
              can give --load-alignments <file> more than once to load alignments from several files

       --load-clips <string>

              Combined with --load-nodes. Load read clips. You can find them in `wtdbg`'s <--prefix>.clps

       --load-nodes <string>

              Load dumped nodes from a previous execution to quickly construct the assembly graph; should be
              combined with --load-clips. You can find them in `wtdbg`'s <--prefix>.1.nodes

       --bubble-step <int>

              Max step to search a bubble, meaning the max step from the  starting  node  to  the  ending  node.
              Default: 40

       --tip-step <int>

              Max step to search a tip, default: 10

       --ctg-min-length <int>

              Min length of contigs to be output, default: 5000

       --ctg-min-nodes <int>

              Min number of nodes in a contig to be output, default: 3

       --minimal-output

              Generate as few output files (<--prefix>.*) as possible

       --bin-complexity-cutoff <int>

              Used in filtering BINs. If a BIN has fewer indexed valid kmers than <--bin-complexity-cutoff 2>,
              it is masked.

       --no-local-graph-analysis

              Before building edges, for each node, local-graph-analysis reads all related reads and the
              corresponding nodes, and builds a local graph to judge whether to mask the node. The analysis
              aims to find repetitive nodes

       --no-read-length-sort

              By default, `wtdbg` sorts input sequences by length in descending order. The order of reads
              affects how nodes are generated when selecting important intervals

       --keep-isolated-nodes

              During graph cleaning, `wtdbg` normally masks isolated (orphaned) nodes

       --no-read-clip

              By default, `wtdbg` clips an input sequence by analyzing its overlaps to remove high-error ends,
              rolling-circle repeats (see PacBio CCS), and chimeras. Clipped regions do not contribute when
              building edges. However, `wtdbg` will use them in the final linking of unitigs

       --no-chainning-clip

              By default, alignment chaining is performed during read clipping

       ** If '--aln-bestn 0 --no-read-clip' is given, alignments are parsed directly, and less RAM is spent
       on recording alignments

AUTHOR

        This manpage was written by Andreas Tille for the Debian distribution and
        can be used for any other usage of the program.

wtdbg2 2.5                                         April 2020                                          WTDBG2(1)