
Provided by: daligner_1.0+20151214-1_amd64

NAME

       daligner - long read aligner

SYNOPSIS

       daligner   [-vbAI][-kint(14)]   [-wint(6)]  [-hint(35)]  [-tint]  [-Mint]  [-edouble(.70)]  [-lint(1000)]
       [-sint(100)] [-Hint] [-mtrack]+ subject:db|dam target:db|dam ...

DESCRIPTION

       Compare sequences in the trimmed subject block against those in the list of target blocks, searching  for
       local alignments that involve at least -l base pairs (default 1000) and  have  an  average  correlation
       rate of at least -e (default 70%).  The local alignments found will be output  in  a  sparse  encoding
       where  a  trace  point  on  the  alignment is recorded every -s base pairs of the a-read (default 100bp).
       Reads are compared in both orientations and local alignments meeting the criteria are output  to  one  of
       several  created  files  described  below.   The  -v  option turns on a verbose reporting mode that gives
       statistics on each major step of the computation.
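
       As a rough illustration of the sparse encoding, the sketch below lists the a-read coordinates at which
       trace points would fall for an alignment interval [ab,ae].  It assumes the points sit at multiples of
       the -s spacing strictly inside the interval; that placement rule and the function name are
       illustrative assumptions, not taken from the daligner source.

```python
def trace_point_positions(ab, ae, spacing=100):
    """Return a-read coordinates where a trace point would be recorded.

    Assumption: points lie at multiples of `spacing` strictly inside (ab, ae).
    """
    # First multiple of `spacing` that is strictly greater than ab.
    first = (ab // spacing + 1) * spacing
    return list(range(first, ae, spacing))


if __name__ == "__main__":
    # An alignment covering a[250,1300] with the default spacing of 100.
    print(trace_point_positions(250, 1300))
```

       A larger -s therefore yields fewer trace points per alignment and a more compact record.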

       The options -k, -h, and -w control the initial filtration search  for  possible  matches  between  reads.
       Specifically,  our  search  code  looks for a pair of diagonal bands of width 2^w (default 2^6 = 64) that
       contain a collection of exact matching k-mers (default k = 14) between the two reads, such that the total
       number of bases covered by the k-mer hits is at least h (default 35).  k cannot be larger than 32 in the current
       implementation.  If the -b option is set, then daligner assumes  the  data  has  a  strong  compositional
       bias (e.g. more than 65% AT) and, at the cost of a bit more time, dynamically adjusts k-mer sizes depending
       on compositional bias, so that the k-mers used have an effective specificity of 4^k.
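
       The filtration idea can be sketched in a few lines.  The code below is a simplified illustration, not
       daligner's implementation: it finds exact k-mer hits between two reads, groups them into diagonal bands
       of width 2^w, and keeps a band when the hits in it and in the next band together cover at least h bases
       of the first read.  Pairing each band with its upper neighbour is one reading of "a pair of diagonal
       bands" and is an assumption, as are the function and variable names.

```python
from collections import defaultdict


def seed_bands(a, b, k=14, w=6, h=35):
    """Return band indices whose k-mer hits cover at least h bases of `a`."""
    # Index every k-mer of b by its start positions.
    index = defaultdict(list)
    for j in range(len(b) - k + 1):
        index[b[j:j + k]].append(j)

    # Collect the a-read positions covered by hits, grouped by diagonal band.
    covered = defaultdict(set)
    for i in range(len(a) - k + 1):
        for j in index.get(a[i:i + k], ()):
            band = (i - j) >> w  # each band spans 2^w consecutive diagonals
            covered[band].update(range(i, i + k))

    # A band passes when it and its upper neighbour cover >= h bases together.
    passing = []
    for band in sorted(covered):
        bases = covered[band] | covered.get(band + 1, set())
        if len(bases) >= h:
            passing.append(band)
    return passing


if __name__ == "__main__":
    read = "ACGTTGCAAGCTTAGGCTAACCGGTTAACGTACGATCGATTTGCACAGTCCATGGATCCAA"
    # A read compared with itself always has hits on the main diagonal.
    print(seed_bands(read, read))
```

       Raising k or h makes the filter more selective; lowering w narrows the bands in which hits are pooled.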

       If there are one or more interval tracks specified with the -m option, then the reads of the DB  or  DBs
       to which the mask applies are soft masked with the union of the intervals of all the interval tracks that
       apply; that is, any k-mer that contains a base in any of the masked intervals  is  ignored  for  the
       purposes  of  seeding a match.  An interval track is a track, such as the "dust" track created by DBdust,
       that encodes a set of intervals over either the untrimmed or trimmed DB.
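
       The effect of soft masking on seeding can be illustrated as follows.  This is a minimal sketch under
       stated assumptions: intervals are taken to be half-open (start, end) pairs in read coordinates, and the
       union over all tracks is taken by marking every interval in turn.  The function name is hypothetical.

```python
def unmasked_kmer_starts(read_len, k, masked):
    """Return start positions of k-mers that contain no masked base.

    `masked` is a list of half-open (start, end) intervals (an assumption
    made for this sketch); overlapping intervals are handled naturally.
    """
    is_masked = [False] * read_len
    for start, end in masked:
        for p in range(max(start, 0), min(end, read_len)):
            is_masked[p] = True
    # A k-mer is usable as a seed only if none of its k bases is masked.
    return [i for i in range(read_len - k + 1)
            if not any(is_masked[i:i + k])]


if __name__ == "__main__":
    # A 10-base read with bases 4 and 5 masked, seeded with 3-mers.
    print(unmasked_kmer_starts(10, 3, [(4, 6)]))
```

       Note that a single masked base removes up to k seed positions, because every k-mer overlapping it is
       dropped.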

       Invariably, some k-mers are significantly over-represented (e.g.  homopolymer runs).  These k-mers create
       an  excessive  number  of  matching k-mer pairs and left unaddressed would cause daligner to overflow the
       available physical memory.  One way to deal with this  is  to  explicitly  set  the  -t  parameter  which
       suppresses  the  use  of  any  k-mer that occurs more than t times in either the subject or target block.
       However, a better way to handle the situation is to let the program automatically select  a  value  of  t
       that  meets  a  given memory usage limit specified (in Gb) by the -M parameter.  By default daligner will
       use the amount of physical memory as the choice for -M.  If you want to use less, say only 8Gb on a  24Gb
       HPC  cluster  node because you want to run 3 daligner jobs on the node, then specify -M8.  Specifying -M0
       indicates that you do not want daligner to self-adjust k-mer suppression to fit  within  a  given
       amount of memory.
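
       The suppression rule behind -t can be sketched as below.  This is an illustration only: it counts
       k-mers over a small list of reads and returns those occurring more than t times, which is the set the
       text says is excluded from seeding.  How daligner derives t from -M is not shown, and the function name
       is hypothetical.

```python
from collections import Counter


def frequent_kmers(reads, k, t):
    """Return the set of k-mers that occur more than t times in `reads`."""
    counts = Counter(
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, n in counts.items() if n > t}


if __name__ == "__main__":
    # A homopolymer run makes "AAA" over-represented.
    block = ["AAAAAA", "ACGT"]
    print(frequent_kmers(block, k=3, t=2))
```

       A smaller t suppresses more k-mers, which lowers memory use at the risk of missing seeds in repetitive
       regions.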

       For each (subject, target) pair of blocks, say X and Y, the program reports alignments where the a-read is
       in X and the b-read is in Y, and vice versa.  However, if the -A option is  set  ("A"  for  "asymmetric")
       then  just  overlaps  where the a-read is in X and the b-read is in Y are reported, and if X = Y, then it
       further reports only those overlaps where the a-read index is less than  the  b-read  index.   In  either
       case,  if  the -I option is set ("I" for "identity") then when X = Y, overlaps between different portions
       of the same read will also be found and reported.
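
       The reporting rules for -A and -I can be restated as a small predicate.  The sketch below follows the
       paragraph above literally; the function and parameter names are hypothetical, and block names and read
       indices are passed in directly rather than read from a database.

```python
def is_reported(a_block, a_idx, b_block, b_idx, x, y,
                asymmetric=False, identity=False):
    """Decide whether an overlap is reported for the block pair (x, y)."""
    forward = (a_block, b_block) == (x, y)
    reverse = (a_block, b_block) == (y, x)
    # Without -A both directions are reported; with -A only a in X, b in Y.
    if not (forward or (reverse and not asymmetric)):
        return False
    if x == y:
        if a_idx == b_idx:
            # Overlaps of a read with itself are reported only with -I.
            return identity
        if asymmetric:
            # With -A and X = Y, keep only a-read index < b-read index.
            return a_idx < b_idx
    return True


if __name__ == "__main__":
    print(is_reported("X", 1, "Y", 2, "X", "Y"))
    print(is_reported("Y", 1, "X", 2, "X", "Y", asymmetric=True))
```

       With -A the reverse-direction records are simply absent.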

       Each found alignment is recorded as -- a[ab,ae] x bo[bb,be] -- where a and b  are  the  indices  (in  the
       trimmed  DB)  of  the  reads  that  overlap,  o indicates whether the b-read is from the same or opposite
       strand, and [ab,ae] and [bb,be] are the intervals of a and bo, respectively,  that  align.   The  program
       places  these  alignment records in files whose name is of the form X.Y.[C|N]#.las where C indicates that
       the b-reads are complemented and N indicates they are not (both comparisons are performed) and #  is  the
       thread that detected and wrote out the collection of alignments contained in the file.  That is, the file
       X.Y.O#.las contains the alignments produced by thread # for which the a-read is from X and the b-read  is
       from  Y  and in orientation O.  The command daligner -A X Y produces 2*NTHREAD thread files X.Y.?.las and
       daligner X Y produces 4*NTHREAD files X.Y.?.las and Y.X.?.las (unless X=Y  in  which  case  only  NTHREAD
       files, X.X.?.las, are produced).
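
       The naming scheme for distinct blocks can be enumerated as follows.  This is a sketch under two
       assumptions that the text does not state: thread numbers start at 0, and X differs from Y (the X = Y
       case is deliberately not modelled).  The function name is hypothetical.

```python
def las_names(x, y, nthreads, asymmetric=False):
    """List the .las files expected for distinct blocks x and y.

    Assumptions: threads are numbered from 0 and x != y.
    """
    # With -A only X.Y files are written; otherwise Y.X files as well.
    pairs = [(x, y)] if asymmetric else [(x, y), (y, x)]
    return [
        f"{a}.{b}.{orient}{thread}.las"
        for a, b in pairs
        for orient in "CN"          # C: b-reads complemented, N: not
        for thread in range(nthreads)
    ]


if __name__ == "__main__":
    for name in las_names("X", "Y", nthreads=2, asymmetric=True):
        print(name)
```

       The counts agree with the text: 2*NTHREAD names with -A and 4*NTHREAD without it.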

       By default, daligner finds overlaps between all reads in the database  that  are  longer  than  the
       minimum cutoff set when the DB or DBs were split, typically 1 or  2  Kbp.   However,  the  HGAP  assembly
       pipeline  only  wants  to correct large reads, say 8Kbp or over, and so needs only the overlaps where the
       a-read is one of the large reads.  By setting the -H parameter to, say, N, one alters daligner so that it
       only reports overlaps where the a-read is over N base pairs long.

       While  the  default parameter settings are good for raw PacBio data, daligner can be used for efficiently
       finding alignments in corrected reads or other less noisy reads.  For example, for  mapping  applications
       against .dams, we run

       daligner -k20 -h60 -e.85

       and on corrected reads, we typically run

       daligner -k25 -w5 -h60 -e.95 -s500

       and at these settings it is very fast.

SEE ALSO

       LAsort(1) LAmerge(1) LAshow(1) LAcat(1) LAsplit(1) LAcheck(1) HPCdaligner(1) HPCmapper(1)