Ubuntu Manpage: daligner - long read aligner

Provided by: daligner_1.0+20151214-1_amd64

NAME

       daligner - long read aligner

SYNOPSIS

       daligner [-vbAI] [-k<int(14)>] [-w<int(6)>] [-h<int(35)>] [-t<int>] [-M<int>] [-e<double(.70)>]
       [-l<int(1000)>] [-s<int(100)>] [-H<int>] [-m<track>]+ <subject:db|dam> <target:db|dam> ...

DESCRIPTION

Compare sequences in the trimmed subject block against those in the list of target blocks, searching for
local alignments that involve at least -l base pairs (default 1000) and have an average correlation
rate of at least -e (default 70%). The local alignments found are output in a sparse encoding in which
a trace point on the alignment is recorded every -s base pairs of the a-read (default 100bp).
Reads are compared in both orientations and local alignments meeting the criteria are output to one of
several created files described below. The -v option turns on a verbose reporting mode that gives
statistics on each major step of the computation.
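
As a rough illustration of the sparse encoding, the sketch below lists the a-read coordinates at which trace points would fall inside an aligned interval [ab,ae]. It assumes that trace points sit at multiples of the -s spacing in a-read coordinates; the function name and that placement rule are illustrative assumptions, not taken from the daligner source.

```python
def trace_points(ab, ae, s=100):
    """Return a-read coordinates strictly inside (ab, ae) that are multiples of s.

    Assumption for illustration: trace points are placed at multiples of the
    trace spacing s along the a-read.
    """
    first = (ab // s + 1) * s  # first multiple of s strictly greater than ab
    return list(range(first, ae, s))


print(trace_points(250, 620))        # spacing 100 (the -s default)
print(trace_points(0, 1000, s=500))  # spacing 500, as with -s500
```

A larger spacing stores fewer points per alignment, which is why a setting such as -s500 suits low-error reads.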

The options -k, -h, and -w control the initial filtration search for possible matches between reads.
Specifically, our search code looks for a pair of diagonal bands of width 2^w (default 2^6 = 64) that
contain a collection of exact matching k-mers (default k = 14) between the two reads, such that the total
number of bases covered by the k-mer hits is at least h (default 35). k cannot be larger than 32 in the
current implementation. If the -b option is set, then daligner assumes the data has a strong compositional
bias (e.g. >65% AT rich), and at the cost of a bit more time, dynamically adjusts k-mer sizes depending
on compositional bias, so that the mers used have an effective specificity of 4^k.
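
The filtration criterion can be sketched in a few lines of Python. This is a toy version under stated assumptions, not daligner's actual algorithm (which works on sorted k-mer lists over whole blocks): it bins each exact k-mer hit by its diagonal i - j into bands of width 2^w and accepts a read pair if two neighbouring bands together cover at least h bases of the a-read. The name seed_hit is hypothetical.

```python
from collections import defaultdict


def seed_hit(a, b, k=14, w=6, h=35):
    """Toy filter: do k-mer hits in two adjacent diagonal bands cover >= h bases of a?"""
    # Index every k-mer of b by its start position.
    index = defaultdict(list)
    for j in range(len(b) - k + 1):
        index[b[j:j + k]].append(j)

    # For each diagonal band, collect the a-read bases covered by k-mer hits.
    covered = defaultdict(set)
    for i in range(len(a) - k + 1):
        for j in index.get(a[i:i + k], ()):
            band = (i - j) >> w  # diagonal i - j, binned into bands of width 2^w
            covered[band].update(range(i, i + k))

    # Accept if two neighbouring bands together cover at least h bases.
    empty = set()
    return any(len(bases | covered.get(band + 1, empty)) >= h
               for band, bases in covered.items())


shared = "ACGTACGATCGATTAGCATA"        # 20 bases common to both reads
a = "C" * 50 + shared + "C" * 50
b = "G" * 50 + shared + "G" * 50
print(seed_hit(a, b))        # only 20 bases are covered, below h = 35
print(seed_hit(a, b, h=20))  # a lower threshold accepts the pair
```

Raising k or h makes the filter stricter and faster, which is the trade-off behind the settings suggested for less noisy reads.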

If one or more interval tracks are specified with the -m option, then the reads of the DB or DBs to
which the mask applies are soft masked with the union of the intervals of all the interval tracks that
apply; that is, any k-mer that contains a base in any of the masked intervals is ignored for the
purposes of seeding a match. An interval track is a track, such as the "dust" track created by DBdust,
that encodes a set of intervals over either the untrimmed or trimmed DB.
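
The effect of soft masking on seeding can be illustrated as follows. The sketch assumes masked intervals are given as half-open (begin, end) pairs in read coordinates; that representation and the function name are assumptions made for the example, not the on-disk track format.

```python
def unmasked_kmer_starts(read_len, k, masked):
    """Start positions of k-mers that contain no masked base.

    masked: list of half-open (begin, end) intervals; pass the intervals of
    all applicable tracks together to get the union described above.
    """
    starts = []
    for i in range(read_len - k + 1):
        # The k-mer occupies [i, i + k); keep it only if it misses every interval.
        if all(i + k <= beg or i >= end for beg, end in masked):
            starts.append(i)
    return starts


print(unmasked_kmer_starts(10, 3, [(4, 6)]))
```

Note that a masked interval of length m removes up to m + k - 1 k-mers from seeding, since every k-mer that touches the interval is dropped.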

Invariably, some k-mers are significantly over-represented (e.g. homopolymer runs). These k-mers create
an excessive number of matching k-mer pairs and, left unaddressed, would cause daligner to overflow the
available physical memory. One way to deal with this is to explicitly set the -t parameter, which
suppresses the use of any k-mer that occurs more than t times in either the subject or target block.
However, a better way to handle the situation is to let the program automatically select a value of t
that meets a given memory usage limit specified (in Gb) by the -M parameter. By default daligner will
use the amount of physical memory as the choice for -M. If you want to use less, say only 8Gb on a 24Gb
HPC cluster node because you want to run 3 daligner jobs on the node, then specify -M8. Specifying -M0
indicates that you do not want daligner to self-adjust k-mer suppression to fit within a given
amount of memory.
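
The suppression rule behind -t can be sketched as below: count every k-mer in a block and set aside those that occur more than t times. This is a simplified illustration with a hypothetical function name; how daligner derives t from the -M memory limit is not shown, since the text above does not describe it.

```python
from collections import Counter


def suppressed_kmers(reads, k, t):
    """Return the set of k-mers occurring more than t times across the reads of a block."""
    counts = Counter(read[i:i + k]
                     for read in reads
                     for i in range(len(read) - k + 1))
    return {kmer for kmer, n in counts.items() if n > t}


block = ["AAAAAA", "ACGTAC"]
print(suppressed_kmers(block, k=3, t=2))  # the homopolymer k-mer exceeds the threshold
```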

For each pair of subject and target blocks, say X and Y, the program reports alignments where the a-read is
in X and the b-read is in Y, and vice versa. However, if the -A option is set ("A" for "asymmetric")
then just overlaps where the a-read is in X and the b-read is in Y are reported, and if X = Y, then it
further reports only those overlaps where the a-read index is less than the b-read index. In either
case, if the -I option is set ("I" for "identity") then when X = Y, overlaps between different portions
of the same read will also be found and reported.
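
The reporting rules for -A and -I can be summarised as a small predicate. This is a reading of the paragraph above rather than code from daligner; reads are modelled as hypothetical (block, index) pairs.

```python
def is_reported(a, b, x, y, asymmetric=False, identity=False):
    """Would 'daligner X Y' report an overlap with a-read a and b-read b?

    a, b: (block, read_index) pairs; x, y: the subject and target blocks.
    asymmetric models -A, identity models -I.
    """
    a_block, a_idx = a
    b_block, b_idx = b
    forward = a_block == x and b_block == y
    backward = a_block == y and b_block == x

    if x == y:
        if not forward:
            return False
        if a_idx == b_idx:          # a read against itself: only with -I
            return identity
        if asymmetric:              # -A with X = Y: keep a-index < b-index only
            return a_idx < b_idx
        return True

    return forward if asymmetric else (forward or backward)


print(is_reported((1, 5), (2, 7), x=1, y=2))                   # a in X, b in Y
print(is_reported((2, 7), (1, 5), x=1, y=2, asymmetric=True))  # reverse pair under -A
```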

Each found alignment is recorded as -- a[ab,ae] x bo[bb,be] -- where a and b are the indices (in the
trimmed DB) of the reads that overlap, o indicates whether the b-read is from the same or opposite
strand, and [ab,ae] and [bb,be] are the intervals of a and bo, respectively, that align. The program
places these alignment records in files whose name is of the form X.Y.[C|N]#.las where C indicates that
the b-reads are complemented and N indicates they are not (both comparisons are performed) and # is the
thread that detected and wrote out the collection of alignments contained in the file. That is, the file
X.Y.O#.las contains the alignments produced by thread # for which the a-read is from X and the b-read is
from Y and in orientation O. The command daligner -A X Y produces 2*NTHREAD thread files X.Y.?.las, and
daligner X Y produces 4*NTHREAD files X.Y.?.las and Y.X.?.las (unless X=Y, in which case only NTHREAD
files, X.X.?.las, are produced).
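
To make the naming scheme concrete, the sketch below enumerates the expected output files for two distinct blocks. It assumes threads are numbered from 0 and does not model the X = Y case; both the function name and the numbering are assumptions for illustration.

```python
def las_files(x, y, nthread, asymmetric=False):
    """List the X.Y.[C|N]#.las names expected for distinct blocks x and y.

    Assumption: thread numbers run from 0 to nthread - 1.
    """
    pairs = [(x, y)] if asymmetric else [(x, y), (y, x)]
    return [f"{a}.{b}.{o}{t}.las"
            for a, b in pairs
            for o in "CN"               # complemented and non-complemented b-reads
            for t in range(nthread)]


print(len(las_files("X", "Y", 4, asymmetric=True)))  # 2 * NTHREAD
print(len(las_files("X", "Y", 4)))                   # 4 * NTHREAD
```

A list like this is handy for checking that a job finished completely before the per-thread files are sorted and merged.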

By default, daligner reports all overlaps between reads in the database that are longer than the
minimum cutoff set when the DB or DBs were split, typically 1 or 2 Kbp. However, the HGAP assembly
pipeline only wants to correct large reads, say 8Kbp or over, and so needs only the overlaps where the
a-read is one of the large reads. By setting the -H parameter to, say, N, one alters daligner so that it
only reports overlaps where the a-read is over N base-pairs long.
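
The effect of -H amounts to a length filter on the a-read only. A minimal sketch, with overlaps modelled as hypothetical (a_length, b_length) pairs:

```python
def filter_by_a_length(overlaps, min_a_len=None):
    """Keep overlaps whose a-read is over min_a_len bases long; None means no -H filter."""
    if min_a_len is None:
        return list(overlaps)
    return [(a_len, b_len) for a_len, b_len in overlaps if a_len > min_a_len]


overlaps = [(9000, 1500), (1500, 9000), (8000, 8000)]
print(filter_by_a_length(overlaps, min_a_len=8000))
```

The b-read length plays no part, so a short read can still appear as the b-read of a long a-read.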

While the default parameter settings are good for raw PacBio data, daligner can be used for efficiently
finding alignments in corrected reads or other less noisy reads. For example, for mapping applications
against .dams, we run

daligner -k20 -h60 -e.85

and on corrected reads, we typically run

daligner -k25 -w5 -h60 -e.95 -s500

and at these settings it is very fast.
