Ubuntu Manpage: cmalign - align sequences to a covariance model

NAME

       cmalign - align sequences to a covariance model

SYNOPSIS

       cmalign
              [options] <cmfile> <seqfile>

DESCRIPTION

cmalign aligns the RNA sequences in <seqfile> to the covariance model (CM) in <cmfile>.
The new alignment is output to stdout in Stockholm format, but can be redirected to a file
<f> with the -o <f> option.

Either <cmfile> or <seqfile> (but not both) may be '-' (dash), which means reading this
input from stdin rather than a file.

The sequence file <seqfile> must be in FASTA or Genbank format.

cmalign uses an HMM banding technique to accelerate alignment by default as described
below for the --hbanded option. HMM banding can be turned off with the --nonbanded option.

By default, cmalign computes the alignment with maximum expected accuracy that is
consistent with constraints (bands) derived from an HMM, using a banded version of the
Durbin/Holmes optimal accuracy algorithm. This behavior can be changed with the --cyk or
--sample options.

cmalign takes special care to correctly align truncated sequences, where some nucleotides
from the beginning (5') and/or end (3') of the actual full length biological sequence are
not present in the input sequence (see DL Kolbe and SR Eddy, Bioinformatics, 25:1236-1243,
2009). This behavior is on by default, but can be turned off with --notrunc. In previous
versions of cmalign the --sub option was required to appropriately handle truncated
sequences. The --sub option is still available in this version, but the new default method
for handling truncated sequences should be as good or superior to the sub method in nearly
all cases.

The --mapali <f> option includes the fixed training alignment used to build the CM, read
from file <f>, in the output alignment of cmalign.

It is possible to merge two or more alignments created by the same CM using the Easel
miniapp esl-alimerge (included in the easel/miniapps/ subdirectory of Infernal). Previous
versions of cmalign included options to merge alignments but they were deprecated upon
development of esl-alimerge, which is significantly more memory efficient.

By default, cmalign will output the alignment to stdout. The alignment can be redirected
to an output file <f> with the -o <f> option. With -o, information on each aligned
sequence, including score and model alignment boundaries will be printed to stdout (more
on this below).

The output alignment will be in Stockholm format by default. This can be changed to Pfam,
aligned FASTA (AFA), A2M, Clustal, or Phylip format using the --outformat <s> option,
where <s> is the name of the desired format. As a special case, if the output alignment
is large (more than 10,000 sequences or more than 10,000,000 total nucleotides), then the
output format will be Pfam format, with each sequence appearing on a single line, for
reasons of memory efficiency. For alignments larger than this, using --ileaved will force
interleaved Stockholm format, but the user should be aware that this may require a lot of
memory. --ileaved will only work for alignments up to 100,000 sequences or 100,000,000
total nucleotides.
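
The size rule above can be sketched in Python. This only illustrates the thresholds
stated in this section and is not cmalign's actual code; the function name and the choice
to raise an error beyond the --ileaved limits are assumptions made for the example.

```python
# Thresholds quoted from the text above.
PFAM_SEQS, PFAM_NT = 10_000, 10_000_000
ILEAVED_SEQS, ILEAVED_NT = 100_000, 100_000_000


def default_output_format(nseq, total_nt, ileaved=False):
    """Guess the alignment format emitted when --outformat is not given."""
    large = nseq > PFAM_SEQS or total_nt > PFAM_NT
    if ileaved:
        # Assumption: exceeding either --ileaved limit is treated as an error here.
        if nseq > ILEAVED_SEQS or total_nt > ILEAVED_NT:
            raise ValueError("alignment too large for --ileaved")
        return "Stockholm (interleaved)"
    return "Pfam" if large else "Stockholm"
```

For example, default_output_format(20_000, 100_000) returns "Pfam", while the same call
with ileaved=True returns "Stockholm (interleaved)".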

If the output alignment format is Stockholm or Pfam, the output alignment will be
annotated with posterior probabilities which estimate the confidence level of each aligned
nucleotide. This annotation appears as lines beginning with "#=GR <seq name> PP", one per
sequence, each immediately below the corresponding aligned sequence "<seq name>".
Characters in PP lines have 12 possible values: "0-9", "*", or ".". If ".", the position
corresponds to a gap in the sequence. A value of "0" indicates a posterior probability of
between 0.0 and 0.05, "1" indicates between 0.05 and 0.15, "2" indicates between 0.15 and
0.25 and so on up to "9" which indicates between 0.85 and 0.95. A value of "*" indicates a
posterior probability of between 0.95 and 1.0. Higher posterior probabilities correspond
to greater confidence that the aligned nucleotide belongs where it appears in the
alignment. With --nonbanded, the calculation of the posterior probabilities considers all
possible alignments of the target sequence to the CM. Without --nonbanded (i.e. in default
mode), the calculation considers only possible alignments within the HMM bands. Further,
the posterior probabilities are conditional on the truncation mode of the alignment. For
example, if the sequence alignment is truncated 5', a PP value of "9" indicates between
0.85 and 0.95 of all 5' truncated alignments include the given nucleotide at the given
position. The posterior annotation can be turned off with the --noprob option. If --small
is enabled, posterior annotation must also be turned off using --noprob.
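
The PP character encoding described above can be decoded with a small helper. This is a
minimal sketch; the function name pp_range is hypothetical and not part of Infernal.

```python
def pp_range(ch):
    """Map one PP annotation character to its (low, high) probability range."""
    if ch == ".":
        return None  # gap in the sequence, so no probability applies
    if ch == "*":
        return (0.95, 1.0)
    if ch == "0":
        return (0.0, 0.05)
    if len(ch) == 1 and ch in "123456789":
        mid = int(ch) / 10
        # Round to two decimals to hide floating point noise such as 0.15000000000000002.
        return (round(mid - 0.05, 2), round(mid + 0.05, 2))
    raise ValueError(f"unexpected PP character: {ch!r}")
```

Applying pp_range to each character of a "#=GR <seq name> PP" line gives a per-column
confidence range that can be used, for instance, to mask poorly supported positions.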

The tabular output that is printed to stdout if the -o option is used includes one line
per sequence and twelve fields per line: "idx": the index of the sequence in the input
file; "seq name": the sequence name; "length": the length of the sequence; "cm from" and
"cm to": the model start and end positions of the alignment; "trunc": "no" if the sequence
is not truncated, "5'" if the beginning of the sequence is truncated, "3'" if the end of
the sequence is truncated, and "5'&3'" if both the beginning and the end are truncated;
"bit sc": the bit score of the alignment; "avg pp": the average posterior probability of
all aligned nucleotides in the alignment; "band calc", "alignment" and "total": the time
in seconds required for calculating HMM bands, computing the alignment, and complete
processing of the sequence, respectively; "mem (Mb)": the size in Mb of all dynamic
programming matrices required for aligning the sequence. This tabular data can be saved
to file <f> with the --sfile <f> option.
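
A hedged sketch of reading the twelve fields into Python follows. It assumes the fields
are whitespace-separated, that sequence names contain no whitespace, and that any lines
beginning with "#" are comments; none of these layout details are stated above, and the
names AlignRow, parse_row and parse_table are hypothetical.

```python
from typing import NamedTuple


class AlignRow(NamedTuple):
    idx: int
    seq_name: str
    length: int
    cm_from: int
    cm_to: int
    trunc: str        # "no", "5'", "3'" or "5'&3'"
    bit_sc: float
    avg_pp: float
    band_calc: float  # seconds
    alignment: float  # seconds
    total: float      # seconds
    mem_mb: float


def parse_row(line):
    """Parse one data line; assumes exactly twelve whitespace-separated fields."""
    f = line.split()
    if len(f) != 12:
        raise ValueError(f"expected 12 fields, found {len(f)}")
    return AlignRow(int(f[0]), f[1], int(f[2]), int(f[3]), int(f[4]), f[5],
                    *map(float, f[6:]))


def parse_table(text):
    """Parse all non-blank lines that do not start with '#' (assumed comments)."""
    return [parse_row(ln) for ln in text.splitlines()
            if ln.strip() and not ln.lstrip().startswith("#")]
```

Typical use would be parse_table(open(path).read()) on a file written with --sfile,
where path is whatever file name was passed to that option.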

OPTIONS

       -h     Help; print a brief reminder of command line usage and available options.

       -o <f> Save the alignment in Stockholm format to a file <f>.  The default is to  write  it
              to standard output.

       -g     Configure the model for global alignment of the query model to the target
              sequences. By default, the model is configured for local alignment. Local
              alignment allows large insertions and deletions in the structure, called
              "local ends", which are penalized differently than normal indels. These are
              annotated as "~" columns in the RF line of the output alignment. The -g
              option can be used to disallow these local ends. The -g option is required
              if the --sub option is also used.

OPTIONS FOR CONTROLLING THE ALIGNMENT ALGORITHM

--optacc
Align sequences using the Durbin/Holmes optimal accuracy algorithm. This is the
default. The optimal accuracy alignment will be constrained by HMM bands for
acceleration unless the --nonbanded option is enabled. The optimal accuracy
algorithm determines the alignment that maximizes the posterior probabilities of
the aligned nucleotides within it. The posterior probabilities are determined using
(possibly HMM banded) variants of the Inside and Outside algorithms.

--cyk Do not use the Durbin/Holmes optimal accuracy algorithm to align the sequences;
instead use the CYK algorithm, which determines the optimally scoring (maximum
likelihood) alignment of the sequence to the model, given the HMM bands (unless
--nonbanded is also enabled).

--sample
Sample an alignment from the posterior distribution of alignments. The posterior
distribution is determined using an HMM banded (unless --nonbanded) variant of the
Inside algorithm.

--seed <n>
Seed the random number generator with <n>, an integer >= 0. This option can only
be used in combination with --sample. If <n> is nonzero, stochastic sampling of
alignments will be reproducible; the same command will give the same results. If
<n> is 0, the random number generator is seeded arbitrarily, and stochastic
samplings may vary from run to run of the same command. The default seed is 181.

--notrunc
Turn off truncated alignment algorithms. All sequences in the input file will be
assumed to be full length, unless --sub is also used, in which case the program can
still handle truncated sequences but will use an alternative strategy for their
alignment.

--sub Turn on the sub model construction and alignment procedure. For each sequence, an
HMM is first used to predict the model start and end consensus columns, and a new
sub CM is constructed that only models consensus columns from start to end. The
sequence is then aligned to this sub CM. Sub alignment is an older method than the
default one for aligning sequences that are possibly truncated. By default, cmalign
uses special DP algorithms to handle truncated sequences which should be more
accurate than the sub method in most cases. --sub is still included as an option
mainly for testing against this default truncated sequence handling. This "sub CM"
procedure is not the same as the "sub CMs" described by Weinberg and Ruzzo.

OPTIONS FOR CONTROLLING SPEED AND MEMORY REQUIREMENTS

--hbanded
This option is turned on by default. Accelerate alignment by pruning away regions
of the CM DP matrix that are deemed negligible by an HMM. First, each sequence is
scored with a CM plan 9 HMM derived from the CM using the Forward and Backward HMM
algorithms to calculate posterior probabilities that each nucleotide aligns to each
state of the HMM. These posterior probabilities are used to derive constraints
(bands) on the CM DP matrix. Finally, the target sequence is aligned to the CM
using the banded DP matrix, during which cells outside the bands are ignored.
Usually most of the full DP matrix lies outside the bands (often more than 95%),
making this technique faster because fewer DP calculations are required, and more
memory efficient because only cells within the bands need be allocated.

Importantly, HMM banding sacrifices the guarantee of determining the optimally
accurate or optimally scoring alignment, which will be missed if it lies outside the bands.
The tau parameter is the amount of probability mass considered negligible during
HMM band calculation; higher values of tau yield greater speedups but also a greater
chance of missing the optimal alignment. The default tau is 1E-7, determined
empirically as a good tradeoff between sensitivity and speed, though this value can
be changed with the --tau <x> option. The level of acceleration increases with
both the length and primary sequence conservation level of the family. For example,
with the default tau of 1E-7, tRNA models (low primary sequence conservation with
length of about 75 nucleotides) show about 10X acceleration, and SSU bacterial rRNA
models (high primary sequence conservation with length of about 1500 nucleotides)
show about 700X. HMM banding can be turned off with the --nonbanded option.

--tau <x>
Set the tail loss probability used during HMM band calculation to <x>. This is the
amount of probability mass within the HMM posterior probabilities that is
considered negligible. The default value is 1E-7. In general, higher values will
result in greater acceleration, but increase the chance of missing the optimal
alignment due to the HMM bands.
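
To make the idea of a tail loss probability concrete, the following simplified Python
sketch trims at most tau of the probability mass from each end of one state's posterior
distribution over sequence positions. It is an illustration only: the function name, the
input layout, and the choice to apply tau to each tail separately are assumptions, and
Infernal's actual band calculation may differ.

```python
def band_from_posteriors(post, tau=1e-7):
    """Return (lo, hi), the inclusive index range kept after tail trimming.

    post is a list of posterior probabilities, one per sequence position,
    for a single HMM state (hypothetical input layout).
    """
    total = sum(post)
    lo, mass = 0, 0.0
    # Drop positions from the left while the discarded mass stays within tau.
    while lo < len(post) - 1 and mass + post[lo] / total <= tau:
        mass += post[lo] / total
        lo += 1
    hi, mass = len(post) - 1, 0.0
    # Drop positions from the right in the same way.
    while hi > lo and mass + post[hi] / total <= tau:
        mass += post[hi] / total
        hi -= 1
    return lo, hi
```

With post = [0.0, 0.001, 0.499, 0.499, 0.001, 0.0], the default tau keeps positions 1
through 4, while tau=0.01 keeps only positions 2 and 3: a larger tau gives a tighter
band, and therefore fewer DP cells to fill.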

--mxsize <x>
Set the maximum allowable total DP matrix size to <x> megabytes. By default this
size is 1028 Mb. This should be large enough for the vast majority of alignments.
If it is not, cmalign will attempt to iteratively tighten the HMM bands it
uses to constrain the alignment by raising the tau parameter and recalculating the
bands until the total matrix size needed falls below <x> megabytes or the maximum
allowable tau value (0.05 by default, but changeable with --maxtau) is reached. At
each iteration of band tightening, tau is multiplied by 2.0. The band tightening
strategy can be turned off with the --fixedtau option. If the maximum tau is
reached and the required matrix size still exceeds <x>, or if HMM banding is not
being used and the required matrix size exceeds <x>, then cmalign will exit
prematurely and report an error message that the matrix exceeded its maximum
allowable size. In this case, the --mxsize option can be used to raise the size limit or
the maximum tau can be raised with --maxtau. The limit will commonly be exceeded
when the --nonbanded option is used without the --small option, but can still occur
when --nonbanded is not used. Note that if cmalign is being run in <n> multiple
threads on a multicore machine then each thread may have an allocated matrix of up
to size <x> Mb at any given time.
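
The band tightening loop described for --mxsize can be sketched as follows. The callable
matrix_mb is a hypothetical stand-in for the real band and matrix size calculation, and
the exact stopping rule (giving up once another doubling would exceed the maximum tau) is
an assumption; the text above only says that tightening ends when the maximum tau is
reached.

```python
def tighten_tau(matrix_mb, mxsize=1028.0, tau=1e-7, maxtau=0.05, fixedtau=False):
    """Return (tau, doublings) for which matrix_mb(tau) fits in mxsize Mb.

    matrix_mb: hypothetical callable mapping tau to the required DP matrix size in Mb.
    Raises RuntimeError when the matrix cannot be made to fit.
    """
    doublings = 0
    while matrix_mb(tau) > mxsize:
        # Assumption: stop once one more doubling would pass maxtau.
        if fixedtau or tau * 2.0 > maxtau:
            raise RuntimeError("DP matrix exceeds maximum allowable size")
        tau *= 2.0  # tau is multiplied by 2.0 at each tightening iteration
        doublings += 1
    return tau, doublings
```

Starting from the default tau of 1E-7, at most 18 doublings fit under the default maximum
of 0.05 with this stopping rule, since 1E-7 multiplied by 2 nineteen times is about 0.052.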

--fixedtau
Turn off the HMM band tightening strategy described in the explanation of the
--mxsize option above.

--maxtau <x>
Set the maximum allowed value for tau during band tightening, described in the
explanation of --mxsize above, to <x>. By default this value is 0.05.

--nonbanded
Turns off HMM banding. The returned alignment is guaranteed to be the globally
optimally accurate one (by default) or the globally optimally scoring one (if --cyk
is enabled). The --small option is recommended in combination with this option,
because standard alignment without HMM banding requires a lot of memory (see
--small ).

--small
Use the divide and conquer CYK alignment algorithm described in SR Eddy, BMC
Bioinformatics 3:18, 2002. The --nonbanded option must be used in combination with
this option. Also, it is recommended whenever --nonbanded is used that --small is
also used because standard CM alignment without HMM banding requires a lot of
memory, especially for large RNAs. --small allows CM alignment within practical
memory limits, reducing the memory required for aligning LSU rRNA, the largest
known RNAs, from 150 Gb to less than 300 Mb. This option can only be used in
combination with --nonbanded, --notrunc, and --cyk.

OPTIONAL OUTPUT FILES

       --sfile <f>
              Dump per-sequence alignment score and timing information to file <f>.  The format of
              this file is described above (it's the same data in the same format as the  tabular
              stdout output when the -o option is used).

       --tfile <f>
              Dump  tabular  sequence  tracebacks  for  each  individual  sequence to a file <f>.
              Primarily useful for debugging.

       --ifile <f>
              Dump per-sequence insert information to file  <f>.   The  format  of  the  file  is
              described  by  "#"-prefixed comment lines included at the top of the file <f>.  The
              insert information is valid even when the --matchonly option is used.

       --elfile <f>
              Dump per-sequence EL state (local end) insert information to file <f>.  The  format
              of  the  file is described by "#"-prefixed comment lines included at the top of the
              file <f>.  The EL insert information is valid even when the --matchonly  option  is
              used.

OTHER OPTIONS

--mapali <f>
Reads the alignment from file <f> used to build the model and aligns it as a single
object to the CM; i.e. the alignment in <f> is held fixed. This allows you to
align sequences to a model with cmalign and view them in the context of an existing
trusted multiple alignment. <f> must be the alignment file that the CM was built
from. The program verifies that the checksum of the file matches that of the file
used to construct the CM. A similar option to this one was called --withali in
previous versions of cmalign.

--mapstr
Must be used in combination with --mapali <f>. Propagate structural information
for any pseudoknots that exist in <f> to the output alignment. A similar option to
this one was called --withstr in previous versions of cmalign.

--informat <s>
Assert that the input <seqfile> is in format <s>. Do not run Babelfish format
autodetection. This increases the reliability of the program somewhat, because the
Babelfish can make mistakes; particularly recommended for unattended, high-
throughput runs of Infernal. Acceptable formats are: FASTA, GENBANK, and DDBJ.
<s> is case-insensitive.

--outformat <s>
Specify the output alignment format as <s>. Acceptable formats are: Pfam, AFA,
A2M, Clustal, and Phylip. AFA is aligned fasta. Only Pfam and Stockholm alignment
formats will include consensus structure annotation and posterior probability
annotation of aligned residues.

--dnaout
Output the alignments as DNA sequence alignments, instead of RNA ones.

--noprob
Do not annotate the output alignment with posterior probabilities.

--matchonly
Only include match columns in the output alignment, do not include any insertions
relative to the consensus model. This option may be useful when creating very large
alignments that require a lot of memory and disk space, most of which is necessary
only to deal with insert columns that are gaps in most sequences.

--ileaved
Output the alignment in interleaved Stockholm format of a fixed width that may be
more convenient for examination. This was the default output alignment format of
previous versions of cmalign. Note that cmalign requires more memory when this
option is used. For this reason, --ileaved will only work for alignments of up to
100,000 sequences or a total of 100,000,000 aligned nucleotides.

--regress <s>
Save an additional copy of the output alignment with no author information to file
<s>.

--verbose
Output additional information in the tabular scores output (output to stdout if -o
is used, or to <f> if --sfile <f> is used). These are mainly useful for testing and
debugging.

--cpu <n>
Specify that <n> parallel CPU workers be used. If <n> is set as "0", then the
program will be run in serial mode, without using threads. You can also control
this number by setting an environment variable, INFERNAL_NCPU. This option will
only be available if the machine on which Infernal was built is capable of using
POSIX threading (see the Installation section of the user guide for more
information).

--mpi Run as an MPI parallel program. This option will only be available if Infernal has
been configured and built with the "--enable-mpi" flag (see the Installation
section of the user guide for more information).
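
Several options on this page are documented as requiring others. The Python sketch below
collects those stated dependencies into a pre-flight check that could be run before
launching cmalign. It is not part of Infernal; the names REQUIRES and check_options are
hypothetical, and only rules stated on this page are encoded.

```python
# Option dependencies as stated in the option descriptions on this page.
REQUIRES = {
    "--sub": {"-g"},
    "--seed": {"--sample"},
    "--small": {"--nonbanded", "--notrunc", "--cyk", "--noprob"},
    "--mapstr": {"--mapali"},
}


def check_options(flags, cmfile, seqfile):
    """Return a list of problems found; an empty list means none were detected."""
    flags = set(flags)
    problems = []
    if cmfile == "-" and seqfile == "-":
        problems.append("only one of <cmfile> and <seqfile> may be '-'")
    for opt, needed in sorted(REQUIRES.items()):
        missing = sorted(needed - flags)
        if opt in flags and missing:
            problems.append(f"{opt} requires {', '.join(missing)}")
    return problems
```

For instance, check_options({"--small", "--nonbanded", "--cyk"}, "my.cm", "seqs.fa")
reports that --small also needs --noprob and --notrunc (the two file names are
placeholders).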

COPYRIGHT

       Copyright (C) 2016 Howard Hughes Medical Institute.
       Freely distributed under a BSD open source license.

       For  additional  information  on copyright and licensing, see the file called COPYRIGHT in
       your Infernal source distribution, or see the Infernal web page.

AUTHOR

       The Eddy/Rivas Laboratory
       Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147 USA
       http://eddylab.org