Provided by: infernal_1.1.1-3_amd64

NAME

       cmalign - align sequences to a covariance model

SYNOPSIS

       cmalign
              [options] <cmfile> <seqfile>

DESCRIPTION

       cmalign  aligns  the  RNA  sequences  in  <seqfile>  to  the  covariance model (CM) in <cmfile>.  The new
       alignment is output to stdout in Stockholm format, but can be redirected to a file <f> with  the  -o  <f>
       option.

       Either  <cmfile> or <seqfile> (but not both) may be '-' (dash), which means reading this input from stdin
       rather than a file.
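
       As an illustration of the synopsis and the stdin rule above, the following Python sketch assembles a
       cmalign argument list. The helper name build_cmalign_command and the file names are hypothetical;
       only the -o option and the '-' convention come from this page.

```python
def build_cmalign_command(cmfile, seqfile, output=None, options=()):
    """Assemble an argument list matching: cmalign [options] <cmfile> <seqfile>."""
    # Either input may be '-' (stdin), but not both at once.
    if cmfile == "-" and seqfile == "-":
        raise ValueError("only one of <cmfile> and <seqfile> may be '-'")
    cmd = ["cmalign", *options]
    if output is not None:
        cmd += ["-o", output]  # write the alignment to a file instead of stdout
    return cmd + [cmfile, seqfile]


if __name__ == "__main__":
    # Placeholder file names; the sequences would be read from stdin here.
    print(build_cmalign_command("model.cm", "-", output="out.sto"))
```

       The returned list can be handed to subprocess.run() on a machine where cmalign is installed.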

       The sequence file <seqfile> must be in FASTA or Genbank format.

       cmalign uses an HMM banding technique to accelerate alignment by  default  as  described  below  for  the
       --hbanded option. HMM banding can be turned off with the --nonbanded option.

       By  default,  cmalign  computes  the  alignment  with  maximum  expected accuracy that is consistent with
       constraints (bands) derived from an HMM, using a banded version of  the  Durbin/Holmes  optimal  accuracy
       algorithm.  This behavior can be changed with the --cyk or --sample options.

       cmalign  takes  special  care  to  correctly  align  truncated sequences, where some nucleotides from the
       beginning (5') and/or end (3') of the actual full length biological sequence are not present in the input
       sequence  (see DL Kolbe and SR Eddy, Bioinformatics, 25:1236-1243, 2009). This behavior is on by default,
       but can be turned off with --notrunc.  In previous versions of cmalign the --sub option was  required  to
       appropriately  handle  truncated  sequences. The --sub option is still available in this version, but the
       new default method for handling truncated sequences should be as good or superior to the  sub  method  in
       nearly all cases.

       The --mapali <f> option includes the fixed training alignment that was used to build the CM, read from
       file <f>, within the output alignment of cmalign.

       It is possible to merge two or more alignments created by the same CM using the Easel miniapp
       esl-alimerge (included in the easel/miniapps/ subdirectory of Infernal). Previous versions of cmalign
       included options to merge alignments but they were deprecated upon development of esl-alimerge, which is
       significantly more memory efficient.

       By  default,  cmalign  will output the alignment to stdout.  The alignment can be redirected to an output
       file <f> with the -o <f> option. With -o, information on each aligned sequence, including score and model
       alignment boundaries will be printed to stdout (more on this below).

       The  output  alignment will be in Stockholm format by default. This can be changed to Pfam, aligned FASTA
       (AFA), A2M, Clustal, or Phylip format using the --outformat <s> option, where <s>  is  the  name  of  the
       desired format. As a special case, if the output alignment is large (more than 10,000 sequences or more
       than 10,000,000 total nucleotides) then the output format will be Pfam format, with each sequence
       appearing on a single line, for reasons of memory efficiency. For alignments above these thresholds,
       --ileaved will force interleaved Stockholm format, but the user should be aware that this may require a
       lot of memory. --ileaved will only work for alignments of up to 100,000 sequences or 100,000,000 total
       nucleotides.
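
       The size thresholds above can be summarized in a short Python sketch. It is a reading of this
       paragraph only, not Infernal's code: the function name default_layout is hypothetical, --outformat
       is ignored, and "more than" and "up to" are taken as strict and inclusive bounds respectively.

```python
PFAM_SEQS, PFAM_NT = 10_000, 10_000_000          # above either: Pfam by default
ILEAVED_SEQS, ILEAVED_NT = 100_000, 100_000_000  # upper limits for --ileaved


def default_layout(nseq, total_nt, ileaved=False):
    """Return the layout this page describes for an alignment of the given size."""
    if ileaved:
        if nseq > ILEAVED_SEQS or total_nt > ILEAVED_NT:
            raise ValueError("alignment too large for --ileaved")
        return "interleaved Stockholm"
    if nseq > PFAM_SEQS or total_nt > PFAM_NT:
        return "Pfam"  # one line per sequence, for memory efficiency
    return "Stockholm"
```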

       If the output alignment format is Stockholm  or  Pfam,  the  output  alignment  will  be  annotated  with
       posterior  probabilities which estimate the confidence level of each aligned nucleotide.  This annotation
       appears as lines beginning with "#=GR <seq name> PP",  one  per  sequence,  each  immediately  below  the
       corresponding  aligned sequence "<seq name>". Characters in PP lines have 12 possible values: "0-9", "*",
       or ".". If ".", the position corresponds to a gap in the sequence. A value of "0" indicates  a  posterior
       probability  of between 0.0 and 0.05, "1" indicates between 0.05 and 0.15, "2" indicates between 0.15 and
       0.25 and so on up to "9" which indicates between 0.85 and 0.95. A value  of  "*"  indicates  a  posterior
       probability of between 0.95 and 1.0. Higher posterior probabilities correspond to greater confidence that
       the aligned nucleotide belongs where it appears in the alignment.  With --nonbanded, the  calculation  of
       the  posterior  probabilities considers all possible alignments of the target sequence to the CM. Without
       --nonbanded (i.e. in default mode), the calculation considers only possible  alignments  within  the  HMM
       bands.  Further, the posterior probabilities are conditional on the truncation mode of the alignment. For
       example, if the sequence alignment is truncated 5', a PP value of "9" indicates between 0.85 and 0.95  of
       all 5' truncated alignments include the given nucleotide at the given position.  The posterior annotation
       can be turned off with the --noprob option. If --small is enabled,  posterior  annotation  must  also  be
       turned off using --noprob.
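
       The PP character encoding described above can be decoded with a few lines of Python. This is a
       minimal sketch; the function name pp_interval is hypothetical and the intervals are taken directly
       from the text of this page.

```python
def pp_interval(ch):
    """Map one PP annotation character to its (low, high) probability range."""
    if ch == ".":
        return None                      # gap in the sequence, no probability
    if ch == "*":
        return (0.95, 1.0)
    if len(ch) == 1 and ch in "0123456789":
        d = int(ch)
        low = 0.0 if d == 0 else d / 10 - 0.05   # "0" covers 0.0 to 0.05 only
        return (round(low, 2), round(d / 10 + 0.05, 2))
    raise ValueError(f"not a PP character: {ch!r}")


if __name__ == "__main__":
    for ch in "09*.":
        print(ch, pp_interval(ch))
```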

       The tabular output that is printed to stdout if the -o option is used includes one line per sequence and
       twelve fields per line: "idx": the index of the sequence in the input file; "seq name": the sequence
       name; "length": the length of the sequence; "cm from" and "cm to": the model start and end positions of
       the alignment; "trunc": "no" if the sequence is not truncated, "5'" if the beginning of the sequence
       is truncated, "3'" if the end of the sequence is truncated, and "5'&3'" if both the beginning and the end
       are truncated; "bit sc": the bit score of the alignment; "avg pp": the average posterior probability of
       all aligned nucleotides in the alignment; "band calc", "alignment" and "total": the time in seconds
       required for calculating HMM bands, computing the alignment, and complete processing of the sequence,
       respectively; "mem (Mb)": the size in Mb of all dynamic programming matrices required for aligning the
       sequence. This tabular data can be saved to file <f> with the --sfile <f> option.
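
       A reader for the twelve-field table might look like the Python sketch below. It rests on assumptions
       that this page does not confirm: fields are separated by whitespace, sequence names contain no
       whitespace, header and comment lines begin with "#", and --verbose (which adds fields) is not in
       use. The function name parse_score_line is hypothetical.

```python
FIELDS = ("idx", "seq name", "length", "cm from", "cm to", "trunc",
          "bit sc", "avg pp", "band calc", "alignment", "total", "mem (Mb)")
INT_FIELDS = {"idx", "length", "cm from", "cm to"}
STR_FIELDS = {"seq name", "trunc"}


def parse_score_line(line):
    """Return a dict for one data line, or None for blank/comment lines."""
    if not line.strip() or line.lstrip().startswith("#"):
        return None  # assumption: non-data lines are blank or '#'-prefixed
    tokens = line.split()
    if len(tokens) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(tokens)}")
    row = {}
    for name, tok in zip(FIELDS, tokens):
        if name in STR_FIELDS:
            row[name] = tok
        elif name in INT_FIELDS:
            row[name] = int(tok)
        else:
            row[name] = float(tok)  # scores, probabilities, times, memory
    return row
```

       For example, a made-up line such as "1 seq1 71 1 71 no 47.17 0.943 0.00 0.03 0.03 0.08" would yield a
       dictionary whose "trunc" entry is "no" and whose "bit sc" entry is 47.17.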

OPTIONS

       -h     Help; print a brief reminder of command line usage and available options.

       -o <f> Save the alignment in Stockholm format to a file <f>.  The default is  to  write  it  to  standard
              output.

       -g     Configure the model for global alignment of the query model to the target sequences. By default,
              the model is configured for local alignment. Local alignment allows large insertions and
              deletions in the structure, called "local ends", to be penalized differently than normal indels.
              These are annotated as "~" columns in the RF line of the output alignment. The -g option can be
              used to disallow these local ends. The -g option is required if the --sub option is also used.

OPTIONS FOR CONTROLLING THE ALIGNMENT ALGORITHM

       --optacc
              Align  sequences  using  the  Durbin/Holmes  optimal accuracy algorithm. This is the default.  The
              optimal accuracy  alignment  will  be  constrained  by  HMM  bands  for  acceleration  unless  the
              --nonbanded  option  is  enabled.   The  optimal  accuracy algorithm determines the alignment that
              maximizes the posterior probabilities  of  the  aligned  nucleotides  within  it.   The  posterior
              probabilities are determined using (possibly HMM banded) variants of the Inside and Outside
              algorithms.

       --cyk  Do not use the Durbin/Holmes optimal accuracy algorithm to align the sequences; instead, use the
              CYK algorithm, which determines the optimally scoring (maximum likelihood) alignment of the
              sequence to the model, given the HMM bands (unless --nonbanded is also enabled).

       --sample
              Sample an alignment from the posterior distribution of alignments.  The posterior distribution  is
              determined using an HMM banded (unless --nonbanded) variant of the Inside algorithm.

       --seed <n>
              Seed  the  random  number  generator  with  <n>, an integer >= 0.  This option can only be used in
              combination with --sample.   If  <n>  is  nonzero,  stochastic  sampling  of  alignments  will  be
              reproducible;  the  same  command  will  give  the  same  results.  If <n> is 0, the random number
              generator is seeded arbitrarily, and stochastic samplings may vary from run to  run  of  the  same
              command.  The default seed is 181.
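
              The seeding behavior can be pictured with a toy Python sampler. This is an analogy only: it
              uses Python's random module rather than cmalign's generator, draws from a made-up discrete
              distribution instead of a real posterior over alignments, and the function name
              sample_indices is hypothetical.

```python
import random


def sample_indices(weights, n, seed=181):
    """Draw n indices in proportion to weights; a nonzero seed is reproducible."""
    # A seed of 0 asks for an arbitrary seed, mirroring the description above.
    rng = random.Random(None if seed == 0 else seed)
    return rng.choices(range(len(weights)), weights=weights, k=n)


if __name__ == "__main__":
    toy_posterior = [0.7, 0.2, 0.1]  # made-up probabilities of three alignments
    print(sample_indices(toy_posterior, 5) == sample_indices(toy_posterior, 5))
```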

       --notrunc
              Turn  off  truncated  alignment algorithms.  All sequences in the input file will be assumed to be
              full length, unless --sub is also used, in which case  the  program  can  still  handle  truncated
              sequences but will use an alternative strategy for their alignment.

       --sub  Turn  on  the  sub  model construction and alignment procedure. For each sequence, an HMM is first
              used to predict the model start and end consensus columns, and a new sub CM  is  constructed  that
              only models consensus columns from start to end. The sequence is then aligned to this sub CM.  Sub
              alignment is an older method than the  default  one  for  aligning  sequences  that  are  possibly
              truncated.  By  default,  cmalign  uses  special DP algorithms to handle truncated sequences which
              should be more accurate than the sub method in most cases.  --sub is still included as  an  option
              mainly  for  testing against this default truncated sequence handling.  This "sub CM" procedure is
              not the same as the "sub CMs" described by Weinberg and Ruzzo.

OPTIONS FOR CONTROLLING SPEED AND MEMORY REQUIREMENTS

       --hbanded
              This option is turned on by default. Accelerate alignment by pruning away regions  of  the  CM  DP
              matrix  that are deemed negligible by an HMM.  First, each sequence is scored with a CM plan 9 HMM
              derived from the CM  using  the  Forward  and  Backward  HMM  algorithms  to  calculate  posterior
              probabilities  that each nucleotide aligns to each state of the HMM. These posterior probabilities
              are used to derive constraints (bands) on the CM  DP  matrix.  Finally,  the  target  sequence  is
              aligned  to  the  CM using the banded DP matrix, during which cells outside the bands are ignored.
              Usually most of the full DP matrix lies outside the bands  (often  more  than  95%),  making  this
              technique  faster  because  fewer  DP calculations are required, and more memory efficient because
              only cells within the bands need be allocated.

              Importantly, HMM banding sacrifices the guarantee of determining the optimally accurate or
              optimally scoring alignment, which will be missed if it lies outside the bands. The tau parameter
              is the amount of probability mass considered negligible during HMM band calculation; higher values
              of tau yield greater speedups but also a greater chance of missing the optimal alignment. The
              default tau is 1E-7, determined empirically as a good tradeoff between sensitivity and speed,
              though this value can be changed with the --tau <x> option. The level of acceleration increases
              with both the length and primary sequence conservation level of the family. For example, with the
              default tau of 1E-7, tRNA models (low primary sequence conservation with length of about 75
              nucleotides) show about 10X acceleration, and SSU bacterial rRNA models (high primary sequence
              conservation with length of about 1500 nucleotides) show about 700X. HMM banding can be turned
              off with the --nonbanded option.
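
              To see why a larger tau gives tighter bands, consider the simplified Python sketch below. It
              assumes that, for one HMM state, up to tau of the posterior mass may be trimmed from each
              tail of the distribution over sequence positions; the exact rule Infernal applies is not
              given on this page. The function name band_from_posterior and the toy distribution are
              hypothetical.

```python
def band_from_posterior(post, tau):
    """Return (lo, hi), the index window left after trimming <= tau per tail."""
    lo, left = 0, 0.0
    while lo < len(post) - 1 and left + post[lo] <= tau:
        left += post[lo]   # mass discarded on the left side so far
        lo += 1
    hi, right = len(post) - 1, 0.0
    while hi > lo and right + post[hi] <= tau:
        right += post[hi]  # mass discarded on the right side so far
        hi -= 1
    return lo, hi


if __name__ == "__main__":
    toy = [0.001, 0.009, 0.48, 0.5, 0.009, 0.001]  # made-up posteriors
    for tau in (1e-7, 0.005, 0.05):
        print(tau, band_from_posterior(toy, tau))
```

              With the toy distribution, raising tau narrows the window step by step, so fewer DP cells
              remain inside the band.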

       --tau <x>
              Set the tail loss probability used during HMM band calculation to <x>.   This  is  the  amount  of
              probability mass within the HMM posterior probabilities that is considered negligible. The default
              value is 1E-7.  In general, higher values will result in greater acceleration,  but  increase  the
              chance of missing the optimal alignment due to the HMM bands.

       --mxsize <x>
              Set the maximum allowable total DP matrix size to <x> megabytes. By default this size is 1028 Mb.
              This should be large enough for the vast majority of alignments. If it is not, cmalign will
              attempt to iteratively tighten the HMM bands it uses to constrain the alignment by raising the tau
              parameter and recalculating the bands until the total matrix size needed falls below <x> megabytes
              or the maximum allowable tau value (0.05 by default, but changeable with --maxtau) is reached. At
              each iteration of band tightening, tau is multiplied by 2.0. The band tightening strategy can be
              turned off with the --fixedtau option. If the maximum tau is reached and the required matrix size
              still exceeds <x>, or if HMM banding is not being used and the required matrix size exceeds <x>,
              then cmalign will exit prematurely and report an error message that the matrix exceeded its
              maximum allowable size. In this case, the --mxsize option can be used to raise the size limit or
              the maximum tau can be raised with --maxtau. The limit will commonly be exceeded when the
              --nonbanded option is used without the --small option, but can still occur when --nonbanded is not
              used. Note that if cmalign is being run in <n> threads on a multicore machine then each thread may
              have an allocated matrix of up to size <x> Mb at any given time.
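
              The band tightening loop can be sketched in Python as follows. The defaults (1028 Mb, tau of
              1E-7, maximum tau of 0.05, factor of 2.0) come from this page; everything else is an
              assumption, including the names tighten_tau and toy_matrix_mb, the made-up size model, and
              the choice to stop as soon as the next doubling would exceed the maximum tau rather than
              clamping to it.

```python
import math


def toy_matrix_mb(tau):
    """Hypothetical size model: the matrix shrinks as tau grows (not Infernal's)."""
    return 4096.0 / (1.0 + math.log2(tau / 1e-7))


def tighten_tau(required_mb, mxsize=1028.0, tau=1e-7, maxtau=0.05, factor=2.0):
    """Raise tau until the matrix fits in mxsize Mb; fail once maxtau is passed."""
    size = required_mb(tau)
    while size > mxsize:
        if tau * factor > maxtau:   # assumption: tau is never clamped to maxtau
            raise RuntimeError(
                f"matrix needs {size:.0f} Mb, limit is {mxsize:.0f} Mb")
        tau *= factor
        size = required_mb(tau)     # recalculate the bands, then re-measure
    return tau, size


if __name__ == "__main__":
    print(tighten_tau(toy_matrix_mb))
```

              Under these assumptions, starting from 1E-7 allows at most 18 doublings before the 0.05
              ceiling is passed, so a matrix that never fits is rejected after 19 size evaluations.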

       --fixedtau
              Turn off the HMM band tightening strategy described in the  explanation  of  the  --mxsize  option
              above.

       --maxtau <x>
              Set  the  maximum  allowed  value  for tau during band tightening, described in the explanation of
              --mxsize above, to <x>.  By default this value is 0.05.

       --nonbanded
              Turns off HMM banding. The returned alignment is guaranteed to be the globally optimally  accurate
              one  (by default) or the globally optimally scoring one (if --cyk is enabled).  The --small option
              is recommended in combination with this option, because standard  alignment  without  HMM  banding
              requires a lot of memory (see --small ).

       --small
              Use the divide and conquer CYK alignment algorithm described in SR Eddy, BMC Bioinformatics 3:18,
              2002. The --nonbanded option must be used in combination with this option. Also, it is
              recommended that --small be used whenever --nonbanded is used, because standard CM alignment
              without HMM banding requires a lot of memory, especially for large RNAs. --small allows CM
              alignment within practical memory limits, reducing the memory required for aligning LSU rRNA, the
              largest known RNAs, from 150 Gb to less than 300 Mb. This option can only be used in combination
              with --nonbanded, --notrunc, and --cyk.
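
              Several options on this page depend on others: --small on --nonbanded, --notrunc, --cyk and
              --noprob; --sub on -g; --seed on --sample; and --mapstr on --mapali. A small Python sketch
              for checking a command line against these stated dependencies is shown below. The table and
              the function name missing_requirements are illustrative and are not part of Infernal.

```python
REQUIRES = {
    "--small": {"--nonbanded", "--notrunc", "--cyk", "--noprob"},
    "--sub": {"-g"},
    "--seed": {"--sample"},
    "--mapstr": {"--mapali"},
}


def missing_requirements(flags):
    """Map each used option to the companion options it still lacks."""
    present = set(flags)
    return {opt: sorted(needed - present)
            for opt, needed in REQUIRES.items()
            if opt in present and not needed <= present}


if __name__ == "__main__":
    print(missing_requirements(["--small", "--nonbanded"]))
```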

OPTIONAL OUTPUT FILES

       --sfile <f>
              Dump per-sequence alignment score and timing information to file <f>. The format of this file is
              described above (it's the same data in the same format as the tabular stdout output  when  the  -o
              option is used).

       --tfile <f>
              Dump tabular sequence tracebacks for each individual sequence to a file <f>.  Primarily useful for
              debugging.

       --ifile <f>
              Dump per-sequence insert information to file  <f>.   The  format  of  the  file  is  described  by
              "#"-prefixed  comment  lines included at the top of the file <f>.  The insert information is valid
              even when the --matchonly option is used.

       --elfile <f>
              Dump per-sequence EL state (local end) insert information to file <f>.  The format of the file  is
              described  by  "#"-prefixed  comment  lines  included  at  the top of the file <f>.  The EL insert
              information is valid even when the --matchonly option is used.

OTHER OPTIONS

       --mapali <f>
              Read the alignment from file <f> that was used to build the model and align it as a single object
              to the CM; i.e., the alignment in <f> is held fixed. This allows you to align sequences to a model
              with cmalign and view them in the context of an existing trusted multiple alignment. <f> must be
              the alignment file that the CM was built from. The program verifies that the checksum of the file
              matches that of the file used to construct the CM. A similar option to this one was called
              --withali in previous versions of cmalign.

       --mapstr
              Must be used in combination with --mapali <f>. Propagate structural information for any
              pseudoknots that exist in <f> to the output alignment. A similar option to this one was called
              --withstr in previous versions of cmalign.

       --informat <s>
              Assert that the input <seqfile> is in format <s>. Do not run Babelfish format autodetection. This
              increases the reliability of the program somewhat, because the Babelfish can make mistakes;
              particularly recommended for unattended, high-throughput runs of Infernal. Acceptable formats
              are: FASTA, GENBANK, and DDBJ. <s> is case-insensitive.

       --outformat <s>
              Specify the output alignment format as <s>.  Acceptable formats are: Pfam, AFA, A2M, Clustal,  and
              Phylip.   AFA  is  aligned fasta. Only Pfam and Stockholm alignment formats will include consensus
              structure annotation and posterior probability annotation of aligned residues.

       --dnaout
              Output the alignments as DNA sequence alignments, instead of RNA ones.

       --noprob
              Do not annotate the output alignment with posterior probabilities.

       --matchonly
              Only include match columns in the output alignment; do not include any insertions relative to the
              consensus  model. This option may be useful when creating very large alignments that require a lot
              of memory and disk space, most of which is necessary only to deal with  insert  columns  that  are
              gaps in most sequences.

       --ileaved
              Output  the alignment in interleaved Stockholm format of a fixed width that may be more convenient
              for examination. This was the default output alignment format of  previous  versions  of  cmalign.
              Note  that cmalign requires more memory when this option is used.  For this reason, --ileaved will
              only work for alignments of up to 100,000 sequences or a total of 100,000,000 aligned nucleotides.

       --regress <s>
              Save an additional copy of the output alignment with no author information to file <s>.

       --verbose
              Output additional information in the tabular scores output (output to stdout if -o is used, or  to
              <f> if --sfile <f> is used). These are mainly useful for testing and debugging.

       --cpu <n>
              Specify  that <n> parallel CPU workers be used. If <n> is set as "0", then the program will be run
              in serial mode, without using threads.  You can also control this number by setting an environment
              variable,  INFERNAL_NCPU.  This option will only be available if the machine on which Infernal was
              built is capable of using POSIX threading (see the Installation section of the user guide for more
              information).

       --mpi  Run as an MPI parallel program. This option will only be available if Infernal has been configured
              and built with the "--enable-mpi" flag (see the Installation section of the user  guide  for  more
              information).

SEE ALSO

       See  infernal(1)  for  a  master man page with a list of all the individual man pages for programs in the
       Infernal package.

       For complete documentation, see the user guide that came with your Infernal distribution (Userguide.pdf);
       or see the Infernal web page.

       Copyright (C) 2014 Howard Hughes Medical Institute.
       Freely distributed under the GNU General Public License (GPLv3).

       For additional information on copyright and licensing, see the file called COPYRIGHT in your Infernal
       source distribution, or see the Infernal web page.

AUTHOR

       The Eddy/Rivas Laboratory
       Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147 USA
       http://eddylab.org