Provided by: infernal_1.0.2-2_amd64

NAME

       cmalign - use a CM to make a structured RNA multiple alignment

SYNOPSIS

       Align sequences to a CM:
              cmalign [options] cmfile seqfile

       Merge two alignments:
              cmalign --merge [options] cmfile msafile1 msafile2

DESCRIPTION

       cmalign  aligns  the  RNA sequences in seqfile to the covariance model (CM) in cmfile, and
       outputs a multiple sequence alignment.  Alternatively, with the  --merge  option,  cmalign
       merges  the  two alignments msafile1 and msafile2 created by previous runs of cmalign with
       cmfile into a single alignment.

       The sequence file seqfile must be in FASTA, EMBL, or Genbank format.

       CMs are profiles of RNA consensus sequence and secondary structure. A CM file is  produced
       by the cmbuild program, from a given RNA sequence alignment of known consensus structure.

       The  alignment that cmalign makes is written in Stockholm format.  It can be redirected to
       a file using the -o option.

       cmalign uses an HMM banding technique to accelerate  alignment  by  default  as  described
       below for the --hbanded option. HMM banding can be turned off with the --nonbanded option.

       By  default,  cmalign  computes  the  alignment  with  maximum  expected  accuracy that is
       consistent with constraints (bands) derived from an HMM, using a  banded  version  of  the
       Durbin/Holmes  optimal  accuracy  algorithm.   This  behavior can be changed, as described
       below and in the User's Guide, with the --cyk, --sample, or --viterbi options.

       It is possible to include the fixed training alignment used to build  the  CM  within  the
       output  alignment of cmalign.  This is done using the --withali option, as described below
       and in the User's Guide.

OUTPUT

       cmalign first outputs tabular information on the scores of each  sequence  being  aligned,
       then  the  alignment  itself is printed. The alignment can be redirected to an output file
       <f> with the -o <f> option.  The tabular output section includes one  line  per  sequence
       and  seven  fields per line:  "seq idx": the index of the sequence in the input file, "seq
       name": the sequence name, "len": the length of the sequence, "total": the total bit  score
       of the sequence, "struct": an approximation of the contribution of the secondary structure
       to the bit score, "avg prob": the average posterior probability (confidence  estimate)  of
       each aligned residue, and "elapsed": the wall time spent aligning the sequence.

       The  fields  can change if different options are selected. For example, if the --cyk option
       is enabled, the "avg prob"  field  disappears  because  posterior  probabilities  are  not
       calculated by the CYK algorithm.
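
       The seven fields listed above can be read into records with a short script. The following
       Python sketch assumes that each data row holds exactly seven whitespace-separated fields
       in the order given above, and that header or comment lines start with "#". Neither
       assumption is stated on this page, so check it against real output first. The names
       ScoreRow and parse_score_rows are hypothetical.

```python
from typing import List, NamedTuple


class ScoreRow(NamedTuple):
    seq_idx: int
    seq_name: str
    length: int
    total: float      # total bit score
    struct: float     # approximate secondary structure contribution
    avg_prob: float   # average posterior probability
    elapsed: str      # kept as text; the time format is not specified here


def parse_score_rows(lines: List[str]) -> List[ScoreRow]:
    """Parse rows of the default seven-field table (layout is an assumption)."""
    rows = []
    for line in lines:
        fields = line.split()
        # Skip blank lines, assumed "#" header lines, and rows of other widths.
        if len(fields) != 7 or fields[0].startswith("#"):
            continue
        try:
            rows.append(ScoreRow(int(fields[0]), fields[1], int(fields[2]),
                                 float(fields[3]), float(fields[4]),
                                 float(fields[5]), fields[6]))
        except ValueError:
            continue  # not a data row after all
    return rows
```

       With --cyk the "avg prob" field is absent, so rows have a different width and this
       sketch skips them rather than guessing at the layout.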

OPTIONS

       -h     Print  brief  help;  includes  version number and summary of all options, including
              expert options.

       -o <f> Save the alignment in Stockholm format to a file <f>.  The default is to  write  it
              to standard output.

       -l     Turn  on  the  local alignment algorithm, which allows the alignment to span two or
              more subsequences if necessary (e.g. if the  structures  of  the  query  model  and
              target  sequence  are only partially shared), allowing certain large insertions and
              deletions in the structure to be penalized differently  than  normal  indels.   The
              default is to globally align the query model to the target sequences.

       -p     Annotate the alignment with posterior probabilities calculated using the Inside and
              Outside algorithms.  The -p option causes additional annotation to  appear  in  the
              output  alignment,  but does not modify the alignment itself (that is, the relative
              positions of the residues are unchanged).  Two characters for each residue are used
              to  annotate the posterior probability that the corresponding residue aligns at the
              corresponding position in  the  Stockholm  alignment.  These  characters  have  the
              Stockholm  markup  tags  "#=GR <seq name> POSTX." and "#=GR <seq name> POST.X", and
              can only have the values "0"-"9", "*", or ".". They give the tenths and hundredths
              digits of the posterior probability: an "8" for "POSTX." and a "3" for "POST.X"
              indicate that the posterior probability is between 0.83 and 0.84. A "*" for both
              "POSTX." and "POST.X" indicates that the confidence estimate is "very nearly" 1.0
              (it is hard to be exact here due to numerical precision issues). A "." in both
              "POSTX." and "POST.X" indicates that the column is a gap in that sequence. When
              used in combination with
              --nonbanded, the calculation of the posterior probabilities considers all  possible
              alignments  of  the  target  sequence to the CM. Without --nonbanded (in HMM banded
              mode), the calculation considers only possible alignments within the HMM bands.
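
              The two-character encoding described above can be decoded as in the following
              Python sketch. The function names decode_post and decode_rows are hypothetical,
              and returning (1.0, 1.0) for "*" is only a stand-in for "very nearly 1.0".
              Mixed combinations such as a digit paired with "*" are not described on this
              page, so the sketch rejects them.

```python
from typing import List, Optional, Tuple

DIGITS = "0123456789"


def decode_post(first: str, second: str) -> Optional[Tuple[float, float]]:
    """Return the (low, high) probability range for one alignment column.

    first is the character from the "POSTX." line, second the character
    from the "POST.X" line. A gap (".", ".") yields None.
    """
    if first == "." and second == ".":
        return None
    if first == "*" and second == "*":
        return (1.0, 1.0)  # stand-in for "very nearly 1.0"
    if len(first) == 1 and len(second) == 1 and first in DIGITS and second in DIGITS:
        percent = int(first) * 10 + int(second)
        return (percent / 100, (percent + 1) / 100)
    raise ValueError(f"unexpected annotation pair: {first!r}, {second!r}")


def decode_rows(postx: str, post_x: str) -> List[Optional[Tuple[float, float]]]:
    """Decode two annotation strings of equal length, column by column."""
    if len(postx) != len(post_x):
        raise ValueError("annotation lines differ in length")
    return [decode_post(a, b) for a, b in zip(postx, post_x)]
```

              For example, decode_post("8", "3") gives the range 0.83 to 0.84 used in the
              description above.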

       -q     Quiet; suppress the verbose banner, and  only  print  the  resulting  alignment  to
              stdout.  This  allows  piping  the  alignment  to  the input of other programs, for
              example.

       -1     Output the alignment in pfam format, a non-interleaved Stockholm  format  in  which
              each sequence is on a single line.

       --informat <s>
              Assert that the input seqfile is in format <s>.  Do not run Babelfish format
              autodetection. This increases the reliability of the program somewhat, because
              the Babelfish can make mistakes; particularly recommended for unattended,
              high-throughput runs of Infernal.  Acceptable formats are: FASTA, EMBL, UNIPROT,
              GENBANK, and DDBJ.  <s> is case-insensitive.

       --devhelp
              Print  help,  as  with  -h, but also include undocumented developer options. These
              options are not listed below, are under development or experimental,  and  are  not
              guaranteed to even work correctly. Use developer options at your own risk. The only
              resources  for  understanding  what  they  actually  do  are  the  brief   one-line
              description printed when --devhelp is enabled, and the source code.

       --mpi  Run  as an MPI parallel program. This option will only be available if Infernal has
              been configured and built with  the  "--enable-mpi"  flag  (see  User's  Guide  for
              details).

EXPERT OPTIONS

       --optacc
              Align sequences using the Durbin/Holmes optimal accuracy algorithm. This is the
              default behavior, so this option is rarely needed.  The optimal accuracy alignment
              will be constrained by HMM bands for acceleration unless the --nonbanded option is
              enabled.  The optimal accuracy algorithm determines the alignment that maximizes
              the posterior probabilities of the aligned residues within it.  The posterior
              probabilities are determined using (possibly HMM banded) variants of the Inside and
              Outside algorithms.

       --cyk  Do  not  use  the  Durbin/Holmes optimal accuracy alignment to align the sequences,
              instead use the CYK algorithm which determines the optimally scoring  alignment  of
              the sequence to the model.

       --sample
              Sample  an  alignment from the posterior distribution of alignments.  The posterior
              distribution is determined using an HMM banded (unless --nonbanded) variant of  the
              Inside algorithm.

       -s <n> Set  the random number generator seed to <n>, where <n> is a positive integer. This
              option can only be used in combination with --sample.  The default is to use time()
              to  generate  a different seed for each run, which means that two different runs of
              cmalign --sample on the same alignment will give slightly  different  results.  You
              can use this option to generate reproducible results.

       --viterbi
              Do  not use the CM to align the sequences; instead use the HMM Viterbi algorithm to
              align with a CM Plan 9 HMM. The HMM is automatically constructed  to  be  maximally
              similar to the CM.  This HMM alignment is faster than CM alignment, but can be less
              accurate because the structure of the RNA family is ignored.

       --sub  Turn on the sub model construction and alignment procedure. For each  sequence,  an
              HMM  is  first used to predict the model start and end consensus columns, and a new
              sub CM is constructed that only models consensus columns from  start  to  end.  The
              sequence is then aligned to this sub CM.  This option is useful for aligning
              sequences that are known to be truncated (not full length).  This "sub CM"
              procedure is not the same as the "sub CMs" described by Weinberg and Ruzzo.

       --small
              Use the divide and conquer CYK alignment algorithm described in SR Eddy, BMC
              Bioinformatics 3:18, 2002. The --nonbanded option must be used in combination with
              this option.  Also, it is recommended that --small be used whenever --nonbanded is
              used, because standard CM alignment without HMM banding requires a lot of memory,
              especially for large RNAs.  --small allows CM alignment within practical memory
              limits, reducing the memory required for aligning LSU rRNA, the largest known RNAs,
              from 150 Gb to less than 300 Mb.  This option can only be used in combination with
              --nonbanded and --cyk.

       --hbanded
              This option is turned on by default.  Accelerate alignment by pruning away  regions
              of the CM DP matrix that are deemed negligible by an HMM.  First, each sequence is
              scored with a CM Plan 9 HMM derived from the CM, using the Forward and Backward HMM
              algorithms to calculate the posterior probability that each residue aligns to each
              state of the HMM. These posterior probabilities are used to derive constraints
              (bands)  on  the  CM  DP  matrix. Finally, the target sequence is aligned to the CM
              using the banded DP matrix, during which  cells  outside  the  bands  are  ignored.
              Usually  most  of  the full DP matrix lies outside the bands (often more than 95%),
              making this technique faster because fewer DP calculations are required,  and  more
              memory efficient because only cells within the bands need be allocated.

              Importantly, HMM banding sacrifices the guarantee of determining the optimally
              accurate or optimally scoring alignment, which will be missed if it lies outside
              the bands. The tau parameter (analogous to the beta parameter for QDB calculation
              in cmsearch) is the amount of probability mass considered negligible during HMM
              band calculation; higher values of tau yield greater speedups but also a greater
              chance of missing the optimal alignment. The default tau is 1E-7, determined
              empirically as a good tradeoff between sensitivity and speed, though this value can
              be changed with the --tau <x> option. The level of acceleration increases with both the
              length and primary sequence conservation level of the family. For example, with the
              default tau of 1E-7, tRNA models (low primary sequence conservation with length  of
              about 75 residues) show about 10X acceleration, and SSU bacterial rRNA models (high
              primary sequence conservation with length of about 1500 residues) show about  700X.
              HMM banding can be turned off with the --nonbanded option.

       --nonbanded
              Turns  off  HMM  banding.  The  returned alignment is guaranteed to be the globally
              optimally accurate one (by default) or the globally optimally scoring one (if --cyk
              is  enabled).   The  --small option is recommended in combination with this option,
              because standard alignment without HMM banding requires a lot of memory (see
              --small).

       --tau <x>
              Set the tail loss probability used during HMM band calculation to <x>.  This is the
              amount  of  probability  mass  within  the  HMM  posterior  probabilities  that  is
              considered  negligible.  The default value is 1E-7.  In general, higher values will
              result in greater acceleration, but increase the  chance  of  missing  the  optimal
              alignment due to the HMM bands.
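
              The effect of <x> can be illustrated with a toy band calculation. The Python
              sketch below is not Infernal's implementation. It assumes a posterior
              distribution over sequence positions for a single HMM state, and it assumes
              that probability mass of at most tau may be trimmed from each tail. The
              function name band_for_state is hypothetical.

```python
from typing import Sequence, Tuple


def band_for_state(posterior: Sequence[float], tau: float) -> Tuple[int, int]:
    """Return (first, last) positions kept after trimming both tails.

    Positions are dropped from the left while the dropped mass stays at or
    below tau, and likewise from the right. At least one position is kept.
    """
    if not posterior:
        raise ValueError("posterior must not be empty")
    first, last = 0, len(posterior) - 1
    mass = 0.0
    while first < last and mass + posterior[first] <= tau:
        mass += posterior[first]
        first += 1
    mass = 0.0
    while last > first and mass + posterior[last] <= tau:
        mass += posterior[last]
        last -= 1
    return first, last
```

              With the made-up distribution [0.01, 0.04, 0.9, 0.04, 0.01], a tau of 1E-7
              keeps all five positions, 0.02 keeps the middle three, and 0.06 keeps only the
              central one. Narrower bands mean fewer DP cells to fill, and a higher risk that
              the best alignment falls outside them.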

       --mxsize <x>
              Set the maximum allowable DP matrix size to <x> megabytes. By default this size is
              2,048 Mb.  This should be large enough for the vast majority of alignments; if it
              is not, cmalign will exit prematurely and report an error message that the matrix
              exceeded its maximum allowable size. In this case, the --mxsize option can be used
              to raise the limit.  This is most likely to occur when the --nonbanded option is
              used without the --small option, but can still occur when --nonbanded is not used.

       --rna  Output the alignments as RNA sequence alignments. This is true by default.

       --dna  Output the alignments as DNA sequence alignments.

       --matchonly
              Only include match columns in the output alignment; do not include any insertions
              relative to the consensus model.

       --resonly
              Only  include  match  columns  in the output alignment that have at least 1 residue
              (non-gap character) in them. By default  all  match  columns  are  printed  to  the
              alignment,  even  those  that  are  100%  gaps.   --resonly  replicates the default
              behavior of previous versions of cmalign.

       --fins Change the behavior of how insert  emissions  are  placed  in  the  alignment.   By
              default,  all contiguous blocks of inserts are split in half, and half the residues
              are flushed left against the nearest consensus column to the  left,  and  half  are
              flushed right against the nearest consensus column on the right. With --fins,
              inserts are not split in half; instead, all inserted residues from IL states are
              flushed left, and all inserted residues from IR states are flushed right.  --fins
              replicates the default behavior of previous versions of cmalign.

       --onepost
              Modifies behavior of the -p option. Use  only  one  character  instead  of  two  to
              annotate  the posterior probability of each aligned residue. Specifically, only the
              "#=GR <seq name> POSTX." tag is printed to  the  alignment.  An  "8"  for  "POSTX."
              indicates  a  posterior  probability  between  0.8  and  0.9  for the corresponding
              residue.

       --merge
              With --merge the usage of cmalign  changes  to  cmalign  --merge  [options]  cmfile
              msafile1  msafile2.   Merge  the two alignments in msafile1 and msafile2 created by
              previous runs of cmalign with cmfile together into a  single  alignment  and  exit.
              msafile1  and  msafile2  must only have one alignment per file.  This option allows
              the user to split up large sequence files  into  many  smaller  files,  align  them
              independently  to  cmfile  on different computers to get many small alignments, and
              then merge them into a single large alignment.
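
              Because --merge takes exactly two alignment files, combining many small
              alignments takes a chain of merges. The Python sketch below only builds the
              command lines and does not run them. It assumes that -o can be combined with
              --merge, which this page implies but does not state outright. The function
              name plan_merge_commands and the intermediate file names are hypothetical.

```python
from typing import List


def plan_merge_commands(cmfile: str, msafiles: List[str],
                        final_out: str) -> List[List[str]]:
    """Return cmalign --merge command lines that fold msafiles into final_out."""
    if len(msafiles) < 2:
        raise ValueError("need at least two alignment files to merge")
    commands = []
    merged = msafiles[0]
    for step, nxt in enumerate(msafiles[1:], start=1):
        is_last = step == len(msafiles) - 1
        # Intermediate names are placeholders chosen for this sketch.
        out = final_out if is_last else f"merged.{step}.sto"
        commands.append(["cmalign", "--merge", "-o", out, cmfile, merged, nxt])
        merged = out
    return commands
```

              Each returned list can be passed to a process launcher such as Python's
              subprocess.run, one after another, since every step reads the output of the
              previous one.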

       --withali <f>
              Reads an alignment from file <f> and aligns it as a single object to the CM; i.e.
              the alignment in <f> is held fixed.  This allows you to align sequences to a model
              with cmalign and  view  them  in  the  context  of  an  existing  trusted  multiple
              alignment.  The alignment in the file <f> must be exactly the alignment that the CM
              was built from, or a  subset  of  it  with  the  following  special  property:  the
              definition of consensus columns and consensus secondary structure must be identical
              between <f> and the alignment the CM was built from. One easy way to  achieve  this
              is  to  use  the --rf option to cmbuild (see man page for cmbuild) and to maintain
              the "#=GC RF" annotation in the alignment when removing  sequences  to  create  the
              subset  alignment <f>.  To specify that the --rf option to cmbuild was used, enable
              the --rf option to cmalign (see --rf below).

       --withpknots
              Must be used in combination with --withali <f>.  Propagate  structural  information
              for any pseudoknots that exist in <f> to the output alignment.

       --rf   Must  be used in combination with --withali <f>.  Specify that the alignment in <f>
              has the same "#=GC RF" annotation as the alignment file the CM was built from using
              cmbuild  and  further  that the --rf option was supplied to cmbuild when the CM was
              constructed.

       --gapthresh <x>
              Must be used in combination with --withali <f>.  Specify that the  --gapthresh  <x>
              option  was supplied to cmbuild when the CM was constructed from the alignment file
              <f>.

       --tfile <f>
              Dump tabular sequence tracebacks for  each  individual  sequence  to  a  file  <f>.
              Primarily useful for debugging.
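
       Several options above are only valid in combination with others. The Python sketch below
       collects the dependencies stated on this page into a table and checks a proposed option
       set before cmalign is launched. It is a convenience check, not part of Infernal; the names
       REQUIRES and missing_requirements are hypothetical, and cmalign may enforce further rules
       that are not listed here.

```python
from typing import Iterable, List

# Option on the left requires every option on the right, per this page.
REQUIRES = {
    "-s": ["--sample"],
    "--small": ["--nonbanded", "--cyk"],
    "--withpknots": ["--withali"],
    "--rf": ["--withali"],
    "--gapthresh": ["--withali"],
}


def missing_requirements(options: Iterable[str]) -> List[str]:
    """Return one message per unmet dependency among the chosen options."""
    chosen = set(options)
    problems = []
    for opt, needed in REQUIRES.items():
        if opt not in chosen:
            continue
        for req in needed:
            if req not in chosen:
                problems.append(f"{opt} requires {req}")
    return problems
```

       An empty result means that none of the listed dependencies is violated; for example, the
       set --small, --nonbanded yields the single message "--small requires --cyk".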

SEE ALSO

       For  complete  documentation,  see  the  User's  Guide  (Userguide.pdf) that came with the
       distribution; or see the Infernal web page, http://infernal.janelia.org/.

COPYRIGHT

       Copyright (C) 2009 HHMI Janelia Farm Research Campus.
       Freely distributed under the GNU General Public License (GPLv3).
       See the file COPYING that came with the source for details on redistribution conditions.

AUTHOR

       Eric Nawrocki, Diana Kolbe, and Sean Eddy
       HHMI Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147
       http://selab.janelia.org/