Provided by: infernal_1.1.4-1_amd64

NAME

       cmalign - align sequences to a covariance model

SYNOPSIS

       cmalign
              [options] <cmfile> <seqfile>

DESCRIPTION

       cmalign  aligns  the  RNA sequences in <seqfile> to the covariance model (CM) in <cmfile>.
       The new alignment is output to stdout in Stockholm format, but can be redirected to a file
       <f> with the -o <f> option.

       Either  <cmfile>  or  <seqfile> (but not both) may be '-' (dash), which means reading this
       input from stdin rather than a file.

       The sequence file <seqfile> must be in FASTA or Genbank format.

       cmalign uses an HMM banding technique to accelerate  alignment  by  default  as  described
       below for the --hbanded option. HMM banding can be turned off with the --nonbanded option.

       By  default,  cmalign  computes  the  alignment  with  maximum  expected  accuracy that is
       consistent with constraints (bands) derived from an HMM, using a  banded  version  of  the
       Durbin/Holmes  optimal accuracy algorithm.  This behavior can be changed with the --cyk or
       --sample options.

       cmalign takes special care to correctly align truncated sequences, where some  nucleotides
       from  the beginning (5') and/or end (3') of the actual full length biological sequence are
       not present in the input sequence (see DL Kolbe and SR Eddy, Bioinformatics, 25:1236-1243,
       2009).  This behavior is on by default, but can be turned off with --notrunc.  In previous
       versions of cmalign the --sub  option  was  required  to  appropriately  handle  truncated
       sequences. The --sub option is still available in this version, but the new default method
       for handling truncated sequences should be as good or superior to the sub method in nearly
       all cases.

       The --mapali <f> option includes the fixed training alignment used to build the CM, read
       from file <f>, within the output alignment of cmalign.

       It is possible to merge two or more alignments created by the  same  CM  using  the  Easel
       miniapp  esl-alimerge (included in the easel/miniapps/ subdirectory of Infernal). Previous
       versions of cmalign included options to merge alignments but  they  were  deprecated  upon
       development of esl-alimerge, which is significantly more memory efficient.

       By  default, cmalign will output the alignment to stdout.  The alignment can be redirected
       to an output file <f> with the -o  <f>  option.  With  -o,  information  on  each  aligned
       sequence,  including  score and model alignment boundaries will be printed to stdout (more
       on this below).

       The output alignment will be in Stockholm format by default. This can be changed to  Pfam,
       aligned  FASTA  (AFA),  A2M,  Clustal,  or Phylip format using the --outformat <s> option,
       where <s> is the name of the desired format.  As a special case, if the  output  alignment
       is  large  (more than 10,000 sequences or more than 10,000,000 total nucleotides) then the
       output format will be Pfam format, with each sequence appearing  on  a  single  line,  for
       reasons  of memory efficiency. For alignments larger than this, using --ileaved will force
       interleaved Stockholm format, but the user should be aware that this may require a lot  of
       memory.   --ileaved  will  only work for alignments up to 100,000 sequences or 100,000,000
       total nucleotides.

       If the output alignment format  is  Stockholm  or  Pfam,  the  output  alignment  will  be
       annotated with posterior probabilities which estimate the confidence level of each aligned
       nucleotide.  This annotation appears as lines beginning with "#=GR <seq name> PP", one per
       sequence,  each  immediately  below  the  corresponding  aligned  sequence  "<seq  name>".
       Characters in PP lines have 12 possible values: "0-9", "*", or ".". If ".",  the  position
       corresponds  to a gap in the sequence. A value of "0" indicates a posterior probability of
       between 0.0 and 0.05, "1" indicates between 0.05 and 0.15, "2" indicates between 0.15  and
       0.25 and so on up to "9" which indicates between 0.85 and 0.95. A value of "*" indicates a
       posterior probability of between 0.95 and 1.0. Higher posterior  probabilities  correspond
       to  greater  confidence  that  the  aligned  nucleotide  belongs  where  it appears in the
       alignment.  With --nonbanded, the calculation of the posterior probabilities considers all
       possible alignments of the target sequence to the CM. Without --nonbanded (i.e. in default
       mode), the calculation considers only possible alignments within the HMM  bands.  Further,
       the  posterior  probabilities are conditional on the truncation mode of the alignment. For
       example, if the sequence alignment is truncated 5', a PP value of  "9"  indicates  between
       0.85  and  0.95  of  all 5' truncated alignments include the given nucleotide at the given
       position.  The posterior annotation can be turned off with the --noprob option. If --small
       is enabled, posterior annotation must also be turned off using --noprob.
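
       As an illustration of the PP encoding described above, the following Python sketch maps a
       single PP character to the probability interval it stands for. The function name pp_range
       is hypothetical and is not part of Infernal; the interval boundaries are taken directly
       from the text above.

```python
def pp_range(ch):
    """Map one PP annotation character to its (low, high) probability interval.

    Returns None for ".", which marks a gap in the sequence.
    pp_range is an illustrative helper, not an Infernal function.
    """
    if len(ch) != 1:
        raise ValueError("expected a single character")
    if ch == ".":
        return None
    if ch == "*":
        return (0.95, 1.0)
    if ch == "0":
        return (0.0, 0.05)
    if ch in "123456789":
        mid = int(ch) / 10
        # Round to two decimals to avoid float noise such as 0.15000000000000002.
        return (round(mid - 0.05, 2), round(mid + 0.05, 2))
    raise ValueError(f"not a PP character: {ch!r}")
```

       Under these assumptions, a "9" maps to the interval from 0.85 to 0.95, and "." maps to
       None because it marks a gap rather than a probability.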

       The  tabular  output  that is printed to stdout if the -o option is used includes one line
       per sequence and twelve fields per line: "idx": the index of the sequence in the input
       file; "seq name": the sequence name; "length": the length of the sequence; "cm from" and
       "cm to": the model start and end positions of the alignment; "trunc": "no" if the sequence
       is not truncated, "5'" if the beginning of the sequence is truncated, "3'" if the end of
       the sequence is truncated, and "5'&3'" if both the beginning and the end are truncated;
       "bit sc": the bit score of the alignment; "avg pp": the average posterior probability of
       all aligned nucleotides in the alignment; "band calc", "alignment" and "total": the time
       in  seconds  required  for  calculating  HMM  bands, computing the alignment, and complete
       processing of the sequence, respectively; "mem (Mb)":  the  size  in  Mb  of  all  dynamic
       programming  matrices  required for aligning the sequence.  This tabular data can be saved
       to file <f> with the --sfile <f> option.
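
       The twelve fields listed above could be read into a script along the following lines.
       This is a minimal Python sketch that assumes the fields are separated by whitespace, that
       sequence names contain no whitespace, and that header lines begin with "#"; none of these
       layout details is stated on this page, so check them against a real file. The names
       AlignRow and parse_sfile are hypothetical. Output produced with --verbose carries
       additional fields and would be rejected by this sketch.

```python
from typing import Iterable, Iterator, NamedTuple


class AlignRow(NamedTuple):
    """One data line of the tabular output, using the twelve documented fields."""
    idx: int
    seq_name: str
    length: int
    cm_from: int
    cm_to: int
    trunc: str        # "no", "5'", "3'" or "5'&3'"
    bit_sc: float
    avg_pp: float
    band_calc: float  # seconds
    alignment: float  # seconds
    total: float      # seconds
    mem_mb: float


def parse_sfile(lines: Iterable[str]) -> Iterator[AlignRow]:
    """Yield AlignRow records, skipping blank lines and '#' lines (assumed layout)."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split()
        if len(f) != 12:
            raise ValueError(f"expected 12 fields, got {len(f)}: {line!r}")
        yield AlignRow(int(f[0]), f[1], int(f[2]), int(f[3]), int(f[4]), f[5],
                       *(float(x) for x in f[6:]))


# Invented sample line for illustration only; not real cmalign output.
sample = ["# made-up header", "1 seqA 74 1 71 no 67.71 0.981 0.00 0.01 0.01 0.08"]
rows = list(parse_sfile(sample))
```

       With a real file, the same function can be applied to an open file handle, for example
       one opened on the file given to --sfile.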

OPTIONS

       -h     Help; print a brief reminder of command line usage and available options.

       -o <f> Save the alignment in Stockholm format to a file <f>.  The default is to  write  it
              to standard output.

       -g     Configure the model for global alignment of the query model to the target
              sequences.  By default, the model is configured for local alignment.  Local
              alignment allows large insertions and deletions in the structure, called "local
              ends", to be penalized differently than normal indels.  These are annotated as
              "~" columns in the RF line of the output alignment.  The -g option can be used to
              disallow these local ends.  The -g option is required if the --sub option is also
              used.

OPTIONS FOR CONTROLLING THE ALIGNMENT ALGORITHM

       --optacc
              Align  sequences  using  the  Durbin/Holmes optimal accuracy algorithm. This is the
              default.  The optimal accuracy alignment will  be  constrained  by  HMM  bands  for
              acceleration  unless  the  --nonbanded  option  is  enabled.   The optimal accuracy
              algorithm determines the alignment that maximizes the  posterior  probabilities  of
              the aligned nucleotides within it.  The posterior probabilities are determined using
              (possibly HMM banded) variants of the Inside and Outside algorithms.

       --cyk  Do not use the Durbin/Holmes optimal accuracy algorithm to align the sequences;
              instead use the CYK algorithm, which determines the optimally scoring (maximum
              likelihood) alignment of the sequence to the model, given the HMM bands (unless
              --nonbanded is also enabled).

       --sample
              Sample  an  alignment from the posterior distribution of alignments.  The posterior
              distribution is determined using an HMM banded (unless --nonbanded) variant of  the
              Inside algorithm.

       --seed <n>
              Seed  the  random number generator with <n>, an integer >= 0.  This option can only
              be used in combination with --sample.  If <n> is nonzero,  stochastic  sampling  of
              alignments  will  be reproducible; the same command will give the same results.  If
              <n> is 0, the  random  number  generator  is  seeded  arbitrarily,  and  stochastic
              samplings may vary from run to run of the same command.  The default seed is 181.

       --notrunc
              Turn  off  truncated alignment algorithms.  All sequences in the input file will be
              assumed to be full length, unless --sub is also used, in which case the program can
              still  handle  truncated  sequences  but will use an alternative strategy for their
              alignment.

       --sub  Turn on the sub model construction and alignment procedure. For each  sequence,  an
              HMM  is  first used to predict the model start and end consensus columns, and a new
              sub CM is constructed that only models consensus columns from  start  to  end.  The
              sequence is then aligned to this sub CM.  Sub alignment is an older method than the
              default one for aligning sequences that are possibly truncated. By default, cmalign
              uses  special  DP  algorithms  to  handle  truncated sequences which should be more
              accurate than the sub method in most cases.  --sub is still included as  an  option
              mainly for testing against this default truncated sequence handling.  This "sub CM"
              procedure is not the same as the "sub CMs" described by Weinberg and Ruzzo.

OPTIONS FOR CONTROLLING SPEED AND MEMORY REQUIREMENTS

       --hbanded
              This option is turned on by default. Accelerate alignment by pruning  away  regions
              of  the CM DP matrix that are deemed negligible by an HMM.  First, each sequence is
              scored with a CM plan 9 HMM derived from the CM using the Forward and Backward  HMM
              algorithms to calculate posterior probabilities that each nucleotide aligns to each
              state of the HMM. These posterior probabilities  are  used  to  derive  constraints
              (bands)  on  the  CM  DP  matrix. Finally, the target sequence is aligned to the CM
              using the banded DP matrix, during which  cells  outside  the  bands  are  ignored.
              Usually  most  of  the full DP matrix lies outside the bands (often more than 95%),
              making this technique faster because fewer DP calculations are required,  and  more
              memory efficient because only cells within the bands need be allocated.

              Importantly, HMM banding sacrifices the guarantee of determining the optimally
              accurate or optimal alignment, which will be missed if it lies outside the bands.
              The tau parameter is the amount of probability mass considered negligible during
              HMM band calculation; higher values of tau yield greater speedups but also a greater
              chance of missing the optimal alignment.  The default tau is 1E-7, determined
              empirically as a good tradeoff between sensitivity and speed, though this value can
              be  changed  with  the  --tau  <x> option. The level of acceleration increases with
              both the length and primary sequence conservation level of the family. For example,
              with  the  default tau of 1E-7, tRNA models (low primary sequence conservation with
              length of about 75 nucleotides) show about 10X acceleration, and SSU bacterial rRNA
              models  (high  primary sequence conservation with length of about 1500 nucleotides)
              show about 700X.  HMM banding can be turned off with the --nonbanded option.

       --tau <x>
              Set the tail loss probability used during HMM band calculation to <x>.  This is the
              amount  of  probability  mass  within  the  HMM  posterior  probabilities  that  is
              considered negligible. The default value is 1E-7.  In general, higher  values  will
              result  in  greater  acceleration,  but  increase the chance of missing the optimal
              alignment due to the HMM bands.
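
              To make the role of tau concrete, the following Python sketch derives a band from
              a toy posterior distribution by discarding up to tau of probability mass from each
              end. It is a simplified illustration, not Infernal's implementation: the function
              name band_from_posteriors is hypothetical, and applying tau separately to each
              tail is an assumption made for this example.

```python
def band_from_posteriors(post, tau=1e-7):
    """Return inclusive (lo, hi) indices after trimming at most tau mass per tail.

    post is a list of posterior probabilities over positions for one HMM state.
    Illustrative only; per-tail trimming is an assumption of this sketch.
    """
    lo, mass = 0, 0.0
    # Drop positions from the left while the discarded mass stays within tau.
    while lo < len(post) - 1 and mass + post[lo] <= tau:
        mass += post[lo]
        lo += 1
    hi, mass = len(post) - 1, 0.0
    # Drop positions from the right in the same way, never crossing lo.
    while hi > lo and mass + post[hi] <= tau:
        mass += post[hi]
        hi -= 1
    return lo, hi


toy = [0.001, 0.009, 0.49, 0.49, 0.009, 0.001]
default_band = band_from_posteriors(toy)           # nothing is negligible at 1E-7
narrow_band = band_from_posteriors(toy, tau=0.02)  # larger tau, tighter band
```

              In this toy case the larger tau yields the narrower band, which matches the
              tradeoff described above: fewer cells to compute, but a greater chance that the
              best alignment falls outside the band.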

       --mxsize <x>
              Set the maximum allowable total DP matrix size to <x> megabytes.  By default this
              size is 1028 Mb.  This should be large enough for the vast majority of alignments.
              If it is not, cmalign will attempt to iteratively tighten the HMM bands it uses to
              constrain the alignment by raising the tau parameter and recalculating the bands
              until the total matrix size needed falls below <x> megabytes or the maximum
              allowable tau value (0.05 by default, but changeable with --maxtau) is reached.  At
              each iteration of band tightening, tau is multiplied by 2.0.  The band tightening
              strategy can be turned off with the --fixedtau option.  If the maximum tau is
              reached and the required matrix size still exceeds <x>, or if HMM banding is not
              being used and the required matrix size exceeds <x>, then cmalign will exit
              prematurely and report an error message that the matrix exceeded its maximum
              allowable size.  In this case, the --mxsize option can be used to raise the size
              limit, or the maximum tau can be raised with --maxtau.  The limit will commonly be
              exceeded when the --nonbanded option is used without the --small option, but can
              still occur when --nonbanded is not used.  Note that if cmalign is being run in <n>
              threads on a multicore machine, then each thread may have an allocated matrix of up
              to size <x> Mb at any given time.
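
              The band tightening loop described above can be sketched in Python as follows.
              This is an illustration under stated assumptions, not cmalign's code: the callable
              matrix_mb is a hypothetical stand-in for recalculating the bands and measuring the
              matrix size at a given tau, and clamping the final step to the maximum tau is an
              assumption, since this page does not say how the last step is handled.

```python
def tighten_tau(matrix_mb, tau=1e-7, maxtau=0.05, mxsize=1028.0, factor=2.0):
    """Raise tau by `factor` until the matrix fits in mxsize Mb or maxtau is reached.

    matrix_mb(tau) is a hypothetical callback returning the required matrix size in Mb.
    Defaults mirror the documented values (tau 1E-7, max tau 0.05, 1028 Mb, factor 2.0).
    """
    while matrix_mb(tau) > mxsize:
        if tau >= maxtau:
            raise RuntimeError("matrix exceeds maximum allowable size at maximum tau")
        # Assumption of this sketch: the last step is clamped to maxtau.
        tau = min(tau * factor, maxtau)
    return tau


def toy_size(tau):
    """Made-up size model: the matrix shrinks once tau reaches 3E-7."""
    return 2000.0 if tau < 3e-7 else 500.0


final_tau = tighten_tau(toy_size)
```

              With the toy size model, two doublings are enough (1E-7, 2E-7, 4E-7). From the
              default starting value, tau can double 18 times before the next doubling would
              pass the default maximum of 0.05.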

       --fixedtau
              Turn off the HMM band tightening strategy  described  in  the  explanation  of  the
              --mxsize option above.

       --maxtau <x>
              Set  the  maximum  allowed  value  for tau during band tightening, described in the
              explanation of --mxsize above, to <x>.  By default this value is 0.05.

       --nonbanded
              Turns off HMM banding. The returned alignment is  guaranteed  to  be  the  globally
              optimally accurate one (by default) or the globally optimally scoring one (if --cyk
              is enabled).  The --small option is recommended in combination  with  this  option,
              because standard alignment without HMM banding requires a lot of memory (see
              --small).

       --small
              Use the divide and conquer CYK alignment algorithm described in SR Eddy, BMC
              Bioinformatics 3:18, 2002.  The --nonbanded option must be used in combination with
              this option.  Also, it is recommended that --small be used whenever --nonbanded is
              used, because standard CM alignment without HMM banding requires a lot of memory,
              especially for large RNAs.  --small allows CM alignment within practical memory
              limits, reducing the memory required for aligning LSU rRNA, the largest known
              RNAs, from 150 Gb to less than 300 Mb.  This option can only be used in
              combination with --nonbanded, --notrunc, and --cyk.

OPTIONAL OUTPUT FILES

       --sfile <f>
              Dump per-sequence alignment score and timing information to file <f>.  The format of
              this file is described above (it's the same data in the same format as the  tabular
              stdout output when the -o option is used).

       --tfile <f>
              Dump  tabular  sequence  tracebacks  for  each  individual  sequence to a file <f>.
              Primarily useful for debugging.

       --ifile <f>
              Dump per-sequence insert information to file  <f>.   The  format  of  the  file  is
              described  by  "#"-prefixed comment lines included at the top of the file <f>.  The
              insert information is valid even when the --matchonly option is used.

       --elfile <f>
              Dump per-sequence EL state (local end) insert information to file <f>.  The  format
              of  the  file is described by "#"-prefixed comment lines included at the top of the
              file <f>.  The EL insert information is valid even when the --matchonly  option  is
              used.

OTHER OPTIONS

       --mapali <f>
              Reads the alignment from file <f> used to build the model and aligns it as a single
              object to the CM; i.e. the alignment in <f> is held fixed.  This allows you to
              align sequences to a model with cmalign and view them in the context of an existing
              trusted multiple alignment.  <f> must be the alignment file that the CM  was  built
              from.  The  program verifies that the checksum of the file matches that of the file
              used to construct the CM. A similar option to this  one  was  called  --withali  in
              previous versions of cmalign.

       --mapstr
              Must  be  used  in combination with --mapali <f>.  Propagate structural information
              for any pseudoknots that exist in <f> to the output alignment. A similar option  to
              this one was called --withstr in previous versions of cmalign.

       --informat <s>
              Assert that the input <seqfile> is in format <s>.  Do not run Babelfish format
              autodetection.  This increases the reliability of the program somewhat, because the
              Babelfish can make mistakes; particularly recommended for unattended,
              high-throughput runs of Infernal.  Acceptable formats are: FASTA, GENBANK, and DDBJ.
              <s> is case-insensitive.

       --outformat <s>
              Specify  the  output  alignment  format as <s>.  Acceptable formats are: Pfam, AFA,
              A2M, Clustal, and Phylip.  AFA is aligned fasta. Only Pfam and Stockholm  alignment
              formats  will  include  consensus  structure  annotation  and posterior probability
              annotation of aligned residues.

       --dnaout
              Output the alignments as DNA sequence alignments, instead of RNA ones.

       --noprob
              Do not annotate the output alignment with posterior probabilities.

       --matchonly
              Only include match columns in the output alignment; do not include any insertions
              relative to the consensus model. This option may be useful when creating very large
              alignments that require a lot of memory and disk space, most of which is  necessary
              only to deal with insert columns that are gaps in most sequences.

       --ileaved
              Output  the  alignment in interleaved Stockholm format of a fixed width that may be
              more convenient for examination. This was the default output  alignment  format  of
              previous  versions  of  cmalign.   Note that cmalign requires more memory when this
              option is used.  For this reason, --ileaved will only work for alignments of up  to
              100,000 sequences or a total of 100,000,000 aligned nucleotides.

       --regress <s>
              Save  an additional copy of the output alignment with no author information to file
              <s>.

       --verbose
              Output additional information in the tabular scores output (output to stdout if  -o
              is used, or to <f> if --sfile <f> is used).  This is mainly useful for testing and
              debugging.

       --cpu <n>
              Specify that <n> parallel CPU workers be used. If <n>  is  set  as  "0",  then  the
              program  will  be  run in serial mode, without using threads.  You can also control
              this number by setting an environment variable, INFERNAL_NCPU.   This  option  will
              only  be  available  if the machine on which Infernal was built is capable of using
              POSIX  threading  (see  the  Installation  section  of  the  user  guide  for  more
              information).

       --mpi  Run  as an MPI parallel program. This option will only be available if Infernal has
              been configured and built  with  the  "--enable-mpi"  flag  (see  the  Installation
              section of the user guide for more information).
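
       Several options on this page are documented as requiring other options. The following
       Python sketch collects those stated requirements so that a wrapper script can check a
       command line before launching cmalign. The table and the function missing_companions are
       hypothetical helpers that only restate rules from this page; cmalign performs its own
       checks, and there may be constraints not captured here.

```python
# Companion options as stated on this page (option -> options it requires).
REQUIRES = {
    "--small": {"--nonbanded", "--notrunc", "--cyk", "--noprob"},
    "--seed": {"--sample"},
    "--mapstr": {"--mapali"},
    "--sub": {"-g"},
}


def missing_companions(options):
    """Return {option: [missing required options]} for the given option names.

    `options` holds option names only (no values), e.g. ["--small", "--cyk"].
    Illustrative helper; it is not part of Infernal.
    """
    given = set(options)
    problems = {}
    for opt, needed in REQUIRES.items():
        if opt in given and needed - given:
            problems[opt] = sorted(needed - given)
    return problems


ok = missing_companions(["--small", "--nonbanded", "--notrunc", "--cyk", "--noprob"])
bad = missing_companions(["--small", "--nonbanded", "--cyk", "--mapstr"])
```

       An empty result means that none of the listed requirements is violated; otherwise each
       key names an option and its value lists the companions that still need to be added.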

SEE ALSO

       See  infernal(1)  for  a  master  man page with a list of all the individual man pages for
       programs in the Infernal package.

       For complete documentation, see the user guide that came with your  Infernal  distribution
       (Userguide.pdf); or see the Infernal web page (http://eddylab.org/infernal/).

       Copyright (C) 2020 Howard Hughes Medical Institute.
       Freely distributed under the BSD open source license.

       For  additional  information  on copyright and licensing, see the file called COPYRIGHT in
       your    Infernal    source    distribution,    or    see    the    Infernal    web    page
       (http://eddylab.org/infernal/).

AUTHOR

       http://eddylab.org