Provided by: infernal_1.0.2-2_amd64

NAME

       cmbuild - construct a CM from an RNA multiple sequence alignment

SYNOPSIS

       cmbuild [options] cmfile alifile

DESCRIPTION

       cmbuild reads an RNA multiple sequence alignment from alifile, constructs a covariance
       model (CM), and saves the CM to cmfile.

       The alignment file must be in Stockholm  format,  and  must  contain  consensus  secondary
       structure  annotation.  cmbuild uses the consensus structure to determine the architecture
       of the CM.

       The alignment file may be a database containing more  than  one  alignment.   If  so,  the
       resulting cmfile will be a database of CMs, one per alignment.

       The expert options --ctarget, --cmaxid, and --call result in multiple CMs being built
       from each alignment in alifile, as described below.

OUTPUT

       The default output from cmbuild is tabular, with a single line printed for each model.
       Each line has the following fields: aln: the index of the alignment used to build the CM;
       cm idx: the index of the CM in the cmfile; name: the name of the CM; nseq: the number of
       sequences in the alignment used to build the CM; eff_nseq: the effective number of
       sequences used to build the model (see the User Guide); alen: the length of the alignment
       used to build the CM; clen: the number of columns from the alignment defined as consensus
       columns; rel entropy, CM: the total relative entropy of the model divided by the number of
       consensus columns; rel entropy, HMM: the total relative entropy of the model ignoring
       secondary structure divided by the number of consensus columns.

OPTIONS

       -h     Print brief help; includes version number and summary  of  all  options,  including
              expert options.

       -n <s> Name  the  covariance  model <s>.  (Does not work if alifile contains more than one
              alignment).  The default is to use the name of the alignment (given by the #=GF  ID
              tag,  in  Stockholm  format),  or  if  that  is not present, to use the name of the
              alignment file minus any file type extension plus a  "-"  and  a  positive  integer
              indicating the position of that alignment in the file (that is, the first alignment
              in a file "myrnas.sto" would give a CM named "myrnas-1", the second alignment would
              give a CM named "myrnas-2").
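
              The default naming rule above can be sketched in Python. This is only an
              illustration of the rule as described here, not cmbuild's code; the function
              name and the choice to strip the directory part and only the last extension
              are assumptions.

```python
import os


def default_cm_name(alifile, aln_index, gf_id=None):
    """Return the default CM name for the aln_index-th alignment (1-based).

    gf_id is the value of the #=GF ID tag, or None if the alignment has none.
    """
    if gf_id:
        # The alignment's own name takes precedence when it is annotated.
        return gf_id
    # Assumption: drop the directory and the last file extension only.
    stem = os.path.splitext(os.path.basename(alifile))[0]
    return f"{stem}-{aln_index}"
```

              For instance, default_cm_name("myrnas.sto", 2) gives "myrnas-2", matching the
              example in the text.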

       -A     Append the CM to cmfile, if cmfile already exists.

       -F     Allow  cmfile  to be overwritten. Normally, if cmfile already exists, cmbuild exits
              with an error unless the -A or -F option is set.

       -v     Run in verbose output mode instead of using the default single line tabular format.
              This output format is similar to that used by older versions of Infernal.

       --iins Allow  informative insert emissions for the CM.  By default, all CM insert emission
              scores are set to 0.0 bits.  The motivation for zero bit scores is to  avoid  high-
              scoring  hits  to  low  complexity  sequence  favored by high insert state emission
              scores.

       --Wbeta <x>
              Set the beta tail loss probability for query-dependent banding (QDB) to <x>.  The QDB
              algorithm is used to determine the maximum length of a hit to the model. For more
              information on QDB see (Nawrocki and Eddy, PLoS Computational Biology  3(3):  e56).
              The beta parameter is the amount of probability mass considered negligible during
              band calculation; higher values of beta will result in shorter maximum hit lengths,
              which will yield faster searches.  The default beta is 1E-7: determined empirically
              as a good tradeoff between sensitivity, specificity  and speed.

       --devhelp
              Print help, as with -h, but also include undocumented developer options. These
              options  are  not listed below. They are under development or experimental, and are
              not guaranteed to even work correctly. Use developer options at your own risk.  The
              only  resources  for  understanding  what  they  actually do are the brief one-line
              description printed when --devhelp is enabled, and the source code.

EXPERT OPTIONS

       --rsearch <f>
              Parameterize emission scores a la RSEARCH, using the RIBOSUM matrix in file <f>.
              (The emission scores will not be identical to RIBOSUM scores because of
              differences in the modelling strategy between Infernal and RSEARCH, but they will
              be as similar as possible.)  RIBOSUM matrix files are included with Infernal in the
              "matrices/" subdirectory of the top-level Infernal directory.  RIBOSUM matrices are
              substitution score matrices trained specifically for structural RNAs with separate
              single stranded residue and base pair substitution scores. For more information see
              the RSEARCH publication (Klein and Eddy, BMC Bioinformatics 4:44, 2003).

              With --rsearch enabled, all alignments in alifile must contain exactly one sequence
              or the --call option must also be enabled.

       --binary
              Save  the  model  in  a compact binary format. The default is a more readable ASCII
              text format.

       --rf   Use reference coordinate annotation (#=GC RF line, in Stockholm) to determine which
              columns  are  consensus,  and which are inserts.  Any non-gap character indicates a
              consensus column. (For example, mark consensus columns with "x", and insert columns
              with ".".)  The default is to determine this automatically; if the frequency of gap
              characters in a column is greater than a threshold, gapthresh  (default  0.5),  the
              column is called an insertion.

       --gapthresh <x>
              Set  the  gap  threshold  (used for determining which columns are insertions versus
              consensus; see --rf above) to <x>.  The default is 0.5.
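
              The two ways of choosing consensus columns (--rf and the gap threshold) can be
              sketched as follows. This is a simplified illustration: it counts gaps without
              sequence weights, and the set of gap characters is an assumption of the sketch.

```python
GAP_CHARS = "-."  # assumption: characters treated as gaps in this sketch


def consensus_from_gaps(aligned_seqs, gapthresh=0.5):
    """Flag each column True (consensus) or False (insert) by gap frequency.

    A column is called an insertion only if its gap frequency is greater
    than gapthresh, so a column exactly at the threshold stays consensus.
    """
    flags = []
    for column in zip(*aligned_seqs):
        gap_freq = sum(ch in GAP_CHARS for ch in column) / len(column)
        flags.append(gap_freq <= gapthresh)
    return flags


def consensus_from_rf(rf_line):
    """With --rf, any non-gap character in the #=GC RF line marks a consensus column."""
    return [ch not in GAP_CHARS for ch in rf_line]
```

              With the four aligned sequences "GA-U", "GA-U", "G-CU" and "G--U", the third
              column has a gap frequency of 0.75 and is the only one called an insert.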

       --ignorant
              Strip all base pair secondary structure information from all  input  alignments  in
              alifile  before  building  the CM(s). All resulting CM(s) will have zero MATP (base
              pair) nodes, with zero bifurcations.

       --wgsc Use the Gerstein/Sonnhammer/Chothia (GSC) weighting algorithm. This is the  default
              unless  the number of sequences in the alignment exceeds a cutoff (see --pbswitch),
              in which case the default becomes  the  faster  Henikoff  position-based  weighting
              scheme.

       --wblosum
              Use  the BLOSUM filtering algorithm to weight the sequences, instead of the default
              GSC weighting.  Cluster the sequences at a given percentage identity  (see  --wid);
              assign  each cluster a total weight of 1.0, distributed equally amongst the members
              of that cluster.
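
              A minimal Python sketch of this weighting idea follows. The clustering method
              (single linkage) and the way pairwise identity is computed are assumptions
              made for illustration; only the rule "each cluster gets total weight 1.0,
              shared equally" comes from the description above.

```python
def pairwise_identity(a, b, gaps="-."):
    """Fraction of identical residues over columns where both sequences have a residue."""
    pairs = [(x, y) for x, y in zip(a, b) if x not in gaps and y not in gaps]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def single_linkage_clusters(seqs, min_id):
    """Cluster sequence indices; any pair with identity >= min_id shares a cluster."""
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent links to the cluster representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if pairwise_identity(seqs[i], seqs[j]) >= min_id:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


def blosum_weights(seqs, min_id):
    """Give each cluster a total weight of 1.0, split equally among its members."""
    weights = [0.0] * len(seqs)
    for members in single_linkage_clusters(seqs, min_id):
        for i in members:
            weights[i] = 1.0 / len(members)
    return weights
```

              For example, two nearly identical sequences and one unrelated sequence
              clustered at 0.8 identity receive weights 0.5, 0.5 and 1.0.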

       --wpb  Use  the  Henikoff  position-based  weighting  scheme.  This  weighting  scheme  is
              automatically  used (overriding --wgsc and --wblosum) if the number of sequences in
              the alignment exceeds a cutoff (see --pbswitch).
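
              For orientation, the general idea of Henikoff position-based weighting can be
              sketched in Python as below. Skipping gap characters and rescaling the weights
              to sum to the number of sequences are assumptions of this sketch; cmbuild's
              implementation may differ in such details.

```python
from collections import Counter


def henikoff_pb_weights(seqs, gaps="-."):
    """Position-based weights: in each column a residue contributes 1 / (k * n),
    where k is the number of distinct residues in the column and n is the
    number of sequences sharing this sequence's residue in that column."""
    nseq = len(seqs)
    weights = [0.0] * nseq
    for column in zip(*seqs):
        counts = Counter(ch for ch in column if ch not in gaps)
        k = len(counts)
        for i, ch in enumerate(column):
            if ch not in gaps:
                weights[i] += 1.0 / (k * counts[ch])
    total = sum(weights)
    if total == 0.0:
        # Degenerate all-gap alignment: fall back to uniform weights.
        return [1.0] * nseq
    # Assumption: rescale so the weights sum to the number of sequences.
    return [w * nseq / total for w in weights]
```

              Sequences that carry rare residues in many columns end up with larger weights
              than sequences that resemble the majority.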

       --wnone
              Turn sequence weighting off; i.e., explicitly set all sequence weights to 1.0.

       --wgiven
              Use sequence weights as given in annotation in the  input  alignment  file.  If  no
              weights  were  given,  assume  they  are  all 1.0.  The default is to determine new
              sequence  weights  by  the  Gerstein/Sonnhammer/Chothia  algorithm,  ignoring   any
              annotated weights.

       --pbswitch <n>
              Set  the  cutoff  for  automatically switching the weighting method to the Henikoff
              position-based weighting scheme to <n>.  If the number of sequences in the
              alignment exceeds <n>, Henikoff weighting is used.  By default <n> is 5000.

       --wid <x>
              Controls  the  behavior  of  the  --wblosum weighting option by setting the percent
              identity for clustering the alignment to <x>.

       --eent Use the entropy weighting strategy to determine the effective sequence number  that
              gives  a  target mean match state relative entropy. This option is the default, and
              can be turned off with --enone.  The  default  target  mean  match  state  relative
              entropy  is  0.59  bits but can be changed with --ere.  The default of 0.59 bits is
              automatically changed if the total relative entropy  of  the  model  (summed  match
              state relative entropy) is less than a cutoff, which is 6.0 bits by default, but
              can be changed with the expert, undocumented --eX option. If  you  really  want  to
              play with that option, consult the source code.

       --enone
              Turn  off the entropy weighting strategy. The effective sequence number is just the
              number of sequences in the alignment.

       --ere <x>
              Set the target mean match state relative entropy as <x>.   By  default  the  target
              relative entropy per match position is 0.59 bits.
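
              The quantity being targeted can be illustrated with a short Python sketch. It
              handles single-nucleotide emission distributions only (base-pair states are
              left out), uses the default null of 0.25 per nucleotide, and all function
              names are assumptions of the sketch.

```python
import math

# Default null model: 0.25 for each RNA nucleotide.
NULL = {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25}


def relative_entropy(probs, null=NULL):
    """Relative entropy in bits of one emission distribution versus the null model."""
    return sum(p * math.log2(p / null[x]) for x, p in probs.items() if p > 0.0)


def mean_match_relative_entropy(columns, null=NULL):
    """Average relative entropy over a list of match-state emission distributions."""
    return sum(relative_entropy(c, null) for c in columns) / len(columns)
```

              A perfectly conserved position scores 2.0 bits and a uniform one 0.0 bits, so
              a target of 0.59 bits corresponds to fairly weakly conserved positions on
              average.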

       --null <f>
              Read a null model from <f>.  The null model defines the probability of each RNA
              nucleotide in background sequence; the default is to use 0.25 for each nucleotide.
              The format of null files is documented in the User's Guide.

       --prior <f>
              Read  a  Dirichlet  prior  from  <f>, replacing the default mixture Dirichlet.  The
              format of prior files is documented in the User's Guide.

       --ctarget <n>
              Cluster each alignment in alifile by percent identity. Find  a  cutoff  percent  id
              threshold  that  gives  exactly  <n>  clusters  and  build  a separate CM from each
              cluster. If <n> is greater than the  number  of  sequences  in  the  alignment  the
              program  will  not  complain,  and  each  sequence in the alignment will be its own
              cluster.  Each CM will have a positive integer appended to its name indicating  the
              order  in  which  it  was built. For example, if cmbuild --ctarget 3 is called with
              alifile "myrnas.sto", and "myrnas.sto" has exactly one Stockholm  alignment  in  it
              with  no  #=GF  ID tag annotation, three CMs will be built, the first will be named
              "myrnas-1.1", the second, "myrnas-1.2", and the third "myrnas-1.3".  (As  explained
              above  for  the -n option, the first number "1" after "myrnas" indicates the CM was
              built from the first alignment in "myrnas.sto".)

       --cmaxid <x>
              Cluster each sequence alignment in alifile by percent identity. Define clusters  at
              the  cutoff  fractional  id  similarity  of  <x>  and build a separate CM from each
              cluster.  No two sequences will be more than <x> fractionally identical (<x> *
              100 percent identical) if those two sequences are in different clusters.  The CMs
              are named as described above for --ctarget.

       --call Build a separate CM from each sequence in each alignment in alifile.  Naming of CMs
              takes  place  as  described  above for --ctarget.  Using this option in combination
              with --rsearch causes a separate CM to be built and parameterized using  a  RIBOSUM
              matrix for each sequence in alifile.

       --corig
              After building multiple CMs using --ctarget, --cmaxid or --call as described
              above, build a final CM using the complete original alignment  from  alifile.   The
              CMs  are  named as described above for --ctarget with the exception of the final CM
              built from the original alignment which is named in the default manner, without  an
              appended integer.
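
              The naming of per-cluster CMs, with and without --corig, can be sketched as
              follows. The function and its arguments are hypothetical; base_name stands for
              the default name the alignment would otherwise receive (for example
              "myrnas-1").

```python
def cluster_cm_names(base_name, n_clusters, corig=False):
    """List CM names in build order for one alignment split into clusters."""
    # Each cluster CM gets ".<k>" appended, counting from 1.
    names = [f"{base_name}.{k}" for k in range(1, n_clusters + 1)]
    if corig:
        # The final CM, built from the whole alignment, keeps the default name.
        names.append(base_name)
    return names
```

              Calling cluster_cm_names("myrnas-1", 3, corig=True) yields "myrnas-1.1",
              "myrnas-1.2", "myrnas-1.3" and finally "myrnas-1".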

       --cdump <f>
              Dump the multiple alignments of each cluster to <f> in Stockholm format.  This
              option only works in combination with --ctarget, --cmaxid or --call.

       --refine <f>
              Attempt  to  refine  the  alignment  before  building  the  CM  using  expectation-
              maximization  (EM).  A CM is first built from the initial alignment as usual. Then,
              the sequences in the alignment are realigned optimally (with the HMM banded CYK
              algorithm; optimal here means optimal given the bands) to the CM, and a new CM is built
              from the resulting alignment. The sequences are then realigned to the new CM, and a
              new CM is built from that alignment. This is continued until convergence,
              specifically when the alignments for two successive iterations are not
              significantly different (the summed bit score of all the sequences in the
              alignment changes by less than 1% between two successive iterations). The final
              alignment  (the  alignment  used  to  build  the CM that gets written to cmfile) is
              written to <f>.
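
              The iteration described above has the following overall shape, sketched in
              Python. The callables build_model and realign are hypothetical stand-ins for
              CM construction and realignment, the relative-change formula is one reading of
              "changes by less than 1%", and max_iter is a safety limit added by the sketch.

```python
def refine(alignment, build_model, realign, threshold=0.01, max_iter=100):
    """Alternate realignment and model building until the summed score settles.

    build_model(alignment) returns a model; realign(model, alignment) returns
    (new_alignment, summed_bit_score). Returns (model, alignment, iterations).
    """
    model = build_model(alignment)
    prev_score = None
    iteration = 0
    for iteration in range(1, max_iter + 1):
        alignment, score = realign(model, alignment)
        # The model is always rebuilt from the most recent alignment.
        model = build_model(alignment)
        if prev_score is not None and abs(score - prev_score) < threshold * abs(prev_score):
            break  # summed score changed by less than 1% since the last iteration
        prev_score = score
    return model, alignment, iteration
```

              With summed scores of 100, 150, 160 and 160.5 bits over successive iterations,
              the loop stops after the fourth iteration, since 0.5 is less than 1% of 160.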

       --gibbs
              Modifies the behavior of --refine so Gibbs sampling is  used  instead  of  EM.  The
              difference is that during the alignment stage the alignment is not necessarily
              optimal; instead, an alignment (parsetree) for each sequence is sampled from the
              posterior distribution of alignments as determined by the Inside algorithm. Due to
              this sampling step --gibbs is non-deterministic, so different runs  with  the  same
              alignment  may  yield  different  results.  This  is not true when --refine is used
              without the --gibbs option, in which case the final alignment and CM will always be
              the  same.  When  --gibbs  is  enabled,  the -s  <n> option can be used to seed the
              random number generator predictably, making the results reproducible.  The goal  of
              the  --gibbs  option  is  to  help  expert RNA alignment curators refine structural
              alignments by allowing them to observe alternative high scoring alignments.

       -s <n> Set the random seed to <n>, where <n> is a positive integer. This option  can  only
              be  used  in  combination with --gibbs.  The default is to use time() to generate a
              different seed for each run,  which  means  that  two  different  runs  of  cmbuild
              --refine  <f>  --gibbs  on the same alignment will give slightly different results.
              You can use this option to generate reproducible results.
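
              The difference between picking the optimal alignment and sampling one, and the
              effect of a fixed seed, can be illustrated with a toy Python sketch. The list
              of candidate alignments with posterior probabilities is hypothetical; the real
              procedure samples parsetrees from the Inside matrix rather than enumerating
              candidates.

```python
import random


def pick_optimal(candidates):
    """Deterministic choice: the candidate with the highest probability."""
    return max(candidates, key=lambda c: c[1])[0]


def sample_alignment(candidates, rng):
    """Stochastic choice: draw a candidate in proportion to its probability."""
    labels = [label for label, _ in candidates]
    probs = [prob for _, prob in candidates]
    return rng.choices(labels, weights=probs, k=1)[0]


def sample_run(candidates, seed, n_draws=10):
    """Repeat the draw with a seeded generator so the run is reproducible."""
    rng = random.Random(seed)
    return [sample_alignment(candidates, rng) for _ in range(n_draws)]
```

              Two calls to sample_run with the same seed return the same sequence of draws,
              which is the behavior the -s option provides for --gibbs.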

       -l     With --refine, turn on the local alignment algorithm, which allows the alignment to
              span  two  or  more  subsequences if necessary (e.g. if the structures of the query
              model and target sequence  are  only  partially  shared),  allowing  certain  large
              insertions  and  deletions in the structure to be penalized differently than normal
              indels.  The default is to globally align the query model to the target sequences.

       -a     With --refine, print the scores of each individual sequence alignment.

       --cyk  With --refine, align with the  CYK  algorithm.  By  default  the  optimal  accuracy
              algorithm is used. There is more information on this in the cmalign manual page.

       --sub  With --refine, turn on the sub model construction and alignment procedure. For each
              sequence to be realigned an HMM is first used to predict the model  start  and  end
              consensus  columns,  and  a  new  sub  CM is constructed that only models consensus
              columns from start to end. The sequence is then  aligned  to  this  sub  CM.   This
              option is useful for building CMs for alignments with sequences that are known to
              be truncated, non-full length sequences. This option is experimental and not
              rigorously tested; use at your own risk.  This "sub CM" procedure is not the same
              as the "sub CMs" described by Weinberg and Ruzzo.

       --nonbanded
              With --refine, do not use HMM bands to accelerate  alignment.   Use  the  full  CYK
              algorithm  which  is guaranteed to give the optimal alignment.  This will slow down
              the run significantly, especially for large models.

       --tau <x>
              With --refine, set the tail loss probability used during HMM band calculation to
              <x>.  This is the amount of probability mass within the HMM posterior probabilities
              that is considered negligible. The default  value  is  1E-7.   In  general,  higher
              values  will result in greater acceleration, but increase the chance of missing the
              optimal alignment due to the HMM bands.

       --fins With --refine, change the behavior of  how  insert  emissions  are  placed  in  the
              alignment.   By  default,  all  contiguous blocks of inserts are split in half, and
              half the residues are flushed left against the  nearest  consensus  column  to  the
              left, and half are flushed right against the nearest consensus column on the right.
              With --fins, inserts are not split in half; instead, all inserted residues from IL
              states are flushed left, and all inserted residues from IR states are flushed
              right. This was the default behavior of previous versions of Infernal.
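
              The two placement rules can be sketched in Python for a single block of insert
              columns between two consensus columns. The function names, the "." gap
              character, and giving the extra residue of an odd-sized block to the left half
              are assumptions of the sketch.

```python
def place_split(residues, width, gap="."):
    """Default rule: half the residues flush left, half flush right."""
    if len(residues) > width:
        raise ValueError("insert block is narrower than the residues it must hold")
    left_n = (len(residues) + 1) // 2  # assumption: odd residue goes left
    left, right = residues[:left_n], residues[left_n:]
    return left + gap * (width - len(residues)) + right


def place_fins(il_residues, ir_residues, width, gap="."):
    """--fins rule: IL residues flush left, IR residues flush right."""
    used = len(il_residues) + len(ir_residues)
    if used > width:
        raise ValueError("insert block is narrower than the residues it must hold")
    return il_residues + gap * (width - used) + ir_residues
```

              For a block six columns wide, place_split("acgu", 6) gives "ac..gu", while
              place_fins("acg", "u", 6) gives "acg..u".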

       --mxsize <x>
              With --refine,  set  the  maximum  allowable  matrix  size  for  alignment  to  <x>
              megabytes.  By default this size is 2 GB.  This should be large enough for the vast
              majority of alignments; however, it is possible that when run with --refine, cmbuild
              will exit prematurely, reporting an error message that the matrix exceeded its
              maximum allowable size. In this case, the --mxsize option can be used to raise the limit.

       --rdump <f>
              With --refine, output the intermediate alignments at each iteration of the
              refinement procedure (as described above for --refine) to file <f>.

SEE ALSO

       For  complete  documentation,  see  the  User's  Guide  (Userguide.pdf) that came with the
       distribution; or see the Infernal web page, http://infernal.janelia.org/.

COPYRIGHT

       Copyright (C) 2009 HHMI Janelia Farm Research Campus.
       Freely distributed under the GNU General Public License (GPLv3).
       See the file COPYING that came with the source for details on redistribution conditions.

AUTHOR

       Eric Nawrocki, Diana Kolbe, and Sean Eddy
       HHMI Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147
       http://selab.janelia.org/