Provided by: infernal_1.1.5-2_amd64

NAME

       cmbuild  - construct covariance model(s) from structurally annotated RNA multiple sequence
       alignment(s)

SYNOPSIS

       cmbuild [options] <cmfile_out> <msafile>

DESCRIPTION

       For each multiple sequence alignment in <msafile> build a covariance model and save it  to
       a new file <cmfile_out>.

       The  alignment  file  must  be  in  Stockholm  or SELEX format, and must contain consensus
       secondary structure annotation.  cmbuild uses the consensus  structure  to  determine  the
       architecture of the CM.

       <msafile> may be '-' (dash), which means reading this input from stdin rather than a file.
       To use '-', you must also specify the alignment file format with  --informat  <s>,  as  in
       --informat  stockholm  (because  of  a  current limitation in our implementation, MSA file
       formats cannot be autodetected in a nonrewindable input stream).

       <cmfile_out> may not be '-' (stdout), because sending the CM file to stdout would conflict
       with the other text output of the program.
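
       The two rules above for '-' can be sketched as a small wrapper that assembles a cmbuild
       command line. This is only an illustration, not part of Infernal: the function name
       cmbuild_argv and the file name my.cm are hypothetical, and only the --informat option
       described above is used.

```python
def cmbuild_argv(cmfile_out, msafile, informat=None):
    """Assemble a cmbuild argument list, enforcing the '-' rules described above.

    Hypothetical helper for illustration; it does not run cmbuild itself.
    """
    if cmfile_out == "-":
        # The CM file cannot go to stdout: it would mix with the summary output.
        raise ValueError("<cmfile_out> may not be '-'")
    if msafile == "-" and informat is None:
        # Formats cannot be autodetected on a nonrewindable stream.
        raise ValueError("reading <msafile> from stdin requires --informat")
    argv = ["cmbuild"]
    if informat is not None:
        argv += ["--informat", informat]
    return argv + [cmfile_out, msafile]


print(cmbuild_argv("my.cm", "-", informat="stockholm"))
# ['cmbuild', '--informat', 'stockholm', 'my.cm', '-']
```

       The resulting list could be passed to subprocess.run() with the alignment supplied on
       standard input.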

       In  addition to writing CM(s) to <cmfile_out>, cmbuild also outputs a single line for each
       model created to stdout. Each line has the following  fields:  "aln":  the  index  of  the
       alignment  used  to  build the CM; "idx": the index of the CM in the <cmfile_out>; "name":
       the name of the CM; "nseq": the number of sequences in the alignment used to build the CM;
       "eff_nseq":  the effective number of sequences used to build the model; "alen": the length
       of the alignment used to build the CM; "clen": the number of columns  from  the  alignment
       defined  as  consensus  (match) columns; "bps": the number of basepairs in the CM; "bifs":
       the number of bifurcations in the CM; "rel entropy: CM": the total relative entropy of the
       model  divided  by the number of consensus columns; "rel entropy: HMM": the total relative
       entropy of the model ignoring secondary structure  divided  by  the  number  of  consensus
       columns; "description": the description of the model/alignment.

OPTIONS

       -h     Help; print a brief reminder of command line usage and available options.

       -n <s> Name  the  new  CM <s>.  The default is to use the name of the alignment (if one is
              present in the <msafile>), or,  failing  that,  the  name  of  the  <msafile>.   If
              <msafile>  contains  more  than one alignment, -n doesn't work, and every alignment
              must have a name annotated in the <msafile> (as in Stockholm #=GF ID annotation).

       -F     Allow <cmfile_out> to be overwritten. Without this option, if <cmfile_out>  already
              exists, cmbuild exits with an error.

       -o <f> Direct the summary output to file <f>, rather than to stdout.

       -O <f> After  each  model is constructed, resave annotated source alignments to a file <f>
              in Stockholm format.  Sequences are annotated with the relative sequence weights
              that were assigned.  The alignments are also annotated with a reference annotation line
              indicating which columns were assigned as consensus. If the  source  alignment  had
              reference  annotation ("#=GC RF") it will be replaced with the consensus residue of
              the model for consensus columns and '.'  for  insert  columns,  unless  the  --hand
              option  was  used  for  specifying  consensus  positions,  in which case it will be
              unchanged. Any sequences defined as fragments will be annotated as well, using only
              ~  characters  before  the  first  residue  and after the final residue, unless the
              --fraggiven option was used.

       --devhelp
              Print help, as with -h, but also include expert options that are not displayed with
              -h.  These expert options are not expected to be relevant for the vast majority of
              users and so are not described in the manual page.  The only resources for
              understanding what they actually do are the brief one-line descriptions output when
              --devhelp is enabled, and the source code.

OPTIONS CONTROLLING MODEL CONSTRUCTION

       --fast Define consensus columns automatically as those that have a fraction >= symfrac  of
              residues  as  opposed  to  gaps.  (See below for the --symfrac option.) This is the
              default.

       --hand Use reference coordinate annotation (#=GC RF line, in Stockholm) to determine which
              columns  are  consensus,  and which are inserts.  Any non-gap character indicates a
              consensus column. (For example, mark consensus columns with "x", and insert columns
              with  ".".)  This  option  was  called  --rf  in previous versions of Infernal (0.1
              through 1.0.2).

       --symfrac <x>
              Define the residue fraction threshold necessary to define a consensus  column  when
              not  using  --hand.   The  default  is  0.5.  The symbol fraction in each column is
              calculated after taking relative sequence weighting into account.  Setting this  to
              0.0  means  that every alignment column will be assigned as consensus, which may be
              useful in some cases. Setting it to 1.0 means that only columns that include 0 gaps
              will  be  assigned  as  consensus.  This option replaces the --gapthresh <y> option
              from previous versions of Infernal (0.1 through 1.0.2), with <x> equal to (1.0 -
              <y>).  For example, to reproduce the behavior of cmbuild --gapthresh 0.8
              in a previous version, use cmbuild --symfrac  0.2 with this version.
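
              The column rule described for --fast and --symfrac can be sketched in Python as
              follows. This is a simplified illustration, not Infernal's implementation: the
              function names, the gap alphabet, and the toy alignment are assumptions, and
              details such as fragment handling are ignored.

```python
GAP_CHARS = set("-._~")  # assumed gap symbols for this sketch


def consensus_columns(aligned_seqs, weights, symfrac=0.5):
    """Return indices of columns whose weighted residue fraction is >= symfrac."""
    total = sum(weights)
    consensus = []
    for col in range(len(aligned_seqs[0])):
        # Sum the weights of the sequences that have a residue (not a gap) here.
        residue_weight = sum(
            w for seq, w in zip(aligned_seqs, weights) if seq[col] not in GAP_CHARS
        )
        if residue_weight / total >= symfrac:
            consensus.append(col)
    return consensus


def symfrac_from_gapthresh(gapthresh):
    """Convert an old --gapthresh <y> value to the equivalent --symfrac <x>."""
    return 1.0 - gapthresh


msa = ["GGA-UCC", "GGAAUCC", "GG--UCC", "GC--UGC"]  # toy alignment
print(consensus_columns(msa, [1.0] * 4))               # [0, 1, 2, 4, 5, 6]
print(consensus_columns(msa, [1.0] * 4, symfrac=1.0))  # [0, 1, 4, 5, 6]
```

              With uniform weights, column 2 has residues in exactly half of the sequences and
              so just meets the default threshold; column 3 does not.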

       --fragthresh <x>
              We only want to count terminal gaps as deletions if the aligned sequence  is  known
              to  be  full-length, not if it is a fragment (for instance, because only part of it
              was sequenced). A fragment is  defined  as  any  aligned  sequence  for  which  the
              fractional  span,  defined  as  its  aligned  length from its first to last residue
              divided by the total alignment length, is less than <x> (0.8 by default).  Note that
              this  differs  from  the  way  HMMER  defines  fragments  (as  of v3.3.2).  Setting
              --fragthresh 0 will define no sequence as a fragment; you might want to do this  if
              you know you have a carefully curated alignment of full-length sequences or want to
              mimic the behavior  of  older  versions  of  Infernal  (v1.1  to  v1.1.4).  Setting
              --fragthresh  1  will  define  all  sequences  as  fragments.  The --fragnrfpos and
              --fraggiven options offer alternative ways to define fragments.
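
              The fractional span test can be sketched as below. This is a minimal sketch that
              follows the wording above (a strict "less than" comparison); the function names
              and gap alphabet are assumptions, and the real boundary handling may differ.

```python
GAP_CHARS = set("-._~")  # assumed gap symbols for this sketch


def fractional_span(aligned_seq):
    """Aligned length from first to last residue, divided by the alignment length."""
    positions = [i for i, ch in enumerate(aligned_seq) if ch not in GAP_CHARS]
    if not positions:
        return 0.0
    return (positions[-1] - positions[0] + 1) / len(aligned_seq)


def is_fragment(aligned_seq, fragthresh=0.8):
    """Apply the span test as worded above: fragment if span < fragthresh."""
    return fractional_span(aligned_seq) < fragthresh


print(fractional_span("--GGAUCC--"))  # 0.6
print(is_fragment("--GGAUCC--"))      # True
print(is_fragment("GGA--UCCAG"))      # False
```

              Internal gaps do not shorten the span; only the positions of the first and last
              residues matter.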

       --fragnrfpos <n>
              Define a sequence as a fragment if it has more than <n> gaps in terminal  consensus
              positions  at  the  5' or 3' ends. This option can only be used in combination with
              the --hand option, and if it is used, the --fragthresh option is ignored.

       --fraggiven
              Do not infer which sequences are fragments  based  on  their  lengths  but  do  use
              fragment  information  in  the input alignment, if there is any.  For a sequence in
              the input alignment to be considered a fragment, all positions before (5'  of)  the
              first  nucleotide  and  all  positions after (3' of) the final nucleotide must be ~
              symbols. Importantly, ~ symbols are not allowed anywhere else in the alignment.

       --noss Ignore the secondary structure annotation, if any, in <msafile> and build a CM with
              zero  basepairs.  This  model will be similar to a profile HMM and the cmsearch and
              cmscan programs will use HMM algorithms which are faster  than  CM  ones  for  this
              model.  Additionally, a zero basepair model need not be calibrated with cmcalibrate
              prior to running cmsearch with it. The --noss option must be used if  there  is  no
              secondary structure annotation in <msafile>.

       --rsearch <f>
              Parameterize  emission  scores  a la RSEARCH, using the RIBOSUM matrix in file <f>.
              With --rsearch enabled, all  alignments  in  <msafile>  must  contain  exactly  one
              sequence  or the --call option must also be enabled. All positions in each sequence
              will be considered consensus "columns".  Actually, the emission  scores  for  these
              models will not be identical to RIBOSUM scores due to differences in the modelling
              strategy between Infernal and RSEARCH, but they will be  as  similar  as  possible.
              RIBOSUM  matrix files are included with Infernal in the "matrices/" subdirectory of
              the top-level "infernal-xxx" directory.  RIBOSUM matrices  are  substitution  score
              matrices  trained  specifically  for  structural RNAs with separate single stranded
              residue and base pair substitution scores. For more  information  see  the  RSEARCH
              publication (Klein and Eddy, BMC Bioinformatics 4:44, 2003).

       --consrf
              With  --hand  use the model's consensus sequence for the model reference annotation
              instead of the RF annotation from the input alignment.

OTHER MODEL CONSTRUCTION OPTIONS

       --null <f>
              Read a null model from <f>.  The null model defines the  probability  of  each  RNA
              nucleotide in background sequence; the default is to use 0.25 for each nucleotide.
              The format of null files is specified in the user guide.

       --prior <f>
              Read a Dirichlet prior from <f>, replacing  the  default  mixture  Dirichlet.   The
              format of prior files is specified in the user guide.

       Use --devhelp to see additional, otherwise undocumented, model construction options.

OPTIONS CONTROLLING RELATIVE WEIGHTS

       cmbuild  uses  an  ad  hoc  sequence  weighting  algorithm  to  downweight closely related
       sequences and upweight distantly related ones. This has the effect of making  models  less
       biased  by  uneven phylogenetic representation. For example, two identical sequences would
       typically each receive half the weight that one sequence  would.   These  options  control
       which algorithm gets used.

       --wpb  Use  the  Henikoff position-based sequence weighting scheme [Henikoff and Henikoff,
              J. Mol. Biol. 243:574, 1994].  This is the default.
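
              The idea behind position-based weighting can be sketched as follows. This is a
              textbook-style rendering of the Henikoff scheme, not Infernal's code: the
              function name, the treatment of gaps (skipped), and the final normalization to
              sum to the number of sequences are assumptions. The alignment must contain at
              least one residue.

```python
from collections import Counter

GAP_CHARS = set("-._~")  # assumed gap symbols for this sketch


def position_based_weights(aligned_seqs):
    """Henikoff-style weights: per column, 1 / (residue types * count of own residue)."""
    nseq = len(aligned_seqs)
    raw = [0.0] * nseq
    for col in range(len(aligned_seqs[0])):
        counts = Counter(s[col] for s in aligned_seqs if s[col] not in GAP_CHARS)
        ntypes = len(counts)
        for i, s in enumerate(aligned_seqs):
            if s[col] in counts:  # gaps contribute nothing in this sketch
                raw[i] += 1.0 / (ntypes * counts[s[col]])
    total = sum(raw)
    # Assumed normalization: relative weights sum to the number of sequences.
    return [nseq * r / total for r in raw]


print(position_based_weights(["GGAUCC", "GGAUCC", "CCUAGG"]))
# [0.75, 0.75, 1.5]
```

              The two identical sequences each receive half the weight of the third, matching
              the behavior described at the top of this section.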

       --wgsc Use the Gerstein/Sonnhammer/Chothia weighting algorithm [Gerstein et  al,  J.  Mol.
              Biol. 235:1067, 1994].

       --wnone
              Turn sequence weighting off; i.e., explicitly set all sequence weights to 1.0.

       --wgiven
              Use  sequence  weights  as  given  in annotation in the input alignment file. If no
              weights were given, assume they are all 1.0.  The default is to determine new
              sequence weights by the Henikoff position-based algorithm (--wpb), ignoring any
              annotated weights.

       --wblosum
              Use the BLOSUM filtering algorithm to weight the sequences, instead of the default
              position-based weighting.  Cluster the sequences at a given percentage identity
              (see --wid); assign each cluster a total weight of 1.0, distributed equally amongst
              the members of that cluster.

       --wid <x>
              Controls  the  behavior  of  the  --wblosum weighting option by setting the percent
              identity for clustering the alignment to <x>.
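
              The clustering rule described for --wblosum can be sketched as below. This is an
              illustration only: the function names are hypothetical, single-linkage
              clustering is assumed, the identity threshold is treated as a fraction between
              0 and 1, and pairwise identity is computed over columns where both sequences
              have a residue. Infernal's actual definitions may differ.

```python
GAP_CHARS = set("-._~")  # assumed gap symbols for this sketch


def pairwise_identity(a, b):
    """Fraction of identical residues over columns where both sequences have one."""
    both = [(x, y) for x, y in zip(a, b) if x not in GAP_CHARS and y not in GAP_CHARS]
    if not both:
        return 0.0
    return sum(x == y for x, y in both) / len(both)


def blosum_style_weights(aligned_seqs, wid):
    """Single-linkage cluster at identity >= wid; each cluster shares a weight of 1.0."""
    n = len(aligned_seqs)
    parent = list(range(n))  # union-find forest over sequence indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if pairwise_identity(aligned_seqs[i], aligned_seqs[j]) >= wid:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(n)]
    return [1.0 / roots.count(r) for r in roots]


print(blosum_style_weights(["GGAUCC", "GGAUCU", "CCUAGG"], wid=0.8))
# [0.5, 0.5, 1.0]
```

              The first two sequences are 5/6 identical and fall into one cluster, so they
              split a weight of 1.0; the third forms its own cluster.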

OPTIONS CONTROLLING EFFECTIVE SEQUENCE NUMBER

       After relative weights are determined, they are normalized to sum  to  a  total  effective
       sequence  number,  eff_nseq.   This  number  may  be the actual number of sequences in the
       alignment, but it is almost always smaller  than  that.   The  default  entropy  weighting
       method  (--eent)  reduces  the effective sequence number to reduce the information content
       (relative entropy, or average expected score on true homologs) per consensus position. The
       target  relative  entropy  is  controlled  by  a  two-parameter  function,  where  the two
       parameters are settable with --ere and --esigma.
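
       The normalization step in the first sentence above can be sketched as follows. The
       function name is hypothetical, and the sketch covers only the rescaling; how cmbuild
       chooses eff_nseq under entropy weighting is not modelled here.

```python
def scale_to_eff_nseq(relative_weights, eff_nseq):
    """Rescale relative weights so that they sum to the effective sequence number."""
    total = sum(relative_weights)
    return [w * eff_nseq / total for w in relative_weights]


# Three sequences whose relative weights sum to 3.0, scaled to eff_nseq = 1.5.
print(scale_to_eff_nseq([0.75, 0.75, 1.5], 1.5))
# [0.375, 0.375, 0.75]
```

       The ratios between the sequences are unchanged; only the total amount of evidence
       counted from the alignment is reduced.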

       --eent Use the entropy weighting strategy to determine the effective sequence number  that
              gives  a  target mean match state relative entropy. This option is the default, and
              can be turned off with --enone.  The  default  target  mean  match  state  relative
              entropy  is  0.59 bits for models with at least 1 basepair and 0.38 bits for models
              with zero basepairs, but can be changed with --ere.  The default of  0.59  or  0.38
              bits  is  automatically  changed if the total relative entropy of the model (summed
              match state relative entropy) is less than a cutoff, which  is  controlled  by  the
              --esigma  option.  If  you really want to play with that option, consult the source
              code.  Additionally, the effective sequence number cannot be larger than the number
              of  sequences  in the alignment, although this can be overridden to set the maximum
              possible effective sequence number with the --emaxseq option.

       --enone
              Turn off the entropy weighting strategy. The effective sequence number is just  the
              number of sequences in the alignment.

       --ere <x>
              Set  the  target  mean  match state relative entropy as <x>.  By default the target
              relative entropy per match position is  0.59  bits  for  models  with  at  least  1
              basepair and 0.38 for models with zero basepairs.

       --eminseq <x>
              Define the minimum allowed effective sequence number as <x>.

       --emaxseq <x>
              Define  the  maximum  allowed effective sequence number as <x>.  This number can be
              larger than the number of sequences in the alignment.

       --ehmmre <x>
              Set the target  HMM  mean  match  state  relative  entropy  as  <x>.   Entropy  for
              basepairing  match  states  is  calculated  using  marginalized  basepair  emission
              probabilities.

       --eset <x>
              Set the effective sequence number for entropy weighting as <x>.

OPTIONS CONTROLLING FILTER P7 HMM CONSTRUCTION

       For each CM that cmbuild constructs, an accompanying filter p7 HMM is built from the input
       alignment as well. These options control filter HMM construction:

       --p7ere <x>
              Set  the target mean match state relative entropy for the filter p7 HMM as <x>.  By
              default the target relative entropy per match position is 0.38 bits.

       --p7ml Use a maximum likelihood p7 HMM built from the CM as the filter HMM. This HMM  will
              be  as  similar  as  possible  to  the  CM (while necessarily ignorant of secondary
              structure).

       Use --devhelp to see additional, otherwise undocumented, filter HMM construction options.

OPTIONS CONTROLLING FILTER P7 HMM CALIBRATION

       After building each filter HMM, cmbuild determines appropriate E-value parameters  to  use
       during  filtering in cmsearch and cmscan by sampling a set of sequences and searching them
       with each HMM filter configuration and algorithm.

       --EmN <n>
              Set the number of sampled sequences for local MSV filter HMM calibration to <n>.
              200 by default.

       --EvN <n>
              Set the number of sampled sequences for local Viterbi filter HMM calibration to
              <n>.  200 by default.

       --ElfN <n>
              Set the number of sampled sequences for local Forward filter HMM calibration to
              <n>.  200 by default.

       --EgfN <n>
              Set the number of sampled sequences for glocal Forward filter HMM calibration to
              <n>.  200 by default.

       Use --devhelp to see additional, otherwise undocumented, filter HMM calibration options.

OPTIONS FOR REFINING THE INPUT ALIGNMENT

       --refine <f>
              Attempt  to  refine  the  alignment  before  building  the  CM  using  expectation-
              maximization  (EM).  A CM is first built from the initial alignment as usual. Then,
              the sequences in the alignment are realigned optimally to the CM (with the HMM banded
              CYK algorithm; "optimal" here means optimal given the bands), and a new CM is built
              from the resulting alignment. The sequences are then realigned to the new CM, and a
              new  CM  is  built  from  that  alignment.  This  is  continued  until convergence,
              specifically  when  the  alignments  for  two   successive   iterations   are   not
              significantly different (the sum of the bit scores of all the sequences in the
              alignment changes by less than 1% between two successive iterations).  The final
              alignment (the alignment used to build the CM that gets written to <cmfile_out>) is
              written to <f>.
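
              The iteration described above can be sketched as a loop with a 1% convergence
              test on the summed bit score. This is a schematic outline, not Infernal's
              implementation: build_cm and realign are hypothetical callables supplied by the
              caller, and the max_iter cap (which must be at least 1) is an assumption added
              for safety.

```python
def refine_until_converged(initial_alignment, build_cm, realign, max_iter=100, tol=0.01):
    """Alternate model building and realignment until the summed score stabilizes.

    build_cm(alignment) -> model
    realign(model, alignment) -> (new_alignment, summed_bit_score)
    """
    alignment = initial_alignment
    prev_score = None
    for iteration in range(1, max_iter + 1):
        cm = build_cm(alignment)
        alignment, score = realign(cm, alignment)
        # Converged when the summed score changes by less than tol (1% by default).
        if prev_score is not None and abs(score - prev_score) < tol * abs(prev_score):
            break
        prev_score = score
    # The final model is built from the final alignment.
    return alignment, build_cm(alignment), iteration


# Toy stand-ins: the "alignment" is a counter and scores follow a fixed sequence.
scores = [100.0, 120.0, 120.5, 120.6]
result = refine_until_converged(
    0,
    build_cm=lambda aln: aln,
    realign=lambda cm, aln: (aln + 1, scores[aln]),
)
print(result)  # (3, 3, 3)
```

              In the toy run, the score moves from 120.0 to 120.5 on the third iteration, a
              change of less than 1%, so the loop stops there.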

       -l     With --refine, turn on the local alignment algorithm, which allows the alignment to
              span  two  or  more  subsequences if necessary (e.g. if the structures of the query
              model and target sequence  are  only  partially  shared),  allowing  certain  large
              insertions  and  deletions in the structure to be penalized differently than normal
              indels.  The default is to globally align the query model to the target sequences.

       --gibbs
              Modifies the behavior of --refine so Gibbs sampling is  used  instead  of  EM.  The
              difference  is  that  during  the  alignment stage the alignment is not necessarily
              optimal; instead, an alignment (parsetree) for each sequence is sampled from the
              posterior  distribution of alignments as determined by the Inside algorithm. Due to
              this sampling step --gibbs is non-deterministic, so different runs  with  the  same
              alignment  may  yield  different  results.  This  is not true when --refine is used
              without the --gibbs option, in which case the final alignment and CM will always be
              the  same.  When --gibbs is enabled, the --seed  <n> option can be used to seed the
              random number generator predictably, making the results reproducible.  The goal  of
              the  --gibbs  option  is  to  help  expert RNA alignment curators refine structural
              alignments by allowing them to observe alternative high scoring alignments.

       --seed <n>
              Seed the random number generator with <n>, an integer >= 0.  This option  can  only
              be  used  in  combination  with --gibbs.  If <n> is nonzero, stochastic sampling of
              alignments will be reproducible; the same command will give the same  results.   If
              <n>  is  0,  the  random  number  generator  is  seeded arbitrarily, and stochastic
              samplings may vary from run to run of the same command.  The default seed is 0.

       --cyk  With --refine, align with the  CYK  algorithm.  By  default  the  optimal  accuracy
              algorithm is used. There is more information on this in the cmalign manual page.

       --notrunc
              With --refine, turn off the truncated alignment algorithm. There is more
              information on this in the cmalign manual page.

       --miss With --refine, in the final alignment and each intermediate alignment, consider all
              sequences  with  terminal  gaps  as  fragments for purposes of building models from
              those alignments. You may want to do this if you have many sequences that  are  not
              full length, e.g. fragmentary because only part of each was sequenced.

       Use  --devhelp  to see additional, otherwise undocumented, alignment refinement options as
       well as other output file options and options for building multiple models  for  a  single
       alignment.

SEE ALSO

       See  infernal(1)  for  a  master  man page with a list of all the individual man pages for
       programs in the Infernal package.

       For complete documentation, see the user guide that came with your  Infernal  distribution
       (Userguide.pdf); or see the Infernal web page (http://eddylab.org/infernal/).

COPYRIGHT

       Copyright (C) 2023 Howard Hughes Medical Institute.
       Freely distributed under the BSD open source license.

       For  additional  information  on copyright and licensing, see the file called COPYRIGHT in
       your    Infernal    source    distribution,    or    see    the    Infernal    web    page
       (http://eddylab.org/infernal/).

AUTHOR

       http://eddylab.org