Provided by: infernal_1.1.3-4_amd64 bug

NAME

       cmbuild - construct covariance model(s) from structurally annotated RNA multiple sequence alignment(s)

SYNOPSIS

       cmbuild [options] <cmfile_out> <msafile>

DESCRIPTION

       For  each  multiple  sequence  alignment  in <msafile> build a covariance model and save it to a new file
       <cmfile_out>.

       The alignment file must be in Stockholm or SELEX format, and must contain consensus  secondary  structure
       annotation.  cmbuild uses the consensus structure to determine the architecture of the CM.

       <msafile>  may  be '-' (dash), which means reading this input from stdin rather than a file.  To use '-',
       you must also specify the alignment file format with --informat <s>, as in --informat stockholm  (because
       of a current limitation in our implementation, MSA file formats cannot be autodetected in a nonrewindable
       input stream.)

       <cmfile_out> may not be '-' (stdout), because sending the CM file to stdout would conflict with the other
       text output of the program.

       In  addition  to writing CM(s) to <cmfile_out>, cmbuild also outputs a single line for each model created
       to stdout. Each line has the following fields: "aln": the index of the alignment used to  build  the  CM;
       "idx":  the  index  of  the  CM  in  the  <cmfile_out>; "name": the name of the CM; "nseq": the number of
       sequences in the alignment used to build the CM; "eff_nseq": the effective number of  sequences  used  to
       build  the model; "alen": the length of the alignment used to build the CM; "clen": the number of columns
       from the alignment defined as consensus (match) columns; "bps":  the  number  of  basepairs  in  the  CM;
       "bifs":  the number of bifurcations in the CM; "rel entropy: CM": the total relative entropy of the model
       divided by the number of consensus columns; "rel entropy: HMM": the total relative entropy of  the  model
       ignoring  secondary  structure divided by the number of consensus columns.  "description": description of
       the model/alignment.

OPTIONS

       -h     Help; print a brief reminder of command line usage and available options.

       -n <s> Name the new CM <s>.  The default is to use the name of the alignment (if one is  present  in  the
              <msafile>),  or,  failing  that,  the  name of the <msafile>.  If <msafile> contains more than one
              alignment, -n doesn't work, and every alignment must have a name annotated in the <msafile> (as in
              Stockholm #=GF ID annotation).

       -F     Allow <cmfile_out> to be overwritten. Without this option, if <cmfile_out> already exists, cmbuild
              exits with an error.

       -o <f> Direct the summary output to file <f>, rather than to stdout.

       -O <f> After each model is constructed, resave annotated source alignments to a  file  <f>  in  Stockholm
              format.   Sequences are annoted with what relative sequence weights were assigned.  The alignments
              are also annotated with a reference annotation line indicating  which  columns  were  assigned  as
              consensus.  If  the source alignment had reference annotation ("#=GC RF") it will be replaced with
              the consensus residue of the model for consensus columns and '.' for insert  columns,  unless  the
              --hand option was used for specifying consensus positions, in which case it will be unchanged.

              --devhelp  Print help, as with -h , but also include expert options that are not displayed with -h
              .  These expert options are not expected to be relevant for the vast majority of users and so  are
              not  described in the manual page.  The only resources for understanding what they actually do are
              the brief one-line descriptions output when --devhelp is enabled, and the source code.

OPTIONS CONTROLLING MODEL CONSTRUCTION

       These options control how consensus columns are defined in an alignment.

       --fast Define consensus columns automatically as those that have a fraction >=  symfrac  of  residues  as
              opposed to gaps. (See below for the --symfrac option.) This is the default.

       --hand Use  reference  coordinate  annotation (#=GC RF line, in Stockholm) to determine which columns are
              consensus, and which are inserts.  Any  non-gap  character  indicates  a  consensus  column.  (For
              example,  mark  consensus  columns  with "x", and insert columns with ".".) This option was called
              --rf in previous versions of Infernal (0.1 through 1.0.2).

       --symfrac <x>
              Define the residue fraction threshold necessary to  define  a  consensus  column  when  not  using
              --hand.   The  default  is  0.5.  The  symbol  fraction  in each column is calculated after taking
              relative sequence weighting into account.  Setting this to 0.0 means that every  alignment  column
              will  be  assigned  as  consensus, which may be useful in some cases. Setting it to 1.0 means that
              only columns that include 0 gaps  will  be  assigned  as  consensus.   This  option  replaces  the
              --gapthresh  <y>  option from previous versions of Infernal (0.1 through 1.0.2), with <x> equal to
              (1.0 - <y>).  For example to reproduce behavior for a command of cmbuild  --gapthresh   0.8  in  a
              previous version, use cmbuild --symfrac  0.2 with this version.

       --noss Ignore  the  secondary  structure  annotation,  if  any,  in  <msafile>  and  build a CM with zero
              basepairs. This model will be similar to a profile HMM and the cmsearch and cmscan  programs  will
              use  HMM  algorithms  which  are faster than CM ones for this model. Additionally, a zero basepair
              model need not be calibrated with cmcalibrate prior to running cmsearch with it. The --noss option
              must be used if there is no secondary structure annotation in <msafile>.

       --rsearch <f>
              Parameterize  emission  scores a la RSEARCH, using the RIBOSUM matrix in file <f>.  With --rsearch
              enabled, all alignments in <msafile> must contain exactly one sequence or the --call  option  must
              also be enabled. All positions in each sequence will be considered consensus "columns".  Actually,
              the emission scores for these models will not be identical to RIBOSUM scores due of differences in
              the  modelling  strategy  between  Infernal  and RSEARCH, but they will be as similar as possible.
              RIBOSUM matrix files are included with Infernal in the "matrices/" subdirectory of  the  top-level
              "infernal-xxx"  directory.   RIBOSUM matrices are substitution score matrices trained specifically
              for structural RNAs with separate single stranded residue and base pair substitution  scores.  For
              more information see the RSEARCH publication (Klein and Eddy, BMC Bioinformatics 4:44, 2003).

OTHER MODEL CONSTRUCTION OPTIONS

       --null <f>
              Read  a  null  model  from  <f>.  The null model defines the probability of each RNA nucleotide in
              background sequence, the default is to use 0.25 for each nucleotide.  The format of null files  is
              specified in the user guide.

       --prior <f>
              Read  a  Dirichlet  prior  from <f>, replacing the default mixture Dirichlet.  The format of prior
              files is specified in the user guide.

       Use --devhelp to see additional, otherwise undocumented, model construction options.

OPTIONS CONTROLLING RELATIVE WEIGHTS

       cmbuild uses an ad hoc sequence weighting algorithm to downweight closely related sequences and  upweight
       distantly  related  ones.  This  has  the  effect  of  making  models  less biased by uneven phylogenetic
       representation. For example, two identical sequences would typically each receive half  the  weight  that
       one sequence would.  These options control which algorithm gets used.

       --wpb  Use  the  Henikoff  position-based sequence weighting scheme [Henikoff and Henikoff, J. Mol. Biol.
              243:574, 1994].  This is the default.

       --wgsc Use the Gerstein/Sonnhammer/Chothia weighting algorithm [Gerstein et al, J. Mol.  Biol.  235:1067,
              1994].

       --wnone
              Turn sequence weighting off; e.g. explicitly set all sequence weights to 1.0.

       --wgiven
              Use sequence weights as given in annotation in the input alignment file. If no weights were given,
              assume  they  are  all  1.0.   The  default  is  to  determine  new  sequence   weights   by   the
              Gerstein/Sonnhammer/Chothia algorithm, ignoring any annotated weights.

       --wblosum
              Use  the BLOSUM filtering algorithm to weight the sequences, instead of the default GSC weighting.
              Cluster the sequences at a given percentage identity (see --wid);  assign  each  cluster  a  total
              weight of 1.0, distributed equally amongst the members of that cluster.

       --wid <x>
              Controls  the  behavior  of  the  --wblosum  weighting  option by setting the percent identity for
              clustering the alignment to <x>.

OPTIONS CONTROLLING EFFECTIVE SEQUENCE NUMBER

       After relative weights are determined, they are normalized to sum to a total effective  sequence  number,
       eff_nseq.   This  number  may be the actual number of sequences in the alignment, but it is almost always
       smaller than that.  The default entropy weighting method (--eent) reduces the effective  sequence  number
       to  reduce  the  information  content  (relative entropy, or average expected score on true homologs) per
       consensus position. The target relative entropy is controlled by a two-parameter function, where the  two
       parameters are settable with --ere and --esigma.

       --eent Use  the entropy weighting strategy to determine the effective sequence number that gives a target
              mean match state relative entropy. This option is the default, and can be turned off with --enone.
              The  default  target  mean  match  state  relative entropy is 0.59 bits for models with at least 1
              basepair and 0.38 bits for models with zero basepairs, but can be changed with --ere.  The default
              of  0.59  or 0.38 bits is automatically changed if the total relative entropy of the model (summed
              match state relative entropy) is less than a cutoff, which is controlled by the  --esigma  option.
              If you really want to play with that option, consult the source code.  Additionally, the effective
              sequence number cannot be larger than the number of sequences in the alignment, although this  can
              be overridden to set the maximum possible effective sequence number with the --emaxseq option.

       --enone
              Turn  off  the  entropy  weighting  strategy.  The effective sequence number is just the number of
              sequences in the alignment.

       --ere <x>
              Set the target mean match state relative entropy as <x>.  By default the target  relative  entropy
              per  match position is 0.59 bits for models with at least 1 basepair and 0.38 for models with zero
              basepairs.

       --eminseq <x>
              Define the minimum allowed effective sequence number as <x>.

       --emaxseq <x>
              Define the maximum allowed effective sequence number as <x>.  This number can be larger  than  the
              number of sequences in the alignment.

       --ehmmre <x>
              Set the target HMM mean match state relative entropy as <x>.  Entropy for basepairing match states
              is calculated using marginalized basepair emission probabilities.

       --eset <x>
              Set the effective sequence number for entropy weighting as <x>.

OPTIONS CONTROLLING FILTER P7 HMM CONSTRUCTION

       For each CM that cmbuild constructs, an accompanying filter p7 HMM is built from the input  alignment  as
       well. These options control filter HMM construction:

       --p7ere <x>
              Set  the  target  mean  match state relative entropy for the filter p7 HMM as <x>.  By default the
              target relative entropy per match position is 0.38 bits.

       --p7ml Use a maximum likelihood p7 HMM built from the CM as the filter HMM. This HMM will be  as  similar
              as possible to the CM (while necessarily ignorant of secondary structure).

       Use --devhelp to see additional, otherwise undocumented, filter HMM construction options.

OPTIONS CONTROLLING FILTER P7 HMM CALIBRATION

       After building each filter HMM, cmbuild determines appropriate E-value parameters to use during filtering
       in cmsearch and cmscan by  sampling  a  set  of  sequences  and  searching  them  with  each  HMM  filter
       configuration and algorithm.

       --EmN  <n>  Set  the  number  of  sampled  sequences for local MSV filter HMM calibration to <n>.  200 by
       default.

       --EvN <n> Set the number of sampled sequences for local Viterbi filter HMM calibration to  <n>.   200  by
       default.

       --ElfN  <n>  Set the number of sampled sequences for local Forward filter HMM calibration to <n>.  200 by
       default.

       --EgfN <n> Set the number of sampled sequences for glocal Forward filter HMM calibration to <n>.  200  by
       default.

       Use --devhelp to see additional, otherwise undocumented, filter HMM calibration options.

OPTIONS FOR REFINING THE INPUT ALIGNMENT

       --refine <f>
              Attempt  to  refine the alignment before building the CM using expectation-maximization (EM). A CM
              is first built from the initial alignment as usual. Then,  the  sequences  in  the  alignment  are
              realigned  optimally (with the HMM banded CYK algorithm, optimal means optimal given the bands) to
              the CM, and a new CM is built from the resulting alignment. The sequences are  then  realigned  to
              the  new  CM,  and  a  new  CM  is built from that alignment. This is continued until convergence,
              specifically when the alignments for two successive iterations  are  not  significantly  different
              (the  summed  bit  scores  of  all the sequences in the alignment changes less than 1% between two
              successive iterations). The final alignment (the alignment used to build the CM that gets  written
              to <cmfile_out>) is written to <f>.

       -l     With  --refine,  turn  on the local alignment algorithm, which allows the alignment to span two or
              more subsequences if necessary (e.g. if the structures of the query model and target sequence  are
              only  partially  shared),  allowing  certain large insertions and deletions in the structure to be
              penalized differently than normal indels.  The default is to globally align the query model to the
              target sequences.

       --gibbs
              Modifies  the behavior of --refine so Gibbs sampling is used instead of EM. The difference is that
              during the alignment stage  the  alignment  is  not  necessarily  optimal,  instead  an  alignment
              (parsetree)  for  each  sequences  is  sampled  from  the  posterior distribution of alignments as
              determined by the Inside algorithm. Due to this sampling step  --gibbs  is  non-deterministic,  so
              different runs with the same alignment may yield different results. This is not true when --refine
              is used without the --gibbs option, in which case the final alignment and CM will  always  be  the
              same.  When  --gibbs  is  enabled,  the  --seed   <n> option can be used to seed the random number
              generator predictably, making the results reproducible.  The goal of the --gibbs option is to help
              expert RNA alignment curators refine structural alignments by allowing them to observe alternative
              high scoring alignments.

       --seed <n>
              Seed the random number generator with <n>, an integer >= 0.  This  option  can  only  be  used  in
              combination  with  --gibbs.   If  <n>  is  nonzero,  stochastic  sampling  of  alignments  will be
              reproducible; the same command will give the same  results.   If  <n>  is  0,  the  random  number
              generator  is  seeded  arbitrarily,  and stochastic samplings may vary from run to run of the same
              command.  The default seed is 0.

       --cyk  With --refine, align with the CYK algorithm. By default the optimal accuracy  algorithm  is  used.
              There is more information on this in the cmalign manual page.

       --notrunc
              With  --refine,  turn off the the truncated alignment algorithm. There is more information on this
              in the cmalign manual page.

       Use --devhelp to see additional, otherwise undocumented, alignment refinement options as  well  as  other
       output file options and options for building multiple models for a single alignment.

SEE ALSO

       See  infernal(1)  for  a  master man page with a list of all the individual man pages for programs in the
       Infernal package.

       For complete documentation, see the user guide that came with your Infernal distribution (Userguide.pdf);
       or see the Infernal web page ().

COPYRIGHT

       Copyright (C) 2019 Howard Hughes Medical Institute.
       Freely distributed under the BSD open source license.

       For  additional  information  on  copyright and licensing, see the file called COPYRIGHT in your Infernal
       source distribution, or see the Infernal web page ().

AUTHOR

       The Eddy/Rivas Laboratory
       Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147 USA
       http://eddylab.org