Provided by: infernal_1.0.2-2_amd64

NAME

       cmsearch - search a sequence database for RNAs homologous to a CM

SYNOPSIS

       cmsearch [options] cmfile seqfile

DESCRIPTION

       cmsearch  uses  the  covariance  model  (CM)  in  cmfile  to search for homologous RNAs in
       seqfile, and outputs high-scoring alignments.

       Currently, the sequence file must be in FASTA format.

       CMs are profiles of RNA consensus sequence and secondary structure. A CM file is  produced
       by  the cmbuild program, from a given RNA sequence alignment of known consensus structure.
       CM files can be calibrated  prior  to  running  cmsearch  with  the  cmcalibrate  program.
       Searches  with  calibrated  CM files will include E-values and will use appropriate filter
       thresholds for acceleration. It is strongly recommended to calibrate your CM files  before
       using  cmsearch.  CM calibration is described in more detail below and in chapters 5 and 6
       of the User's Guide.

       cmsearch output consists of alignments of all hits in the database  sorted  by  decreasing
       score  per  sequence  and per strand. That is, all hits for the same sequence and the same
       (Watson or Crick) strand are sorted, but hits across sequences or strands are not sorted.
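
       The ordering described above can be sketched in Python. This is only an illustration,
       not cmsearch's own code: the (sequence, strand, score) tuple layout, the function name
       order_hits, and the sample values are all hypothetical. Groups are kept in the order in
       which they first appear, and sorting happens only inside each group.

```python
from collections import OrderedDict

def order_hits(hits):
    """Sort hits by decreasing score within each (sequence, strand) group.

    Groups keep the order in which they first appear; nothing is sorted
    across sequences or strands.
    """
    groups = OrderedDict()
    for seq, strand, score in hits:
        groups.setdefault((seq, strand), []).append((seq, strand, score))
    ordered = []
    for group in groups.values():
        ordered.extend(sorted(group, key=lambda h: h[2], reverse=True))
    return ordered

# Hypothetical hits: (sequence name, strand, bit score)
hits = [
    ("seq1", "+", 12.5),
    ("seq1", "-", 40.1),
    ("seq1", "+", 33.0),
    ("seq2", "+", 80.2),
    ("seq1", "-", 55.7),
]
for hit in order_hits(hits):
    print(hit)
```

       Note that the seq2 hit has the highest score overall but is still listed last, because
       scores are not compared across sequences.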

       The threshold for reporting scores is different depending on whether the CM file has  been
       calibrated or not.  If the CM file has been calibrated, the default reporting threshold is
       an E-value of 1.0. This is the threshold at which 1 hit  is  expected  by  chance.  It  is
       possible  to  manually  set  the  threshold  to  bit  score <x> using the -T <x> option as
       described below, or to E-value <x> using the -E <x> option. The -E option will  only  work
       if the CM file has been calibrated.

       RNA homology search with CMs is slow.  To speed it up, cmsearch by default uses two rounds
       of filters with faster algorithms to prune the database prior to searching with  the  slow
       CM  algorithm.   The  first  round  of filtering is faster but less strict than the second
       round. First, the full database is searched with the first round  filter,  then  any  hits
       that  survive  the first round are searched with the second round filter. Finally any hits
       that survive the first and second round of filtering are searched  with  the  final  round
       search  strategy.   During  the  filter  rounds,  hits  are padded with a short stretch of
       residues on either side prior to searching with the subsequent round.  The exact number of
       residues is dependent on the size of the model being searched with.
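
       The padding step between rounds can be pictured with a small Python sketch. The
       function name pad_hit, the pad size of 10, and the use of 1-based inclusive coordinates
       are assumptions made for illustration; the actual pad length used by cmsearch depends
       on the model and is not stated here.

```python
def pad_hit(start, stop, pad, seq_len):
    """Extend a hit by `pad` residues on each side.

    Coordinates are assumed to be 1-based and inclusive, and the padded
    region is clamped so it never runs off either end of the sequence.
    """
    return max(1, start - pad), min(seq_len, stop + pad)

# Hypothetical hits on a 1000-residue sequence, padded by 10 residues.
print(pad_hit(100, 200, 10, 1000))  # interior hit: both sides extended
print(pad_hit(5, 60, 10, 1000))     # near the start: lower bound clamps to 1
print(pad_hit(950, 995, 10, 1000))  # near the end: upper bound clamps to 1000
```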

       The  first  round of filtering is performed with an HMM. If the CM file is calibrated, the
       threshold for the HMM filter will  be  automatically  chosen  as  an  appropriate  one  as
       determined in cmcalibrate.  The minimum threshold that will automatically be chosen is the
       threshold that will allow a predicted fraction (0.02 by default, changeable to <x> with
       --fil-Smin-hmm <x>) of the database to survive the filter. The maximum threshold that
       will automatically be chosen is the threshold that will allow a predicted fraction (0.5 by
       default, changeable to <x> with --fil-Smax-hmm <x>) of the database to survive the
       filter. If the threshold from cmcalibrate is greater than this maximum fraction,  the  HMM
       filter will be turned off and not used.  To ensure that the HMM filter is never turned off
       and always uses a threshold that gives this maximum fraction you must use the  --fil-A-hmm
       option.  If the model is not calibrated, the default HMM filter threshold is 3.0 bits. The
       HMM filter threshold can be manually set to bit score <x> using the --fil-T-hmm <x> option
       as  described  below,  or  to  E-value <x> using the --fil-E-hmm <x> option, or to the bit
       score that will allow a predicted database fraction of <x> to survive the filter using the
       --fil-S-hmm  <x>  option. The --fil-E-hmm and --fil-S-hmm options will only work if the CM
       file has been calibrated.  The HMM filter can be turned off with the --fil-no-hmm option.
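
       One reading of the automatic threshold rules above, expressed in terms of the predicted
       survival fraction, is sketched below in Python. The function name hmm_filter_fraction
       and its arguments are hypothetical, and the sketch works on fractions rather than on
       the bit scores that cmsearch actually derives from them.

```python
def hmm_filter_fraction(calibrated, smin=0.02, smax=0.5, always=False):
    """Return the predicted survival fraction to use for the HMM filter,
    or None if the filter is turned off.

    calibrated: survival fraction implied by the cmcalibrate threshold
    smin, smax: the --fil-Smin-hmm and --fil-Smax-hmm fractions
    always:     True mimics --fil-A-hmm (never turn the filter off)
    """
    if calibrated > smax:
        # Filter would let too much of the database through.
        return smax if always else None
    # Never filter more strictly than the minimum fraction.
    return max(calibrated, smin)

print(hmm_filter_fraction(0.001))            # raised to the minimum fraction
print(hmm_filter_fraction(0.1))              # used as calibrated
print(hmm_filter_fraction(0.7))              # filter turned off
print(hmm_filter_fraction(0.7, always=True)) # capped at the maximum fraction
```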

       The second round of filtering is performed with the CM CYK algorithm (not  an  HMM)  using
       query-dependent banding (QDB) for acceleration.  Briefly, QDB precalculates regions of the
       dynamic programming matrix that have  negligible  probability  based  on  the  query  CM's
       transition  probabilities.  During search, these regions of the matrix are ignored to make
       searches faster.  For more information on QDB see (Nawrocki and Eddy,  PLoS  Computational
       Biology 3(3): e56). The beta parameter is the amount of probability mass considered
       negligible during band calculation. Higher values of beta yield greater speedups but
       also a greater chance of missing the optimal alignment. The default beta is 1E-10,
       determined empirically as a good tradeoff between sensitivity and speed, though this
       value can be changed with the --fil-beta <x> option. If the CM file has been
       calibrated, the QDB filter threshold will be automatically set to an appropriate
       value using an ad-hoc procedure
       (see  the  User's  Guide).  If the CM file has not been calibrated, the default QDB filter
       threshold is 0.0 bits.  The QDB filter threshold can be manually  set  to  bit  score  <x>
       using the --fil-T-qdb <x> option as described below, or to E-value <x> using the
       --fil-E-qdb <x> option. The --fil-E-qdb option will only work if the CM file has been
       calibrated.
       The QDB filter can be turned off with the --fil-no-qdb option.

       Another  way  to  accelerate  cmsearch  is  to  run  it  in  parallel with MPI on multiple
       computers. To do this, use the --mpi option and run cmsearch inside an MPI wrapper program
       such as mpirun.  For example: mpirun C cmsearch --mpi [other options] cmfile seqfile.  The
       cmsearch program must have  been  compiled  in  MPI  mode  for  this  to  work.   See  the
       Installation section of the User's Guide for more information.

       The  --forecast  <n>  option will estimate how long a search will take for your cmfile and
       seqfile on <n> processors. Unless you plan on running cmsearch in MPI mode, <n> should  be
       set as 1.

       Another  technique for accelerated CM homology search with HMM filters is the construction
       and use of a "rigorous filter" HMM which was developed by Zasha Weinberg and Larry  Ruzzo.
       All  hits  above  a  certain  CM  bit  score  threshold  are guaranteed to survive the HMM
       filtering step. Their implementation of rigorous filters has  been  included  in  previous
       versions  of Infernal, but not in the current version. For more information see the User's
       Guide.

OUTPUT

       By default, cmsearch outputs the alignments of search hits  that  score  above  the  final
       search  round  threshold. The format of this output is described in the "Tutorial" section
       of the User's Guide. This format has purposefully not been changed from the  0.x  versions
       of  Infernal so as not to break existing parsers. However, it can be augmented with a line
       of output that marks non-compensatory (negative scoring) basepairs with an  'x'  by  using
       the  -x  option. Alternatively, only negative scoring non-canonical basepairs (those other
       than A:U, U:A, C:G, G:C, U:G, and G:U) are marked if the -v option is enabled.  These  two
       options were added to facilitate quick analysis of the secondary structure of hits by eye.
       Additionally, the -p option can be used to annotate  the  posterior  probability  of  each
       aligned residue in the hit alignments as described below.
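
       The difference between the -x and -v marks can be illustrated with a short Python
       sketch. The function name mark_basepair, the '-' gap character, and the per-basepair
       scores passed in are assumptions for illustration; the half-gapped rule for -x is taken
       from the description of that option in the OPTIONS section.

```python
# Canonical basepairs as listed for the -v option.
CANONICAL = {"AU", "UA", "CG", "GC", "UG", "GU"}
GAP = "-"  # assumed gap character

def mark_basepair(left, right, score, mode):
    """Return the annotation character for one basepair.

    mode "x" mimics -x: mark negative scoring pairs and pairs with a gap
    in exactly one half. mode "v" mimics -v: mark only negative scoring
    pairs that are not canonical. Unmarked pairs get a space.
    How -v treats gapped pairs is not specified in the text; here they
    simply count as non-canonical.
    """
    negative = score < 0
    if mode == "x":
        half_gapped = (left == GAP) != (right == GAP)
        return "x" if negative or half_gapped else " "
    if mode == "v":
        return "v" if negative and (left + right) not in CANONICAL else " "
    raise ValueError("mode must be 'x' or 'v'")

# A negative scoring G:U pair is marked by -x but not by -v.
print(repr(mark_basepair("G", "U", -1.2, "x")))
print(repr(mark_basepair("G", "U", -1.2, "v")))
```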

       The --tabfile <f> option outputs a tabular representation of the hits found by cmsearch to
       the file <f>. Each line that reports a hit has 9 fields:

              "model name": the name of the CM used for the search
              "target name": the name of the target sequence the hit was found in
              "target coord - start": the start position of the hit in the target sequence
              "target coord - stop": the end position of the hit in the target sequence
              "query coord - start": the start position of the hit in the query model
              "query coord - stop": the end position of the hit in the query model
              "bit sc": the bit score of the hit
              "E-value": the E-value of the hit (if available, "-" if not)
              "GC": the percentage of G and C residues in the hit within the target sequence

       cmsearch tab files can be used as input to the Easel miniapp esl-sfetch (included in the
       easel/miniapps/ subdirectory of infernal) with the -C -f --tabfile options to extract all
       the hits from the target database file to a new FASTA file. This file can then be aligned
       to a CM with cmalign.
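
       A script that reads a tab file might look roughly like the Python sketch below. It
       assumes that the 9 fields are separated by whitespace, that names contain no spaces,
       and that any line starting with '#' is a comment or header; none of these details are
       confirmed here, and the sample line is invented for illustration.

```python
from collections import namedtuple

# Field names are hypothetical shorthand for the 9 fields described above.
Hit = namedtuple(
    "Hit", "model target t_start t_stop q_start q_stop bit_score e_value gc"
)

def parse_tab_line(line):
    """Parse one hit line; return None for blank lines and '#' lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    f = line.split()
    if len(f) != 9:
        raise ValueError("expected 9 fields, got %d" % len(f))
    # The E-value field is "-" when no E-value is available.
    e_value = None if f[7] == "-" else float(f[7])
    return Hit(f[0], f[1], int(f[2]), int(f[3]), int(f[4]), int(f[5]),
               float(f[6]), e_value, float(f[8]))

# Invented example line, not real cmsearch output.
sample = "tRNA  seq1  101  172  1  72  45.30  1.2e-08  58"
hit = parse_tab_line(sample)
print(hit.target, hit.t_start, hit.t_stop, hit.bit_score, hit.e_value)
```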

OPTIONS

       -h     Print brief help; includes version number and summary  of  all  options,  including
              expert options.

       -o <f> Save  the  high-scoring  alignments of hits to a file <f>.  The default is to write
              them to standard output.

       -g     Turn on the 'glocal' alignment algorithm, which is local with respect to the target
              database and global with respect to the model. By default, the local alignment
              algorithm is used, which is local with respect to both the target sequence and the
              model. In local mode, the alignment is allowed to span two or more subsequences if
              necessary (e.g. if the structures of the query model and target sequence are only
              partially shared), allowing certain large insertions and deletions in the structure
              to be penalized differently than normal indels. Local mode performs better on
              empirical benchmarks and is significantly more sensitive for remote homology
              detection. Empirically, glocal searches return many fewer hits than local searches,
              so glocal may be desired for some applications.

       -p     Append  posterior  probabilities  to  alignments  of  hits. For more information on
              posterior probabilities see the description of the -p option in the manual page for
              cmalign.

       -x     Annotate negative scoring basepairs and basepairs that include a gap in the left or
              right half of the pair (but not both) with x's in the alignments of hits.  The  x's
              appear  above  the structural annotation in the alignment output. Basepairs without
              x's above them are compensatory with respect to the model.  Compensatory  mutations
              are good evidence for structural homology.

       -v     Very similar to -x, but only mark negative scoring basepairs that are non-canonical
              basepairs (not an A:U, U:A, C:G, G:C, G:U or U:G), and mark them with a 'v' instead
              of an 'x' in the output.

       -Z <x> Calculate  E-values  as  if the target database size was <x> megabases (Mb). Ignore
              the actual size of the database. This option is only valid if the CM file has  been
              calibrated.  Warning:  the  predictions  for timings and survival fractions will be
              calculated as if the database was  of  size  <x>  Mb,  which  means  they  will  be
              inaccurate.

       --toponly
              Only  search the top (Watson) strand of the sequences in seqfile.  By default, both
              strands are searched.

       --bottomonly
              Only search the bottom (Crick) strand of the sequences  in  seqfile.   By  default,
              both strands are searched.

       --forecast <n>
              Predict the running time of the search with provided files and options and exit, DO
              NOT perform the search. This option is only available with calibrated CM files. The
              predictions  should  be  used  as  rough  estimates  and  can be fairly inaccurate,
              especially for highly biased target databases (for example  80%  AT  genomes).  The
              value  for  <n> is the number of processors the search will be run on, so <n> equal
              to 1 is appropriate unless you will run cmsearch in parallel with MPI.

       --informat <s>
              Assert that the input seqfile is in format <s>. Do not run Babelfish format
              autodetection. This increases the reliability of the program somewhat, because the
              Babelfish can make mistakes; particularly recommended for unattended,
              high-throughput runs of Infernal. <s> is case-insensitive. Acceptable formats are:
              FASTA, EMBL, UNIPROT, GENBANK, and DDBJ.

       --mxsize <x>
              Set the maximum allowable DP matrix size to <x> megabytes. By default this size is
              2,048 MB. This should be large enough for the vast majority of alignments; however,
              if it is not, cmsearch will exit prematurely and report an error message that the
              matrix exceeded its maximum allowable size. In this case, the --mxsize option can be
              used to raise the limit.

       --devhelp
              Print help, as with -h, but also include undocumented developer options. These
              options  are  not  listed below, are under development or experimental, and are not
              guaranteed to even work correctly. Use developer options at your own risk. The only
              resources   for  understanding  what  they  actually  do  are  the  brief  one-line
              description printed when --devhelp is enabled, and the source code.

       --mpi  Run as an MPI parallel program. This option will only be available if Infernal  has
              been  configured  and  built  with  the  "--enable-mpi"  flag (see User's Guide for
              details).

EXPERT OPTIONS

       --inside
              Use the Inside algorithm for the final round of searching. This is true by default.

       --cyk  Use the CYK algorithm for the final round of searching.

       --forward
              Search only with an HMM. This is much faster but less sensitive than a  CM  search.
              Use the Forward algorithm for the HMM search.

       --viterbi
              Search  only  with an HMM. This is much faster but less sensitive than a CM search.
              Use the Viterbi algorithm for the HMM search.

       -E <x> Set the E-value cutoff for the per-sequence/strand ranked hit list  to  <x>,  where
              <x>  is a positive real number. Hits with E-values better than (less than) or equal
              to this threshold will be shown. This option is only available if the CM  file  has
              been  calibrated.  This  threshold is relevant only to the final round of searching
              performed after all filters have been used, not to the filter rounds themselves.

       -T <x> Set the bit score cutoff for the per-sequence ranked hit list to <x>, where <x>  is
              a  positive  real  number.   Hits  with  bit scores better than (greater than) this
              threshold will be shown. This threshold is relevant only  to  the  final  round  of
              searching  performed  after  all  filters  have been used, not to the filter rounds
              themselves.

       --nc   Set the bit score cutoff as the NC cutoff value used by Rfam curators as the  noise
              cutoff  score.  This  is  the  highest  scoring hit found by this model during Rfam
              curation that the Rfam curators defined as a noise (false positive) sequence.   The
              NC  cutoff is defined as <x> bits in the original Stockholm alignment the model was
              built from with a line: #=GF NC <x> positioned before the  sequence  alignment.  If
              such  a line existed in the alignment provided to cmbuild then the --nc option will
              be available in cmsearch.  If no such line existed when cmbuild was run, then using
              the  --nc  option  to cmsearch will cause the program to print an error message and
              exit.

       --ga   Set the bit score cutoff as the GA cutoff  value  used  by  Rfam  curators  as  the
              gathering threshold. The GA cutoff is defined in the Stockholm file used to build the
              model in the same way as the NC cutoff (see above), but with a line: #=GF GA <x>

       --tc   Set the bit score cutoff as the TC cutoff  value  used  by  Rfam  curators  as  the
              trusted cutoff. The TC cutoff is defined in the Stockholm file used to build the
              model in the same way as the NC cutoff (see above), but with a line: #=GF TC <x>

       --no-qdb
              Do not use query-dependent banding (QDB) for the final round of search. By default,
              QDB  is used in the final round of search with beta = 1E-15, after all filtering is
              finished.

       --beta <x>
              For query-dependent banding (QDB) during the final round of search,  set  the  beta
              parameter  to  <x> where <x> is any positive real number less than 1.0. Beta is the
              probability mass considered negligible during band calculation.  The  default  beta
              for the final round of search is 1E-15.

       --hbanded
              Use  HMM  bands  to  accelerate  the  final round of search. Constraints for the CM
              search  are  derived  from  posterior  probabilities  from  an  HMM.   This  is  an
              experimental  option and it is not recommended for use unless you know exactly what
              you're doing.

       --tau <x>
              Set the tail loss probability during HMM band calculation  to  <x>.   This  is  the
              amount  of  probability  mass  within  the  HMM  posterior  probabilities  that  is
              considered negligible. The default value is 1E-7.  In general, higher  values  will
              result  in  greater  acceleration,  but  increase the chance of missing the optimal
              alignment due to the HMM bands. This option only makes sense  in  combination  with
              --hbanded.

       --fil-no-hmm
              Turn the HMM filter off.

       --fil-no-qdb
              Turn the QDB filter off.

       --fil-beta <x>
              For  the  QDB  filter, set the beta parameter to <x> where <x> is any positive real
              number less than 1.0. Beta is the probability  mass  considered  negligible  during
              band calculation. The default beta for the QDB filter round of search is 1E-10.

       --fil-T-qdb <x>
              Set  the  bit score cutoff for the QDB filter round to <x>, where <x> is a positive
              real number.  Hits with bit scores better than (greater than) this  threshold  will
              survive the QDB filter and be passed to the final round.

       --fil-T-hmm <x>
              Set  the  bit score cutoff for the HMM filter round to <x>, where <x> is a positive
              real number.  Hits with bit scores better than (greater than) this  threshold  will
              survive  the HMM filter and be passed to the next round, either a QDB filter round,
              or if the QDB filter is disabled, to the final round of search.

       --fil-E-qdb <x>
              Set the E-value cutoff for the QDB filter round to <x>, where <x> is a positive real
              number. Hits with E-values better than (less than) or equal to this threshold will
              survive and be passed to the final round. This option is only available if the CM
              file has been calibrated.

       --fil-E-hmm <x>
              Set the E-value cutoff for the HMM filter round to <x>, where <x> is a positive real
              number. Hits with E-values better than (less than) or equal to this threshold will
              survive and be passed to the next round, either a QDB filter round, or if the QDB
              filter is disabled, to the final round of search. This option is only available if
              the CM file has been calibrated.

       --fil-S-hmm <x>
              Set the bit score cutoff for the HMM filter round as the score that will allow a
              predicted <x> fraction of the database to survive the HMM filter round, where <x>
              is a positive real number between 0 and 1. This option is only available if the CM
              file has been calibrated.

       --fil-Smax-hmm <x>
              When  using  automatically  calibrated HMM thresholds for a CM file calibrated with
              cmcalibrate, set the maximum HMM filter threshold as the score that  will  allow  a
              predicted  <x>  fraction  of  the  database to survive the filter. If the automatic
              threshold from cmcalibrate exceeds this value, turn the HMM filter off and  do  not
              use  it for the search. By default, this option is ON with the default value of 0.5
              used for <x>.  To modify the behavior of this option so it does not  turn  off  the
              HMM filter if exceeded use the --fil-A-hmm option described below.

       --fil-Smin-hmm <x>
              When  using  automatically  calibrated HMM thresholds for a CM file calibrated with
              cmcalibrate, set the minimum HMM filter threshold as the score that  will  allow  a
              predicted  <x>  fraction  of  the  database to survive the filter. By default, this
              option is ON with the default value of 0.02 used for <x>.  Setting <x>  lower  will
              only accelerate the majority of searches by a small amount.

       --fil-A-hmm
              Always  enforce  the  maximum  HMM filter threshold of <x> from --fil-Smax-hmm <x>.
              That is, never turn off the HMM filter, or set its threshold above the  score  that
              will  allow a predicted <x> fraction of the database to survive. This option is OFF
              by default.

       --hmm-W <n>
              Set the HMM window size W (maximum size of a hit) to <n>.  This option  only  works
              in   combination  with  --forward  or  --viterbi.   By  default,  W  is  calculated
              automatically, but this automatic calculation is time consuming for large models.

       --hmm-cW <x>
              Set the HMM window size W (maximum size of a hit) as <x> times the consensus length
              of the CM. The consensus length (clen) of the CM can be determined using the cmstat
              program.  This option only works in combination with --forward  or  --viterbi.   By
              default,  W  is  calculated  automatically,  but this automatic calculation is time
              consuming for large models. To find potential full length hits  to  the  model  <x>
              should be greater than 1.0, but values above 2.0 are probably wasteful.

       --noalign
              Do not calculate and print alignments of each hit, only print locations and scores.

       --aln-hbanded
              Use HMM bands to accelerate alignment during the hit alignment stage.

       --aln-optacc
              Calculate  alignments of hits from final round of search using the optimal accuracy
              algorithm  which  computes  the  alignment  that  maximizes  the  summed  posterior
              probability  of  all  aligned residues given the model, which can be different from
              the highest scoring one.

       --tabfile <f>
              Create a new output file <f> and print tabular results to it.  The  format  of  the
              tabular  results  is  listed in the OUTPUT section. The tabular results can be more
              easily parsed by scripts than the default cmsearch output. The  esl-sfetch  miniapp
              included  in  the  easel/miniapps/  subdirectory of infernal has a --tabfile option
              that allows it to read cmsearch tab files and fetch the hits reported  within  them
              from the target database into a new sequence file.

       --gcfile <f>
              Create  a  new  output  file  <f>  and  print  statistics  of the GC content of the
              sequences in seqfile to it. The sequences are partitioned into 100 nt
              non-overlapping windows, and the GC percentage of each window is calculated. A
              normalized histogram of those GC percentages is then printed to <f>. This file can
              be generated even if cmsearch is run with --forecast and no search is performed.

       --rna  Output the hit alignments as RNA sequence alignments. This is true by default.

       --dna  Output the hit alignments as DNA sequence alignments.
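
       The windowed GC calculation described for the --gcfile option above can be sketched in
       Python as follows. This is an illustration under stated assumptions, not the format
       cmsearch writes: the function name gc_histogram is hypothetical, a trailing partial
       window is simply dropped, and the histogram uses one bin per integer GC percentage.

```python
def gc_histogram(seq, window=100):
    """Return a normalized histogram of per-window GC percentages.

    The sequence is cut into non-overlapping windows of `window` nt.
    The result has 101 bins (0% to 100%); each value is the fraction of
    windows whose GC percentage falls in that bin.
    """
    counts = [0] * 101
    n_windows = 0
    for i in range(0, len(seq) - window + 1, window):
        chunk = seq[i:i + window].upper()
        gc = sum(1 for c in chunk if c in "GC")
        counts[(100 * gc) // window] += 1
        n_windows += 1
    if n_windows == 0:
        return [0.0] * 101
    return [c / n_windows for c in counts]

# Invented sequence: one all-G window, one all-A window, one 50% GC
# window, and 4 trailing residues that do not fill a window.
seq = "G" * 100 + "A" * 100 + "GC" * 25 + "AT" * 25 + "ACGT"
hist = gc_histogram(seq)
print({pct: frac for pct, frac in enumerate(hist) if frac > 0})
```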

SEE ALSO

       For  complete  documentation,  see  the  User's  Guide  (Userguide.pdf) that came with the
       distribution; or see the Infernal web page, http://infernal.janelia.org/.

COPYRIGHT

       Copyright (C) 2009 HHMI Janelia Farm Research Campus.
       Freely distributed under the GNU General Public License (GPLv3).
       See the file COPYING that came with the source for details on redistribution conditions.

AUTHOR

       Eric Nawrocki, Diana Kolbe, and Sean Eddy
       HHMI Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147
       http://selab.janelia.org/