Provided by: infernal_1.1.5-3_amd64 bug

NAME

       cmsearch - search covariance model(s) against a sequence database

SYNOPSIS

       cmsearch [options] <cmfile> <seqdb>

DESCRIPTION

       cmsearch  is  used  to  search  one  or  more  covariance  models (CMs) against a sequence
       database.  For each CM in <cmfile>, use that query CM to search  the  target  database  of
       sequences  in  <seqdb>, and output ranked lists of the sequences with the most significant
       matches to the CM.  To build CMs from multiple alignments, see cmbuild.

       The query <cmfile> must have been calibrated for E-values with cmcalibrate.  As a  special
       exception,  any  models  in  <cmfile> that have zero basepairs need not be calibrated. For
       these models, profile HMM search algorithms will be used instead of CM ones, as  discussed
       further below.

       The query <cmfile> may be '-' (a dash character), in which case the query CM input will be
       read from a <stdin> pipe instead of from a file. The <seqdb> may not be  '-'  because  the
       current implementation needs to be able to rewind the database, which is not possible with
       stdin input.

       The output format is designed to be  human-readable,  but  is  often  so  voluminous  that
       reading it is impractical, and parsing it is a pain. The --tblout option saves output in a
       simple tabular format that  is  concise  and  easier  to  parse.   The  -o  option  allows
       redirecting the main output, including throwing it away in /dev/null.

       cmsearch reexamines the 5' and 3' termini of target sequences using specialized algorithms
       for detection of truncated hits, in which part of the 5' and/or 3' end of the actual  full
       length  homologous  sequence  is  missing in the target sequence file. These types of hits
       will be most common in sequence files  consisting  of  unassembled  sequencing  reads.  By
       default,  any  5'  truncated  hit  is  required to include the first residue of the target
       sequence it derives from in <seqdb>, and any 3' truncated hit is required to  include  the
       final  residue  of  the  target sequence it derives from. Any 5' and 3' truncated hit must
       include the first and final residue of the target sequence it derives from. The --anytrunc
       option  will relax the requirements for hit inclusion of sequence endpoints, and truncated
       hits are allowed to start and stop at  any  positions  of  target  sequences.  Importantly
       though, with --anytrunc, hit E-values will be less accurate because model calibration does
       not consider the possibility of truncated hits, so use it  with  caution.   The  --notrunc
       option can be used to turn off truncated hit detection.  --notrunc will reduce the running
       time of cmsearch, most significantly for target <seqdb>  files  that  include  many  short
       sequences.

       Truncated  hit  detection  is  automatically turned off when the --max, --nohmm, --qdb, or
       --nonbanded options are used because it relies on the use of  an  accelerated  HMM  banded
       alignment strategy that is turned off by any of those options.

OPTIONS

       -h     Help; print a brief reminder of command line usage and all available options.

       -g     Turn  on the glocal alignment algorithm, global with respect to the query model and
              local with respect  to  the  target  database.  By  default,  the  local  alignment
              algorithm  is  used which is local with respect to both the target sequence and the
              model. In local mode, the alignment to span two or more subsequences  if  necessary
              (e.g.  if  the structures of the query model and target sequence are only partially
              shared), allowing certain large insertions and deletions in  the  structure  to  be
              penalized  differently  than normal indels. Local mode performs better on empirical
              benchmarks and is significantly  more  sensitive  for  remote  homology  detection.
              Empirically,  glocal searches return many fewer hits than local searches, so glocal
              may be desired for some applications. With -g, all models must be calibrated,  even
              those with zero basepairs.

       -Z <x> Calculate  E-values as if the search space size was <x> megabases (Mb). Without the
              use of this option, the search space  size  is  defined  as  the  total  number  of
              nucleotides  in  <seqdb> times 2, because both strands of each target sequence will
              be searched.

       --devhelp
              Print help, as with -h , but also include expert options  that  are  not  displayed
              with  -h  .   These  expert  options  are  not expected to be relevant for the vast
              majority of users and so are not described in the manual page.  The only  resources
              for  understanding what they actually do are the brief one-line descriptions output
              when --devhelp is enabled, and the source code.

OPTIONS FOR CONTROLLING OUTPUT

       -o <f> Direct the main human-readable output to a file <f> instead of the default stdout.

       -A <f> Save a multiple alignment of  all  significant  hits  (those  satisfying  inclusion
              thresholds) to the file <f>.

       --tblout <f>
              Save  a  simple tabular (space-delimited) file summarizing the hits found, with one
              data line per hit. The format of this  file  is  described  in  section  6  of  the
              Infernal user guide.

       --acc  Use  accessions  instead  of names in the main output, where available for profiles
              and/or sequences.

       --noali
              Omit the alignment section from the main output. This can greatly reduce the output
              volume.

       --notextw
              Unlimit  the  length of each line in the main output. The default is a limit of 120
              characters per line, which helps in displaying the output cleanly on terminals  and
              in editors, but can truncate target profile description lines.

       --textw <n>
              Set  the main output's line length limit to <n> characters per line. The default is
              120.

       --verbose
              Include extra search pipeline statistics  in  the  main  output,  including  filter
              survival  statistics  for truncated hit detection and number of envelopes discarded
              due to matrix size overflows.

       --nomiss
              With -A, in the output alignment, do not mark  truncated  hits  as  fragmentary  by
              using  missing data '~' characters before the first residue (for 5' truncated hits)
              and/or after the final residue (for 3' truncated hits). By  default,  missing  data
              characters  ('~') will be used in this way to mark truncated hits as fragments. The
              cmbuild program can use these fragment definitions in input alignment files if  the
              --fraggiven option is used.

OPTIONS CONTROLLING REPORTING THRESHOLDS

       Reporting  thresholds control which hits are reported in output files (the main output and
       --tblout) Hits are ranked by statistical significance (E-value).   By  default,  all  hits
       with an E-value <= 10 are reported.  The following options allow you to change the default
       E-value reporting thresholds, or to use bit score thresholds instead.

       -E <x> In the per-target output, report target sequences with an E-value of <=  <x>.   The
              default is 10.0, meaning that on average, about 10 false positives will be reported
              per query, so you can see the top of the noise and  decide  for  yourself  if  it's
              really noise.

       -T <x> Instead  of  thresholding  per-CM output on E-value, report target sequences with a
              bit score of >= <x>.

OPTIONS FOR INCLUSION THRESHOLDS

       Inclusion thresholds are stricter than reporting thresholds.  Inclusion thresholds control
       which  hits  are considered to be reliable enough to be included in an output alignment or
       in a possible subsequent search round, or  marked  as  significant  ("!")  as  opposed  to
       questionable ("?") in hit output.

       --incE <x>
              Use  an  E-value  of  <=  <x> as the hit inclusion threshold.  The default is 0.01,
              meaning that on average, about 1 false positive would  be  expected  in  every  100
              searches with different query sequences.

       --incT <x>
              Instead  of  using  E-values for setting the inclusion threshold, instead use a bit
              score of >= <x> as the hit inclusion threshold.  By default this option is unset.

OPTIONS FOR MODEL-SPECIFIC SCORE THRESHOLDING

       Curated CM databases may define specific bit score thresholds for each CM, superseding any
       thresholding based on statistical significance alone.

       To  use  these  options,  the  profile  must  contain  the appropriate (GA, TC, and/or NC)
       optional score threshold annotation; this is picked up by cmbuild  from  Stockholm  format
       alignment  files.  Each thresholding option has a score of <x> bits, and acts as if -T <x>
       --incT <x> has been applied specifically using each model's curated thresholds.

       --cut_ga
              Use the GA (gathering) bit scores in the model to set hit reporting  and  inclusion
              thresholds.  GA  thresholds  are  generally  considered  to be the reliable curated
              thresholds defining family membership;  for  example,  in  Rfam,  these  thresholds
              define  what gets included in Rfam Full alignments based on searches with Rfam Seed
              models.

       --cut_nc
              Use the NC (noise cutoff) bit score thresholds in the model to  set  hit  reporting
              and inclusion thresholds. NC thresholds are generally considered to be the score of
              the highest-scoring known false positive.

       --cut_tc
              Use the TC (trusted cutoff) bit score thresholds in the model to set hit  reporting
              and inclusion thresholds. TC thresholds are generally considered to be the score of
              the lowest-scoring known true positive that is above all known false positives.

OPTIONS CONTROLLING THE ACCELERATION PIPELINE

       Infernal 1.1 searches are accelerated in a  six-stage  filter  pipeline.  The  first  five
       stages  use  a  profile  HMM  to  define envelopes that are passed to the stage six CM CYK
       filter. Any envelopes that survive all filters are assigned final scores using the the  CM
       Inside algorithm. (See the user guide for more information.)

       The profile HMM filter is built by the cmbuild program and is stored in <cmfile>.

       Each  successive  filter  is  slower  than  the  previous  one,  but  better  than  it  at
       disciminating between subsequences that may contain high-scoring CM hits and those that do
       not. The first three HMM filter stages are the same as those used in HMMER3.  Stage 1 (F1)
       is the local HMM SSV filter modified for long sequences. Stage 2 (F2)  is  the  local  HMM
       Viterbi  filter.  Stage  3  (F3)  is the local HMM Forward filter. Each of the first three
       stages uses the profile HMM in local mode, which allows a target subsequence to  align  to
       any  region  of  the  HMM.  Stage  4  (F4) is a glocal HMM filter, which requires a target
       subsequence to align to the full-length profile HMM.  Stage  5  (F5)  is  the  glocal  HMM
       envelope definition filter, which uses HMMER3's domain identification heursitics to define
       envelope boundaries. After each stage from 2 to 5 a bias filter step (F2b, F3b,  F4b,  and
       F5b)  is  used  to  remove  sequences  that appear to have passed the filter due to biased
       composition alone. Any envelopes that survive stages F1 through F5b are then  passed  with
       the  local  CM  CYK  filter.  The  CYK filter uses constraints (bands) derived from an HMM
       alignment of the envelope to reduce the number of required  calculations  and  save  time.
       Any envelopes that pass CYK are scored with the local CM Inside algorithm, again using HMM
       bands for acceleration.

       The default filter thresholds that define the minimum score required for a subsequence  to
       survive  each  stage are defined based on the size of the database in <seqdb> (or the size
       <x> in megabases (Mb) specified by the -Z <x> or --FZ <x> options).  For larger databases,
       the filters are more strict leading to more acceleration but potentially a greater loss of
       sensitivity. The rationale is that for larger databases, hits must have higher  scores  to
       achieve  statistical  significance,  so  stricter  filtering  that  removes  lower scoring
       insignificant hits is acceptable.

       The P-value thresholds for all possible search space  sizes  and  all  filter  stages  are
       listed  next.  (A  P-value  threshold of 0.01 means that roughly 1% of the highest scoring
       nonhomologous subsequence are expected to pass the filter.) Z is defined as the number  of
       nucleotides  in  the  complete  target  sequence file times 2 because both strands will be
       searched with each model.

       If Z is less than 2 Mb: F1 is 0.35; F2 and F2b are off; F3, F3b, F4, F4b and F5 are  0.02;
       F6 is 0.0001.

       If  Z  is  between 2 Mb and 20 Mb: F1 is 0.35; F2 and F2b are off; F3, F3b, F4, F4b and F5
       are 0.005; F6 is 0.0001.

       If Z is between 20 Mb and 200 Mb: F1 is 0.35; F2 and F2b are 0.15; F3, F3b, F4, F4b and F5
       are 0.003; F6 is 0.0001.

       If  Z  is  between 200 Mb and 2 Gb: F1 is 0.15; F2 and F2b are 0.15; F3, F3b, F4, F4b, F5,
       and F5b are 0.0008; and F6 is 0.0001.

       If Z is between 2 Gb and 20 Gb: F1 is 0.15; F2 and F2b are 0.15; F3, F3b, F4, F4b, F5, and
       F5b are 0.0002; and F6 is 0.0001.

       If  Z  is  more than 20 Gb: F1 is 0.06; F2 and F2b are 0.02; F3, F3b, F4, F4b, F5, and F5b
       are 0.0002; and F6 is 0.0001.

       These thresholds were chosen based on performance on an internal  benchmark  testing  many
       different possible settings.

       There  are five options for controlling the general filtering level. These options are, in
       order from least strict (slowest but most sensitive) to most  strict  (fastest  but  least
       sensitive):  --max, --nohmm, --mid, --default, (this is the default setting), --rfam.  and
       --hmmonly.  With --default the filter thresholds will be database-size dependent. See  the
       explanation of each of these individual options below for more information.

       Additionally,  an expert user can precisely control each filter stage score threshold with
       the --F1, --F1b, --F2, --F2b, --F3, --F3b, --F4, --F4b, --F5, --F5b, and --F6 options.  As
       well  as  turn  each  stage  on  or off with the --noF1, --doF1b, --noF2, --noF2b, --noF3,
       --noF3b, --noF4, --noF4b, --noF5, and --noF6.  options.  These options are only  displayed
       if  the  --devhelp  option  is  used  to  keep  the  number  of  displayed options with -h
       reasonable, and because they are only expected to be useful to a small minority of users.

       As a special case, for any models in <cmfile>  which  have  zero  basepairs,  profile  HMM
       searches  are  run  instead  of  CM  searches.  HMM  algorithms are more efficient than CM
       algorithms, and the benefit of  CM  algorithms  is  lost  for  models  with  no  secondary
       structure  (zero basepairs). These profile HMM searches will run significantly faster than
       the CM searches. You can force HMM-only searches  with  the  --hmmonly  option.  For  more
       information  on  HMM-only  searches see the description of the --hmmonly option below, and
       the user guide.

       --max  Turn off all filters,  and  run  non-banded  Inside  on  every  full-length  target
              sequence. This increases sensitivity somewhat, at an extremely large cost in speed.

       --nohmm
              Turn  off  all HMM filter stages (F1 through F5b). The CYK filter, using QDBs, will
              be run on every full-length target sequence and will enforce a P-value threshold of
              0.0001.  Each  subsequence  that  survives CYK will be passed to Inside, which will
              also use QDBs (but a looser set). This increases sensitivity somewhat,  at  a  very
              large cost in speed.

       --mid  Turn off the HMM SSV and Viterbi filter stages (F1 through F2b).  Set remaining HMM
              filter thresholds (F3 through F5b) to 0.02 by default, but changeable to  <x>  with
              --Fmid <x> sequence. This may increase sensitivity, at a significant cost in speed.

       --default
              Use  the  default  filtering  strategy.  This  option  is on by default. The filter
              thresholds are determined based on the database size.

       --rfam Use a strict filtering strategy devised for large databases (more than 20 Gb). This
              will  accelerate  the  search  at  a potential cost to sensitivity. It will have no
              effect if the database is larger than 20 Gb.

       --hmmonly
              Only use the filter profile HMM for searches, do  not  use  the  CM.   Only  filter
              stages  F1  through  F3 will be executed, using strict P-value thresholds (0.02 for
              F1, 0.001 for F2 and 0.00001 for F3).  Additionally a bias  composition  filter  is
              used  after  the  F1 stage (with P=0.02 survival threshold).  Any hit that survives
              all stages and has an HMM E-value or bit score above the reporting  threshold  will
              be  output.   The  user  can change the HMM-only filter thresholds and options with
              --hmmF1, --hmmF2, --hmmF3, --hmmnobias, --hmmnonull2, and  --hmmmax.   By  default,
              searches  for  any model with zero basepairs will be run in HMM-only mode. This can
              be turned off, forcing CM searches for these models with  the  --nohmmonly  option.
              These options are only displayed if the --devhelp option is used.

       --FZ <x>
              Set filter thresholds as the defaults used if the database were <x> megabases (Mb).
              If used with <x> greater than 20000 (20 Gb) this option  has  the  same  effect  as
              --rfam.

       --Fmid <x>
              With  the  --mid  option set the HMM filter thresholds (F3 through F5b) to <x>.  By
              default, <x> is 0.02.

OTHER OPTIONS

       --notrunc
              Turn off truncated hit detection.

       --anytrunc
              Allow truncated hits to begin and end at any position  in  a  target  sequence.  By
              default,  5' truncated hits must include the first residue of their target sequence
              and 3' truncated hits must include the final residue of their target sequence. With
              this option you may observe fewer full length hits that extend to the beginning and
              end of the query CM. As of version 1.1.5,  truncated  hits  that  end  at  sequence
              terminii  with  a  lower  score  penalty  than  internally  truncated hits are also
              considered (these were not  considered  in  1.1x  versions  prior  to  1.1.5).   To
              reproduce  the  behavior  of  this  option  from  v1.1.4, use the --inttrunc option
              instead.

       --nonull3
              Turn off the null3 CM score corrections for biased composition. This correction  is
              not used during the HMM filter stages.

       --mxsize <x>
              Set  the maximum allowable CM DP matrix size to <x> megabytes. By default this size
              is 128 Mb.  This should  be  large  enough  for  the  vast  majority  of  searches,
              especially  with  smaller models.  If cmsearch encounters an envelope in the CYK or
              Inside stage that requires a larger matrix, the envelope will  be  discounted  from
              consideration.  This  behavior is like an additional filter that prevents expensive
              (slow) CM DP calculations, but at a potential cost to sensitivity.   Note  that  if
              cmsearch  is  being  run  in  <n> multiple threads on a multicore machine then each
              thread may have an allocated matrix of up to size <x> Mb at any given time.

       --smxsize <x>
              Set the maximum allowable CM search DP matrix size to  <x>  megabytes.  By  default
              this  size  is  128  Mb.   This  option is only relevant if the CM will not use HMM
              banded matrices, i.e.  if  the  --max,  --nohmm,  --qdb,  --fqdb,  --nonbanded,  or
              --fnonbanded  options  are  also  used.  Note  that if cmsearch is being run in <n>
              multiple threads on a multicore machine then each  thread  may  have  an  allocated
              matrix of up to size <x> Mb at any given time.

       --cyk  Use the CYK algorithm, not Inside, to determine the final score of all hits.

       --acyk Use the CYK algorithm to align hits. By default, the Durbin/Holmes optimal accuracy
              algorithm is used, which finds the alignment that maximizes the  expected  accuracy
              of all aligned residues.

       --wcx <x>
              For  each  CM,  set  the  W parameter, the expected maximum length of a hit, to <x>
              times the consensus length of the model. By default, the W parameter is  read  from
              the  CM  file and was calculated based on the transition probabilities of the model
              by cmbuild.  You can find out what the default W is for a model using cmstat.  This
              option  should be used with caution as it impacts the filtering pipeline at several
              different stages in nonobvious ways.  It  is  only  recommended  for  expert  users
              searching  for hits that are much longer than any of the homologs used to build the
              model in cmbuild, e.g. ones with large introns or  other  large  insertions.   This
              option  cannot  be  used  in  combination with the --nohmm, --fqdb or --qdb options
              because in those cases W is limited by query-dependent bands.

       --toponly
              Only search the top (Watson) strand of target sequences in  <seqdb>.   By  default,
              both strands are searched. This will halve the database size (Z).

       --bottomonly
              Only  search the bottom (Crick) strand of target sequences in <seqdb>.  By default,
              both strands are searched.  This will halve the database size (Z).

       --tformat <s>
              Assert that the target sequence database file is in format <s>.   Accepted  formats
              include  fasta, embl, genbank, ddbj, stockholm, pfam, a2m, afa, clustal, and phylip
              The default is to autodetect the format of the file.

       --cpu <n>
              Set the number of parallel worker threads  to  <n>.   On  multicore  machines,  the
              default is 4.  You can also control this number by setting an environment variable,
              INFERNAL_NCPU.  There is also a master thread, so the actual number of threads that
              Infernal  spawns  is  <n>+1.  This option is not available if Infernal was compiled
              with POSIX threads support turned off.

       --stall
              For debugging the MPI master/worker version:  pause  after  start,  to  enable  the
              developer  to  attach debuggers to the running master and worker(s) processes. Send
              SIGCONT signal to release the pause.   (Under  gdb:  (gdb)  signal  SIGCONT)  (Only
              available if optional MPI support was enabled at compile-time.)

       --mpi  Run  in MPI master/worker mode, using mpirun.  To use --mpi, the sequence file must
              have first been 'indexed' using the esl-sfetch  program,  which  is  included  with
              Infernal,  in  the  easel/miniapps/  subdirectory.  (Only available if optional MPI
              support was enabled at compile-time.)

SEE ALSO

       See infernal(1) for a master man page with a list of all  the  individual  man  pages  for
       programs in the Infernal package.

       For  complete  documentation, see the user guide that came with your Infernal distribution
       (Userguide.pdf); or see the Infernal web page (http://eddylab.org/infernal/).

COPYRIGHT

       Copyright (C) 2023 Howard Hughes Medical Institute.
       Freely distributed under the BSD open source license.

       For additional information on copyright and licensing, see the file  called  COPYRIGHT  in
       your    Infernal    source    distribution,    or    see    the    Infernal    web    page
       (http://eddylab.org/infernal/).

AUTHOR

       http://eddylab.org