xenial (1) cmscan.1.gz

Provided by: infernal_1.1.1-3_amd64 bug

NAME

       cmscan - search sequence(s) against a covariance model database

SYNOPSIS

       cmscan [options] <cmdb> <seqfile>

DESCRIPTION

       cmscan  is  used  to  search  sequences  against  collections of covariance models.  For each sequence in
       <seqfile>, use that query sequence to search the target database of CMs  in  <cmdb>,  and  output  ranked
       lists of the CMs with the most significant matches to the sequence.

       The  <seqfile>  may  contain  more  than  one query sequence. It can be in FASTA format, or several other
       common sequence file formats (genbank, embl, and among others), or in alignment file formats  (stockholm,
       aligned fasta, and others). See the --qformat option for a complete list.

       The  <cmdb>  needs to be press'ed using cmpress before it can be searched with cmscan.  This creates four
       binary files, suffixed .i1{fimp}.  Additionally, <cmdb> must  have  been  calibrated  for  E-values  with
       cmcalibrate before being press'ed with cmpress.

       The  query  <seqfile>  may  be  '-' (a dash character), in which case the query sequences are read from a
       <stdin> pipe instead of from a file.  The <cmdb> cannot be read from a <stdin> stream, because  it  needs
       to have those four auxiliary binary files generated by cmpress.

       The  output  format  is  designed  to  be  human-readable,  but is often so voluminous that reading it is
       impractical, and parsing it is a pain. The --tblout option saves output in a simple tabular  format  that
       is  concise and easier to parse.  The -o option allows redirecting the main output, including throwing it
       away in /dev/null.

       cmscan reexamines the 5' and 3' termini of target sequences using specialized algorithms for detection of
       truncated  hits,  in  which part of the 5' and/or 3' end of the actual full length homologous sequence is
       missing in the target sequence file. These types of hits will be most common in sequence files consisting
       of  unassembled  sequencing  reads.  By  default,  any  5' truncated hit is required to include the first
       residue of the target sequence it derives from in <seqfile>, and any 3'  truncated  hit  is  required  to
       include  the  final  residue  of  the  target  sequence it derives from. Any 5' and 3' truncated hit must
       include the first and final residue of the target sequence it derives from. The  --anytrunc  option  will
       relax  the  requirements for hit inclusion of sequence endpoints, and truncated hits are allowed to start
       and stop at any positions of target sequences.  Importantly though, with --anytrunc, hit E-values will be
       less  accurate  because  model calibration does not consider the possibility of truncated hits, so use it
       with caution.  The --notrunc option can be used to turn off  truncated  hit  detection.   --notrunc  will
       reduce  the running time of cmscan, most significantly for target <seqfile> files that include many short
       sequences.  Truncated hit detection is automatically turned  off  when  the  --max,  --nohmm,  --qdb,  or
       --nonbanded options are used because it relies on the use of an accelerated HMM banded alignment strategy
       that is turned off by any of those options.

OPTIONS

       -h     Help; print a brief reminder of command line usage and all available options.

       -g     Turn on the glocal alignment algorithm, global with respect to the  query  model  and  local  with
              respect  to  the target database. By default, the local alignment algorithm is used which is local
              with respect to both the target sequence and the model. In local mode, the alignment to  span  two
              or  more  subsequences if necessary (e.g. if the structures of the query model and target sequence
              are only partially shared), allowing certain large insertions and deletions in the structure to be
              penalized  differently  than normal indels. Local mode performs better on empirical benchmarks and
              is significantly more sensitive for remote homology detection. Empirically, glocal searches return
              many fewer hits than local searches, so glocal may be desired for some applications.

       -Z <x> Calculate  E-values  as  if  the search space size was <x> megabases (Mb). Without the use of this
              option, the search space size changes for each query sequence, it is defined as the length of  the
              current  query  sequence times 2 (because both strands of the sequence will be searched) times the
              number of CMs in <cmdb>.

       --devhelp
              Print help, as with -h , but also include expert options that are not displayed with -h  .   These
              expert  options  are  not  expected  to  be relevant for the vast majority of users and so are not
              described in the manual page.  The only resources for understanding what they actually do are  the
              brief one-line descriptions output when --devhelp is enabled, and the source code.

OPTIONS FOR CONTROLLING OUTPUT

       -o <f> Direct the main human-readable output to a file <f> instead of the default stdout.

       --tblout <f>
              Save  a  simple  tabular (space-delimited) file summarizing the hits found, with one data line per
              hit.  The format of this file is described in the Infernal user guide.

       --acc  Use accessions instead of names in the main output, where available for profiles and/or sequences.

       --noali
              Omit the alignment section from the main output. This can greatly reduce the output volume.

       --notextw
              Unlimit the length of each line in the main output. The default is a limit of 120  characters  per
              line,  which  helps in displaying the output cleanly on terminals and in editors, but can truncate
              target profile description lines.

       --textw <n>
              Set the main output's line length limit to <n> characters per line. The default is 120.

       --verbose
              Include extra search pipeline statistics in the main output, including filter survival  statistics
              for truncated hit detection and number of envelopes discarded due to matrix size overflows.

OPTIONS CONTROLLING REPORTING THRESHOLDS

       Reporting  thresholds control which hits are reported in output files (the main output and --tblout) Hits
       are ranked by statistical significance (E-value).  By default,  all  hits  with  an  E-value  <=  10  are
       reported.   The following options allow you to change the default E-value reporting thresholds, or to use
       bit score thresholds instead.

       -E <x> In the per-target output, report target sequences with an E-value of <= <x>.  The default is 10.0,
              meaning  that  on average, about 10 false positives will be reported per query, so you can see the
              top of the noise and decide for yourself if it's really noise.

       -T <x> Instead of thresholding per-CM output on E-value, report target sequences with a bit score  of  >=
              <x>.

OPTIONS FOR INCLUSION THRESHOLDS

       Inclusion thresholds are stricter than reporting thresholds.  Inclusion thresholds control which hits are
       considered to be reliable enough to be included in a possible  subsequent  search  round,  or  marked  as
       significant ("!") as opposed to questionable ("?") in hit output.

       --incE <x>
              Use  an  E-value  of  <= <x> as the hit inclusion threshold.  The default is 0.01, meaning that on
              average, about 1 false positive would be expected in  every  100  searches  with  different  query
              sequences.

       --incT <x>
              Instead  of  using E-values for setting the inclusion threshold, instead use a bit score of >= <x>
              as the hit inclusion threshold.  By default this option is unset.

OPTIONS FOR MODEL-SPECIFIC SCORE THRESHOLDING

       Curated CM databases may define specific bit score thresholds for each CM, superseding  any  thresholding
       based on statistical significance alone.

       To  use  these  options,  the  profile  must  contain  the appropriate (GA, TC, and/or NC) optional score
       threshold annotation; this  is  picked  up  by  cmbuild  from  Stockholm  format  alignment  files.  Each
       thresholding  option  has  a  score  of  <x>  bits,  and  acts  as  if -T <x> --incT <x> has been applied
       specifically using each model's curated thresholds.

       --cut_ga
              Use the GA (gathering) bit scores in the model to set hit reporting and inclusion  thresholds.  GA
              thresholds  are  generally  considered  to  be  the  reliable  curated  thresholds defining family
              membership; for example, in Rfam,  these  thresholds  define  what  gets  included  in  Rfam  Full
              alignments based on searches with Rfam Seed models.

       --cut_nc
              Use  the  NC  (noise  cutoff) bit score thresholds in the model to set hit reporting and inclusion
              thresholds. NC thresholds are generally considered to be the score of  the  highest-scoring  known
              false positive.

       --cut_tc
              Use  the  TC (trusted cutoff) bit score thresholds in the model to set hit reporting and inclusion
              thresholds. TC thresholds are generally considered to be the score  of  the  lowest-scoring  known
              true positive that is above all known false positives.

OPTIONS CONTROLLING THE ACCELERATION PIPELINE

       Infernal 1.1 searches are accelerated in a six-stage filter pipeline. The first five stages use a profile
       HMM to define envelopes that are passed to the stage six CM CYK filter. Any envelopes  that  survive  all
       filters are assigned final scores using the the CM Inside algorithm.

       The profile HMM filter is built by the cmbuild program and is stored in <cmfile>.

       Each  successive  filter  is  slower  than  the previous one, but better than it at disciminating between
       subsequences that may contain high-scoring CM hits and those that do not.  The  first  three  HMM  filter
       stages  are the same as those used in HMMER3.  Stage 1 (F1) is the local HMM SSV filter modified for long
       sequences. Stage 2 (F2) is the local HMM Viterbi filter. Stage 3 (F3) is the local  HMM  Forward  filter.
       Each  of  the first three stages uses the profile HMM in local mode, which allows a target subsequence to
       align to any region of the HMM. Stage 4 (F4) is a glocal HMM filter, which requires a target  subsequence
       to align to the full-length profile HMM. Stage 5 (F5) is the glocal HMM envelope definition filter, which
       uses HMMER3's domain identification heursitics to define envelope boundaries. After each stage from 2  to
       5  a bias filter step (F2b, F3b, F4b, and F5b) is used to remove sequences that appear to have passed the
       filter due to biased composition alone. Any envelopes that survive stages F1 through F5b are then  passed
       with  the  local  CM CYK filter. The CYK filter uses constraints (bands) derived from an HMM alignment of
       the envelope to reduce the number of required calculations and save time.  Any envelopes  that  pass  CYK
       are scored with the local CM Inside algorithm, again using HMM bands for acceleration.

       The  default  filter  thresholds that define the minimum score required for a subsequence to survive each
       stage are defined based on the size of the search space (Z), which  is  defined  as  the  length  of  the
       current  query  sequence  times 2 (because both strands will be searched) times the number of profiles in
       <cmdb>.  However, if either the -Z <x> or --FZ <x> options  are  used  then  the  search  space  will  be
       considered to be <x> for purposes of defining the filter thresholds.

       For  larger databases, the filters are more strict leading to more acceleration but potentially a greater
       loss of sensitivity. The rationale is that for larger databases, hits must have higher scores to  achieve
       statistical  significance,  so  stricter  filtering  that  removes  lower  scoring  insignificant hits is
       acceptable.

       The P-value thresholds for all possible search space sizes and all filter stages are listed next.  (A  P-
       value  threshold  of  0.01  means  that  roughly  1% of the highest scoring nonhomologous subsequence are
       expected to pass the filter.) Z is defined as the number of nucleotides in the complete  target  sequence
       file times 2 because both strands will be searched with each model.

       If Z is less than 2 Mb: F1 is 0.35; F2 and F2b are off; F3, F3b, F4, F4b and F5 are 0.02; F6 is 0.0001.

       If  Z is between 2 Mb and 20 Mb: F1 is 0.35; F2 and F2b are off; F3, F3b, F4, F4b and F5 are 0.005; F6 is
       0.0001.

       If Z is between 20 Mb and 200 Mb: F1 is 0.35; F2 and F2b are 0.15; F3, F3b, F4, F4b and F5 are 0.003;  F6
       is 0.0001.

       If  Z  is  between  200  Mb  and 2 Gb: F1 is 0.15; F2 and F2b are 0.15; F3, F3b, F4, F4b, F5, and F5b are
       0.0008; and F6 is 0.0001.

       If Z is between 2 Gb and 20 Gb: F1 is 0.15; F2 and F2b are 0.15; F3,  F3b,  F4,  F4b,  F5,  and  F5b  are
       0.0002; and F6 is 0.0001.

       If  Z  is more than 20 Gb: F1 is 0.06; F2 and F2b are 0.02; F3, F3b, F4, F4b, F5, and F5b are 0.0002; and
       F6 is 0.0001.

       These thresholds were chosen based on  performance  on  an  internal  benchmark  testing  many  different
       possible settings.

       There  are  five  options  for  controlling the general filtering level. These options are, in order from
       least strict (slowest but most sensitive) to most strict (fastest but least sensitive):  --max,  --nohmm,
       --mid,  --default,  (this  is  the  default  setting)  --rfam.  and --hmmonly.  With --default the filter
       thresholds will be database-size dependent. See the explanation of each of these individual options below
       for more information.

       Additionally,  an  expert  user  can  precisely  control each filter stage score threshold with the --F1,
       --F1b, --F2, --F2b, --F3, --F3b, --F4, --F4b, --F5, --F5b, and --F6 options. As well as turn  each  stage
       on  or  off  with  the  --noF1,  --doF1b,  --noF2, --noF2b, --noF3, --noF3b, --noF4, --noF4b, --noF5, and
       --noF6.  options.  These options are only displayed if the --devhelp option is used to keep the number of
       displayed options with -h reasonable, and because they are only expected to be useful to a small minority
       of users.

       As a special case, for any models in <cmfile> which have zero basepairs, profile  HMM  searches  are  run
       instead  of  CM  searches.  HMM  algorithms  are more efficient than CM algorithms, and the benefit of CM
       algorithms is lost for models with no secondary structure (zero basepairs). These  profile  HMM  searches
       will  run  significantly  faster than the CM searches. You can force HMM-only searches with the --hmmonly
       option. For more information on HMM-only searches see the user guide.

       --max  Turn off all filters, and run  non-banded  Inside  on  every  full-length  target  sequence.  This
              increases sensitivity somewhat, at an extremely large cost in speed.

       --nohmm
              Turn  off all HMM filter stages (F1 through F5b). The CYK filter, using QDBs, will be run on every
              full-length target sequence and will enforce a P-value threshold of 0.0001. Each subsequence  that
              survives CYK will be passed to Inside, which will also use QDBs (but a looser set). This increases
              sensitivity somewhat, at a very large cost in speed.

       --mid  Turn off the HMM SSV and Viterbi filter  stages  (F1  through  F2b).   Set  remaining  HMM  filter
              thresholds  (F3  through  F5b) to 0.02 by default, but changeable to <x> with --Fmid <x> sequence.
              This may increase sensitivity, at a significant cost in speed.

       --default
              Use the default filtering strategy. This option is  on  by  default.  The  filter  thresholds  are
              determined based on the database size.

       --rfam Use  a  strict  filtering  strategy  devised  for  large  databases  (more  than 20 Gb). This will
              accelerate the search at a potential cost to sensitivity.

       --hmmonly
              Only use the filter profile HMM for searches, do not use the CM.  Only filter stages F1 through F3
              will  be executed, using strict P-value thresholds (0.02 for F1, 0.001 for F2 and 0.00001 for F3).
              Additionally a bias  composition  filter  is  used  after  the  F1  stage  (with  P=0.02  survival
              threshold).   Any  hit  that  survives  all  stages  and has an HMM E-value or bit score above the
              reporting threshold will be output.  The user  can  change  the  HMM-only  filter  thresholds  and
              options  with  --hmmF1,  --hmmF2,  --hmmF3,  --hmmnobias, --hmmnonull2, and --hmmmax.  By default,
              searches for any model with zero basepairs will be run in HMM-only mode. This can be  turned  off,
              forcing CM searches for these models with the --nohmmonly option.

       --FZ <x>
              Set  filter  thresholds as the defaults used if the database were <x> megabases (Mb). If used with
              <x> greater than 20000 (20 Gb) this option has the same effect as --rfam.

       --Fmid <x>
              With the --mid option set the HMM filter thresholds (F3 through F5b) to <x>.  By default,  <x>  is
              0.02.

OTHER OPTIONS

       --notrunc
              Turn off truncated hit detection.

       --anytrunc
              Allow  truncated  hits  to  begin  and  end  at  any position in a target sequence. By default, 5'
              truncated hits must include the first residue of their target sequence and 3' truncated hits  must
              include  the  final  residue of their target sequence. With this option you may observe fewer full
              length hits that extend to the beginning and end of the query CM.

       --nonull3
              Turn off the null3 CM score corrections for biased composition. This correction is not used during
              the HMM filter stages.

       --mxsize <x>
              Set  the  maximum  allowable  CM  DP matrix size to <x> megabytes. By default this size is 128 Mb.
              This should be large enough for the vast majority of searches, especially with smaller models.  If
              cmsearch  encounters  an  envelope  in  the CYK or Inside stage that requires a larger matrix, the
              envelope will be discounted from consideration. This behavior is like an  additional  filter  that
              prevents  expensive  (slow) CM DP calculations, but at a potential cost to sensitivity.  Note that
              if cmsearch is being run in <n> multiple threads on a multicore machine then each thread may  have
              an allocated matrix of up to size <x> Mb at any given time.

       --smxsize <x>
              Set  the  maximum allowable CM search DP matrix size to <x> megabytes. By default this size is 128
              Mb.  This option is only relevant if the CM will not use HMM banded matrices, i.e. if  the  --max,
              --nohmm,  --qdb, --fqdb, --nonbanded, or --fnonbanded options are also used. Note that if cmsearch
              is being run in <n> multiple threads on a multicore machine then each thread may have an allocated
              matrix of up to size <x> Mb at any given time.

       --cyk  Use the CYK algorithm, not Inside, to determine the final score of all hits.

       --acyk Use  the  CYK algorithm to align hits. By default, the Durbin/Holmes optimal accuracy algorithm is
              used, which finds the alignment that maximizes the expected accuracy of all aligned residues.

       --wcx <x>
              For each CM, set the W parameter, the expected maximum length of a hit, to <x> times the consensus
              length of the model. By default, the W parameter is read from the CM file and was calculated based
              on the transition probabilities of the model by cmbuild.  You can find out what the default  W  is
              for  a  model  using  cmstat.  This option should be used with caution as it impacts the filtering
              pipeline at several different stages in nonobvious ways. It is only recommended for  expert  users
              searching  for  hits  that  are  much  longer  than any of the homologs used to build the model in
              cmbuild, e.g. ones with large introns or other large insertions.  It cannot be used in combination
              with  the  --nohmm, --fqdb or --qdb options because in those cases W is limited by query-dependent
              bands.

       --toponly
              Only search the top (Watson) strand of target sequences in <seqfile>.  By  default,  both  strands
              are searched. This will halve the search space size (Z).

       --bottomonly
              Only  search the bottom (Crick) strand of target sequences in <seqfile>.  By default, both strands
              are searched. This will halve the search space size (Z).

       --qformat <s>
              Assert that the query sequence database file is in format <s>.  Accepted  formats  include  fasta,
              embl,  genbank,  ddbj, stockholm, pfam, a2m, afa, clustal, and phylip The default is to autodetect
              the format of the file.

       --glist <f>
              Configure a subset of models from <cmfile> in glocal alignment mode, instead of local mode, namely
              the  models  listed  in  file  <f>.  Configure all other models (those not listed in <f>) in local
              mode.  This option is incompatible with -g.  File  <f>  must  list  valid  names  of  models  from
              <cmfile>, each separated by any whitespace character (e.g. a newline character).

       --cpu <n>
              Set the number of parallel worker threads to <n>.  By default, Infernal sets this to the number of
              CPU cores it detects in your machine - that is, it tries to maximize the  use  of  your  available
              processor  cores. Setting <n> higher than the number of available cores is of little if any value,
              but you may want to set it to something less. You can also  control  this  number  by  setting  an
              environment  variable, INFERNAL_NCPU.  This option is only available if Infernal was compiled with
              POSIX threads support. This is the default, but it may have been turned off  at  compile-time  for
              your site or machine for some reason.

       --stall
              For  debugging the MPI master/worker version: pause after start, to enable the developer to attach
              debuggers to the running master and worker(s) processes. Send SIGCONT signal to release the pause.
              (Under  gdb: (gdb) signal SIGCONT) (Only available if optional MPI support was enabled at compile-
              time.)

       --mpi  Run in MPI master/worker mode, using mpirun.  (Only available if optional MPI support was  enabled
              at compile-time.)

SEE ALSO

       See  infernal(1)  for  a  master man page with a list of all the individual man pages for programs in the
       Infernal package.

       For complete documentation, see the user guide that came with your Infernal distribution (Userguide.pdf);
       or see the Infernal web page ().

       Copyright (C) 2014 Howard Hughes Medical Institute.
       Freely distributed under the GNU General Public License (GPLv3).

       For  additional  information  on  copyright and licensing, see the file called COPYRIGHT in your Infernal
       source distribution, or see the Infernal web page ().

AUTHOR

       The Eddy/Rivas Laboratory
       Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147 USA
       http://eddylab.org