Provided by: infernal_1.1.2-1_amd64

NAME

       cmscan - search sequence(s) against a covariance model database

SYNOPSIS

       cmscan [options] <cmdb> <seqfile>

DESCRIPTION

       cmscan  is  used  to  search  sequences  against  collections of covariance models.  For each sequence in
       <seqfile>, use that query sequence to search the target database of CMs  in  <cmdb>,  and  output  ranked
       lists of the CMs with the most significant matches to the sequence.

       The <seqfile> may contain more than one query sequence. It can be in FASTA format, or several other
       common sequence file formats (genbank, embl, among others), or in alignment file formats (stockholm,
       aligned fasta, and others). See the --qformat option for a complete list.

       The  <cmdb>  needs to be press'ed using cmpress before it can be searched with cmscan.  This creates four
       binary files, suffixed .i1{fimp}.  Additionally, <cmdb> must  have  been  calibrated  for  E-values  with
       cmcalibrate before being press'ed with cmpress.

       The  query  <seqfile>  may  be  '-' (a dash character), in which case the query sequences are read from a
       <stdin> pipe instead of from a file.  The <cmdb> cannot be read from a <stdin> stream, because  it  needs
       to have those four auxiliary binary files generated by cmpress.
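
       As a minimal sketch of a pre-flight check, the following Python snippet reports which of the four
       cmpress index files are missing for a given <cmdb>. It assumes that ".i1{fimp}" expands to the suffixes
       .i1f, .i1i, .i1m, and .i1p appended to the <cmdb> path. The function name is hypothetical and is not
       part of Infernal.

```python
import os
import tempfile

# Assumed expansion of the ".i1{fimp}" shorthand used in this page.
PRESS_SUFFIXES = (".i1f", ".i1i", ".i1m", ".i1p")


def missing_press_files(cmdb_path):
    """Return the expected cmpress index files that do not exist on disk."""
    expected = [cmdb_path + suffix for suffix in PRESS_SUFFIXES]
    return [path for path in expected if not os.path.isfile(path)]


# Demo on a throwaway directory: only three of the four files are created.
with tempfile.TemporaryDirectory() as tmp:
    cmdb = os.path.join(tmp, "demo.cm")  # hypothetical database name
    for suffix in PRESS_SUFFIXES[:3]:
        open(cmdb + suffix, "w").close()
    for path in missing_press_files(cmdb):
        print("missing:", os.path.basename(path))
```

       An empty return value only shows that the files exist; it does not show that <cmdb> was calibrated
       with cmcalibrate before being pressed.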

       The  output  format  is  designed  to  be  human-readable,  but is often so voluminous that reading it is
       impractical, and parsing it is a pain. The --tblout option saves output in a simple tabular  format  that
       is  concise  and  easier to parse. The --fmt 2 option modifies the format of the tabular output by adding
       several fields, including markup of overlapping hits, as described in section  6  of  the  Infernal  user
       guide.  The -o option allows redirecting the main output, including throwing it away in /dev/null.
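
       A minimal sketch of reading a --tblout file in Python follows. This page only states that the file is
       space-delimited with one data line per hit, so two things here are assumptions to check against section
       6 of the user guide: that comment lines start with '#', and that the last column is free text preceded
       by n_fixed whitespace-delimited columns. The function name and the sample line are hypothetical.

```python
def read_tblout(lines, n_fixed):
    """Yield one list of fields per data line of a --tblout file.

    lines   -- iterable of text lines
    n_fixed -- number of whitespace-delimited columns that precede the
               trailing free-text column (look this up in the user guide)
    """
    for line in lines:
        line = line.rstrip()
        if not line or line.startswith("#"):  # assumed comment marker
            continue
        # maxsplit keeps a trailing free-text column in one piece,
        # even when it contains spaces.
        yield line.split(None, n_fixed)


# Synthetic three-column example, not real cmscan output.
sample = [
    "# model  sequence  score  description",
    "modelA   seq1      12.5   hypothetical RNA family",
]
for fields in read_tblout(sample, n_fixed=3):
    print(fields)
```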

       cmscan reexamines the 5' and 3' termini of target sequences using specialized algorithms for detection of
       truncated hits, in which part of the 5' and/or 3' end of the actual full length  homologous  sequence  is
       missing in the target sequence file. These types of hits will be most common in sequence files consisting
       of unassembled sequencing reads. By default, any 5' truncated  hit  is  required  to  include  the  first
       residue  of  the  target  sequence  it derives from in <seqfile>, and any 3' truncated hit is required to
       include the final residue of the target sequence it derives from. Any hit that is truncated at both the
       5' and 3' ends must include both the first and final residues of the target sequence it derives from.
       The --anytrunc option relaxes these endpoint requirements, allowing truncated hits to start and stop at
       any position of a target sequence. Importantly though, with --anytrunc, hit E-values will be
       less accurate because model calibration does not consider the possibility of truncated hits,  so  use  it
       with  caution.   The  --notrunc  option  can be used to turn off truncated hit detection.  --notrunc will
       reduce the running time of cmscan, most significantly for target <seqfile> files that include many  short
       sequences.   Truncated  hit  detection  is  automatically  turned  off when the --max, --nohmm, --qdb, or
       --nonbanded options are used because it relies on the use of an accelerated HMM banded alignment strategy
       that is turned off by any of those options.
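
       The endpoint rules for truncated hits described above can be sketched as a small Python predicate.
       This is an illustration only, not Infernal's implementation. The function name and the truncation
       labels are hypothetical, and positions are assumed to be 1-based with start <= end.

```python
def truncated_hit_allowed(trunc, start, end, seq_len, anytrunc=False):
    """Check the endpoint rule for one truncated hit.

    trunc   -- "5'", "3'", or "5'&3'": which end(s) of the hit are truncated
    start   -- first position of the sequence covered by the hit (1-based)
    end     -- last position of the sequence covered by the hit
    seq_len -- length of the sequence the hit derives from
    """
    if trunc not in ("5'", "3'", "5'&3'"):
        raise ValueError(f"unknown truncation type: {trunc!r}")
    if anytrunc:
        # With --anytrunc, truncated hits may start and stop anywhere.
        return True
    needs_first = trunc in ("5'", "5'&3'")
    needs_final = trunc in ("3'", "5'&3'")
    if needs_first and start != 1:
        return False
    if needs_final and end != seq_len:
        return False
    return True


# A 5' truncated hit that starts at position 5 of a 200-residue read.
print(truncated_hit_allowed("5'", 5, 80, 200))
print(truncated_hit_allowed("5'", 5, 80, 200, anytrunc=True))
```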

OPTIONS

       -h     Help; print a brief reminder of command line usage and all available options.

       -g     Turn  on  the  glocal  alignment  algorithm, global with respect to the query model and local with
              respect to the target database. By default, the local alignment algorithm is used, which is local
              with respect to both the target sequence and the model. In local mode, the alignment is allowed to
              span two or more subsequences if necessary (e.g. if the structures of the query model and target sequence
              are only partially shared), allowing certain large insertions and deletions in the structure to be
              penalized differently than normal indels. Local mode performs better on empirical  benchmarks  and
              is significantly more sensitive for remote homology detection. Empirically, glocal searches return
              many fewer hits than local searches, so glocal may be desired for some applications.

       -Z <x> Calculate E-values as if the search space size was <x> megabases (Mb). Without this
              option, the search space size changes for each query sequence: it is defined as the length of the
              current query sequence times 2 (because both strands of the sequence will be searched) times the
              number of CMs in <cmdb>.

       --devhelp
              Print  help,  as with -h , but also include expert options that are not displayed with -h .  These
              expert options are not expected to be relevant for the vast majority  of  users  and  so  are  not
              described  in the manual page.  The only resources for understanding what they actually do are the
              brief one-line descriptions output when --devhelp is enabled, and the source code.

OPTIONS FOR CONTROLLING OUTPUT

       -o <f> Direct the main human-readable output to a file <f> instead of the default stdout.

       --tblout <f>
              Save a simple tabular (space-delimited) file summarizing the hits found, with one  data  line  per
              hit.  The format of this file is described in section 6 of the Infernal user guide.

       --fmt <n>
              Specify that the tabular output file given with --tblout <f> be written in format <n>.
              Possible values for <n> are 1 or 2. By default <n> is 1 when --tblout is used without --fmt. With
              --fmt 2, nine additional fields are added to the tabular output file, most of which pertain to the
              annotation of overlapping hits. See section 6 of the Infernal user guide for a description of both
              formats.

       --acc  Use accessions instead of names in the main output, where available for profiles and/or sequences.

       --noali
              Omit the alignment section from the main output. This can greatly reduce the output volume.

       --notextw
              Unlimit  the  length of each line in the main output. The default is a limit of 120 characters per
              line, which helps in displaying the output cleanly on terminals and in editors, but  can  truncate
              target profile description lines.

       --textw <n>
              Set the main output's line length limit to <n> characters per line. The default is 120.

       --verbose
              Include  extra search pipeline statistics in the main output, including filter survival statistics
              for truncated hit detection and number of envelopes discarded due to matrix size overflows.

OPTIONS CONTROLLING REPORTING THRESHOLDS

       Reporting thresholds control which hits are reported in output files (the main output and --tblout). Hits
       are ranked by statistical significance (E-value). By default, all hits with an E-value <= 10 are
       reported.  The following options allow you to change the default E-value reporting thresholds, or to  use
       bit score thresholds instead.

       -E <x> In the per-target output, report target sequences with an E-value of <= <x>.  The default is 10.0,
              meaning that on average, about 10 false positives will be reported per query, so you can  see  the
              top of the noise and decide for yourself if it's really noise.

       -T <x> Instead  of  thresholding per-CM output on E-value, report target sequences with a bit score of >=
              <x>.

OPTIONS FOR INCLUSION THRESHOLDS

       Inclusion thresholds are stricter than reporting thresholds.  Inclusion thresholds control which hits are
       considered  to  be  reliable  enough  to  be included in a possible subsequent search round, or marked as
       significant ("!") as opposed to questionable ("?") in hit output.

       --incE <x>
              Use an E-value of <= <x> as the hit inclusion threshold.  The default is  0.01,  meaning  that  on
              average,  about  1  false  positive  would  be expected in every 100 searches with different query
              sequences.

       --incT <x>
              Instead of using E-values for setting the inclusion threshold, use a bit score of >= <x>
              as the hit inclusion threshold. By default this option is unset.
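
       How the default E-value reporting and inclusion thresholds combine can be sketched in Python as
       follows. The function name is hypothetical; the default values are those given for -E and --incE
       above, and the returned marks are the "!" and "?" symbols described for hit output.

```python
def classify_hit(evalue, report_e=10.0, inc_e=0.01):
    """Return None if the hit is not reported, '?' if it is reported but
    not included, and '!' if it also meets the inclusion threshold."""
    if evalue > report_e:
        return None  # above the reporting threshold: not shown at all
    return "!" if evalue <= inc_e else "?"


for e in (1e-5, 0.5, 50.0):
    print(e, classify_hit(e))
```

       The sketch covers E-value thresholds only. With -T or --incT the same comparison would be made on
       bit scores, using >= instead of <=.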

OPTIONS FOR MODEL-SPECIFIC SCORE THRESHOLDING

       Curated  CM  databases may define specific bit score thresholds for each CM, superseding any thresholding
       based on statistical significance alone.

       To use these options, the profile must contain  the  appropriate  (GA,  TC,  and/or  NC)  optional  score
       threshold  annotation;  this  is  picked  up  by  cmbuild  from  Stockholm  format  alignment files. Each
       thresholding option has a score of <x> bits,  and  acts  as  if  -T  <x>  --incT  <x>  has  been  applied
       specifically using each model's curated thresholds.

       --cut_ga
              Use  the  GA (gathering) bit scores in the model to set hit reporting and inclusion thresholds. GA
              thresholds are generally  considered  to  be  the  reliable  curated  thresholds  defining  family
              membership;  for  example,  in  Rfam,  these  thresholds  define  what  gets included in Rfam Full
              alignments based on searches with Rfam Seed models.

       --cut_nc
              Use the NC (noise cutoff) bit score thresholds in the model to set  hit  reporting  and  inclusion
              thresholds.  NC  thresholds  are generally considered to be the score of the highest-scoring known
              false positive.

       --cut_tc
              Use the TC (trusted cutoff) bit score thresholds in the model to set hit reporting  and  inclusion
              thresholds.  TC  thresholds  are  generally considered to be the score of the lowest-scoring known
              true positive that is above all known false positives.
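
       Model-specific thresholding, as described above, amounts to comparing each hit's bit score against a
       cutoff looked up per model. The Python sketch below illustrates the idea on hypothetical data: the
       function name, model names, scores, and cutoff values are all invented for the example and do not
       come from any real CM database.

```python
def apply_model_cutoffs(hits, cutoffs, which="GA"):
    """Keep hits whose bit score meets the chosen per-model cutoff.

    hits    -- list of (model_name, bit_score) pairs
    cutoffs -- dict mapping model_name to a dict of its annotated
               cutoffs, e.g. {"GA": 40.0, "TC": 45.2, "NC": 38.9}
    which   -- "GA", "TC", or "NC"
    """
    if which not in ("GA", "TC", "NC"):
        raise ValueError(f"unknown cutoff type: {which!r}")
    kept = []
    for model, score in hits:
        if which not in cutoffs.get(model, {}):
            # The option can only be used if the annotation is present.
            raise KeyError(f"model {model!r} has no {which} annotation")
        if score >= cutoffs[model][which]:
            kept.append((model, score))
    return kept


# Hypothetical values for illustration only.
cutoffs = {"famA": {"GA": 40.0}, "famB": {"GA": 55.0}}
hits = [("famA", 42.3), ("famB", 42.3)]
print(apply_model_cutoffs(hits, cutoffs))
```

       The same score passes for one model and fails for the other, which is the point of curated per-model
       thresholds.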

OPTIONS CONTROLLING THE ACCELERATION PIPELINE

       Infernal searches are accelerated in a six-stage filter pipeline. The first five stages use a profile HMM
       to  define  envelopes  that  are  passed  to  the stage six CM CYK filter. Any envelopes that survive all
       filters are assigned final scores using the CM Inside algorithm.

       The profile HMM filter is built by the cmbuild program and is stored in <cmfile>.

       Each successive filter is slower than the previous one, but better than it at discriminating between
       subsequences  that  may  contain  high-scoring  CM hits and those that do not. The first three HMM filter
       stages are the same as those used in HMMER3.  Stage 1 (F1) is the local HMM SSV filter modified for  long
       sequences.  Stage  2  (F2) is the local HMM Viterbi filter. Stage 3 (F3) is the local HMM Forward filter.
       Each of the first three stages uses the profile HMM in local mode, which allows a target  subsequence  to
       align  to any region of the HMM. Stage 4 (F4) is a glocal HMM filter, which requires a target subsequence
       to align to the full-length profile HMM. Stage 5 (F5) is the glocal HMM envelope definition filter, which
       uses HMMER3's domain identification heuristics to define envelope boundaries. After each stage from 2 to
       5 a bias filter step (F2b, F3b, F4b, and F5b) is used to remove sequences that appear to have passed the
       filter due to biased composition alone. Any envelopes that survive stages F1 through F5b are then passed
       to the local CM CYK filter. The CYK filter uses constraints (bands) derived from an HMM alignment of
       the  envelope  to  reduce the number of required calculations and save time.  Any envelopes that pass CYK
       are scored with the local CM Inside algorithm, again using HMM bands for acceleration.

       The default filter thresholds that define the minimum score required for a subsequence  to  survive  each
       stage  are  defined  based  on  the  size  of the search space (Z), which is defined as the length of the
       current query sequence times 2 (because both strands will be searched) times the number  of  profiles  in
       <cmdb>.   However,  if  either  the  -Z  <x>  or  --FZ <x> options are used then the search space will be
       considered to be <x> for purposes of defining the filter thresholds.
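
       The search space definition above is simple arithmetic, sketched here in Python. The function name is
       hypothetical, and one megabase is assumed to mean 1,000,000 nucleotides.

```python
def search_space_mb(query_len, n_models, z_override_mb=None):
    """Search space size Z in megabases.

    query_len     -- length of the current query sequence, in nucleotides
    n_models      -- number of profiles in <cmdb>
    z_override_mb -- value given with -Z <x> or --FZ <x>, if any
    """
    if z_override_mb is not None:
        return float(z_override_mb)
    # Times 2 because both strands are searched.
    return query_len * 2 * n_models / 1_000_000


# A 1000 nt query against a database of 500 models.
print(search_space_mb(1000, 500))
```

       Because Z depends on the query length, two queries scanned against the same <cmdb> can fall into
       different default filter threshold tiers unless -Z or --FZ is used.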

       For larger databases, the filters are more strict leading to more acceleration but potentially a  greater
       loss  of sensitivity. The rationale is that for larger databases, hits must have higher scores to achieve
       statistical significance, so  stricter  filtering  that  removes  lower  scoring  insignificant  hits  is
       acceptable.

       The P-value thresholds for all possible search space sizes and all filter stages are listed next. (A
       P-value threshold of 0.01 means that roughly 1% of the highest scoring nonhomologous subsequences are
       expected to pass the filter.) Z is the search space size as defined above: the length of the current
       query sequence times 2 (because both strands will be searched) times the number of models in <cmdb>.

       If Z is less than 2 Mb: F1 is 0.35; F2 and F2b are off; F3, F3b, F4, F4b and F5 are 0.02; F6 is 0.0001.

       If Z is between 2 Mb and 20 Mb: F1 is 0.35; F2 and F2b are off; F3, F3b, F4, F4b and F5 are 0.005; F6  is
       0.0001.

       If  Z is between 20 Mb and 200 Mb: F1 is 0.35; F2 and F2b are 0.15; F3, F3b, F4, F4b and F5 are 0.003; F6
       is 0.0001.

       If Z is between 200 Mb and 2 Gb: F1 is 0.15; F2 and F2b are 0.15; F3, F3b,  F4,  F4b,  F5,  and  F5b  are
       0.0008; and F6 is 0.0001.

       If  Z  is  between  2  Gb  and  20 Gb: F1 is 0.15; F2 and F2b are 0.15; F3, F3b, F4, F4b, F5, and F5b are
       0.0002; and F6 is 0.0001.

       If Z is more than 20 Gb: F1 is 0.06; F2 and F2b are 0.02; F3, F3b, F4, F4b, F5, and F5b are  0.0002;  and
       F6 is 0.0001.

       These  thresholds  were  chosen  based  on  performance  on  an internal benchmark testing many different
       possible settings.
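
       The tiers listed above can be encoded as a lookup table, as in the following Python sketch. It copies
       only the values stated on this page: F5b is not listed for the three smallest tiers, so it is left out
       of the result there rather than guessed. How a value of Z exactly on a boundary (for example exactly
       2 Mb) is assigned is an assumption of this sketch. The function name is hypothetical.

```python
import math

# (upper bound in Mb, F1, F2/F2b, F3..F5, whether F5b is listed on this page)
TIERS = [
    (2.0,      0.35, None, 0.02,   False),
    (20.0,     0.35, None, 0.005,  False),
    (200.0,    0.35, 0.15, 0.003,  False),
    (2000.0,   0.15, 0.15, 0.0008, True),
    (20000.0,  0.15, 0.15, 0.0002, True),
    (math.inf, 0.06, 0.02, 0.0002, True),
]


def default_filter_thresholds(z_mb):
    """Return the listed P-value thresholds for a search space of z_mb Mb.

    A value of None means the stage is off.
    """
    if z_mb < 0:
        raise ValueError("search space size must be non-negative")
    for upper, f1, f2, f3_to_f5, lists_f5b in TIERS:
        if z_mb < upper:  # assumption: upper bounds are exclusive
            result = {"F1": f1, "F2": f2, "F2b": f2}
            for stage in ("F3", "F3b", "F4", "F4b", "F5"):
                result[stage] = f3_to_f5
            if lists_f5b:
                result["F5b"] = f3_to_f5
            result["F6"] = 0.0001
            return result
    raise ValueError("search space size must be a finite number")


print(default_filter_thresholds(1.0))
```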

       There are six options for controlling the general filtering level. These options are, in order from
       least strict (slowest but most sensitive) to most strict (fastest but least sensitive): --max, --nohmm,
       --mid, --default (this is the default setting), --rfam, and --hmmonly. With --default the filter
       thresholds will be database-size dependent. See the explanation of each of these individual options below
       for more information.

       Additionally, an expert user can precisely control each filter stage score threshold with the --F1,
       --F1b, --F2, --F2b, --F3, --F3b, --F4, --F4b, --F5, --F5b, and --F6 options, and can turn each stage
       on or off with the --noF1, --doF1b, --noF2, --noF2b, --noF3, --noF3b, --noF4, --noF4b, --noF5, and
       --noF6 options. These options are only displayed if the --devhelp option is used, both to keep the
       number of options displayed with -h reasonable and because they are only expected to be useful to a
       small minority of users.

       As  a  special  case,  for any models in <cmfile> which have zero basepairs, profile HMM searches are run
       instead of CM searches. HMM algorithms are more efficient than CM  algorithms,  and  the  benefit  of  CM
       algorithms  is  lost  for models with no secondary structure (zero basepairs). These profile HMM searches
       will run significantly faster than the CM searches. You can force HMM-only searches  with  the  --hmmonly
       option. For more information on HMM-only searches see the user guide.

       --max  Turn  off  all  filters,  and  run  non-banded  Inside  on every full-length target sequence. This
              increases sensitivity somewhat, at an extremely large cost in speed.

       --nohmm
              Turn off all HMM filter stages (F1 through F5b). The CYK filter, using QDBs, will be run on  every
              full-length  target sequence and will enforce a P-value threshold of 0.0001. Each subsequence that
              survives CYK will be passed to Inside, which will also use QDBs (but a looser set). This increases
              sensitivity somewhat, at a very large cost in speed.

       --mid  Turn off the HMM SSV and Viterbi filter stages (F1 through F2b). Set remaining HMM filter
              thresholds (F3 through F5b) to 0.02 by default, but changeable to <x> with --Fmid <x>.
              This may increase sensitivity, at a significant cost in speed.

       --default
              Use  the  default  filtering  strategy.  This  option  is on by default. The filter thresholds are
              determined based on the database size.

       --rfam Use a strict filtering strategy  devised  for  large  databases  (more  than  20  Gb).  This  will
              accelerate the search at a potential cost to sensitivity.

       --hmmonly
              Only use the filter profile HMM for searches, do not use the CM.  Only filter stages F1 through F3
              will be executed, using strict P-value thresholds (0.02 for F1, 0.001 for F2 and 0.00001 for  F3).
              Additionally  a  bias  composition  filter  is  used  after  the  F1  stage  (with P=0.02 survival
              threshold). Any hit that survives all stages and has an HMM E-value or bit score that satisfies the
              reporting threshold will be output. The user can change the HMM-only filter thresholds and
              options with --hmmF1, --hmmF2, --hmmF3, --hmmnobias,  --hmmnonull2,  and  --hmmmax.   By  default,
              searches  for  any model with zero basepairs will be run in HMM-only mode. This can be turned off,
              forcing CM searches for these models with the --nohmmonly option.

       --FZ <x>
              Set filter thresholds as the defaults used if the database were <x> megabases (Mb). If  used  with
              <x> greater than 20000 (20 Gb) this option has the same effect as --rfam.

       --Fmid <x>
              With the --mid option, set the HMM filter thresholds (F3 through F5b) to <x>. By default, <x> is
              0.02.

OTHER OPTIONS

       --notrunc
              Turn off truncated hit detection.

       --anytrunc
              Allow truncated hits to begin and end at any  position  in  a  target  sequence.  By  default,  5'
              truncated  hits must include the first residue of their target sequence and 3' truncated hits must
              include the final residue of their target sequence. With this option you may  observe  fewer  full
              length hits that extend to the beginning and end of the query CM.

       --nonull3
              Turn off the null3 CM score corrections for biased composition. This correction is not used during
              the HMM filter stages.

       --mxsize <x>
              Set the maximum allowable CM DP matrix size to <x> megabytes. By default  this  size  is  128  Mb.
              This should be large enough for the vast majority of searches, especially with smaller models.  If
              cmscan encounters an envelope in the CYK or Inside  stage  that  requires  a  larger  matrix,  the
              envelope  will  be  discounted from consideration. This behavior is like an additional filter that
              prevents expensive (slow) CM DP calculations, but at a potential cost to sensitivity.   Note  that
              if cmscan is being run in <n> multiple threads on a multicore machine then each thread may have an
              allocated matrix of up to size <x> Mb at any given time.

       --smxsize <x>
              Set the maximum allowable CM search DP matrix size to <x> megabytes. By default this size  is  128
              Mb.   This  option is only relevant if the CM will not use HMM banded matrices, i.e. if the --max,
              --nohmm, --qdb, --fqdb, --nonbanded, or --fnonbanded options are also used. Note that if cmscan
              is being run in <n> multiple threads on a multicore machine then each thread may have an allocated
              matrix of up to size <x> Mb at any given time.

       --cyk  Use the CYK algorithm, not Inside, to determine the final score of all hits.

       --acyk Use the CYK algorithm to align hits. By default, the Durbin/Holmes optimal accuracy  algorithm  is
              used, which finds the alignment that maximizes the expected accuracy of all aligned residues.

       --wcx <x>
              For each CM, set the W parameter, the expected maximum length of a hit, to <x> times the consensus
              length of the model. By default, the W parameter is read from the CM file and was calculated based
              on  the  transition probabilities of the model by cmbuild.  You can find out what the default W is
              for a model using cmstat.  This option should be used with caution as  it  impacts  the  filtering
              pipeline  at  several different stages in nonobvious ways. It is only recommended for expert users
              searching for hits that are much longer than any of the  homologs  used  to  build  the  model  in
              cmbuild, e.g. ones with large introns or other large insertions.  It cannot be used in combination
              with the --nohmm, --fqdb or --qdb options because in those cases W is limited  by  query-dependent
              bands.

       --toponly
              Only  search  the  top (Watson) strand of target sequences in <seqfile>.  By default, both strands
              are searched. This will halve the search space size (Z).

       --bottomonly
              Only search the bottom (Crick) strand of target sequences in <seqfile>.  By default, both  strands
              are searched. This will halve the search space size (Z).

       --qformat <s>
              Assert  that  the  query sequence database file is in format <s>.  Accepted formats include fasta,
              embl, genbank, ddbj, stockholm, pfam, a2m, afa, clustal, and phylip. The default is to autodetect
              the format of the file.

       --glist <f>
              Configure a subset of models from <cmfile> in glocal alignment mode, instead of local mode, namely
              the models listed in file <f>.  Configure all other models (those not  listed  in  <f>)  in  local
              mode.   This  option  is  incompatible  with  -g.   File  <f> must list valid names of models from
              <cmfile>, each separated by any whitespace character (e.g. a newline character).

       --clanin <f>
              Read clan information on the models in <cmfile> from file <f>.  Not all models in <cmfile> need to
              be  a member of a clan.  This option must be used in combination with --fmt 2 and --tblout because
              clan annotation is only output in format 2 of the tabular output  file.   See  section  9  of  the
              Infernal user guide for specifications on the format of the clan input file <f>.

       --oclan
              Only  mark overlaps between models in the same clan.  This option must be used in combination with
              --fmt 2 , --tblout and --clanin because clan annotation is only output in format 2 of the  tabular
              output file, and clan information can only be input using the --clanin option.

       --oskip <f>
              Omit any hit h from the tabular output file that satisfies the following: another hit h2 overlaps
              with h and the E-value of h2 is lower than that of h. Hit h will not appear in the tabular output
              file, although it will still appear in the standard output. This option must be used in
              combination with --fmt 2 and --tblout because overlap annotation is only output in format 2 of the
              tabular output file.  When used in combination with --oclan only hits h that satisfy the following
              are omitted: another hit h2 overlaps with h, the E-value of h2 is lower than that of h, and both h
              and h2 are hits to models that are in the same clan.
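
       The omission rule described for --oskip, with and without --oclan, can be sketched in Python as
       follows. This is an illustration of the rule as worded on this page, not Infernal's implementation.
       The Hit type, the function names, and the sample hits are hypothetical, and the definition of
       "overlaps" (same sequence, same strand, at least one shared position) is an assumption.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    model: str
    seq: str
    start: int
    end: int
    strand: str
    evalue: float
    clan: Optional[str] = None


def overlaps(a, b):
    """Assumed definition: same sequence and strand, ranges intersect."""
    if a.seq != b.seq or a.strand != b.strand:
        return False
    a_lo, a_hi = sorted((a.start, a.end))
    b_lo, b_hi = sorted((b.start, b.end))
    return a_lo <= b_hi and b_lo <= a_hi


def skip_lower_scoring_overlaps(hits, oclan=False):
    """Drop each hit h for which an overlapping h2 has a lower E-value.

    With oclan=True, h is dropped only if h and h2 are in the same clan.
    """
    kept = []
    for h in hits:
        beaten = False
        for h2 in hits:
            if h2 is h or not overlaps(h, h2) or not h2.evalue < h.evalue:
                continue
            if oclan and (h.clan is None or h.clan != h2.clan):
                continue
            beaten = True
            break
        if not beaten:
            kept.append(h)
    return kept


hits = [
    Hit("A", "seq1", 1, 100, "+", 1e-10, "CL1"),
    Hit("B", "seq1", 50, 150, "+", 1e-5, "CL1"),
    Hit("C", "seq1", 60, 120, "+", 1e-3, "CL2"),
]
print([h.model for h in skip_lower_scoring_overlaps(hits)])
print([h.model for h in skip_lower_scoring_overlaps(hits, oclan=True)])
```

       Hits with equal E-values do not displace each other in this sketch, because the rule requires a
       strictly lower E-value.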

       --cpu <n>
              Set the number of parallel worker threads to <n>.  By default, Infernal sets this to the number of
              CPU cores it detects in your machine - that is, it tries to maximize the  use  of  your  available
              processor  cores. Setting <n> higher than the number of available cores is of little if any value,
              but you may want to set it to something less. You can also  control  this  number  by  setting  an
              environment  variable, INFERNAL_NCPU.  This option is only available if Infernal was compiled with
              POSIX threads support. This is the default, but it may have been turned off  at  compile-time  for
              your site or machine for some reason.

       --stall
              For  debugging the MPI master/worker version: pause after start, to enable the developer to attach
              debuggers to the running master and worker(s) processes. Send SIGCONT signal to release the pause.
              (Under  gdb: (gdb) signal SIGCONT) (Only available if optional MPI support was enabled at compile-
              time.)

       --mpi  Run in MPI master/worker mode, using mpirun.  (Only available if optional MPI support was  enabled
              at compile-time.)
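
       Several of the options above depend on each other: --clanin requires --fmt 2 and --tblout, and --oclan
       additionally requires --clanin. The Python sketch below assembles an argument list in the order given
       in the SYNOPSIS and checks those stated dependencies before anything is run. The function name and all
       file names are hypothetical, and the sketch does not execute cmscan.

```python
def build_cmscan_args(cmdb, seqfile, tblout=None, fmt=None,
                      clanin=None, oclan=False, cpu=None):
    """Return an argv list for cmscan, validating documented dependencies."""
    if fmt is not None and fmt not in (1, 2):
        raise ValueError("--fmt accepts 1 or 2")
    needs_fmt2 = {"--clanin": clanin is not None, "--oclan": oclan}
    for option, used in needs_fmt2.items():
        if used and (fmt != 2 or tblout is None):
            raise ValueError(f"{option} requires --fmt 2 and --tblout")
    if oclan and clanin is None:
        raise ValueError("--oclan requires --clanin")

    args = ["cmscan"]
    if cpu is not None:
        args += ["--cpu", str(cpu)]
    if tblout is not None:
        args += ["--tblout", tblout]
    if fmt is not None:
        args += ["--fmt", str(fmt)]
    if clanin is not None:
        args += ["--clanin", clanin]
    if oclan:
        args.append("--oclan")
    # Options come first, then <cmdb> and <seqfile>, as in the SYNOPSIS.
    return args + [cmdb, seqfile]


# Hypothetical file names.
print(" ".join(build_cmscan_args("models.cm", "reads.fa", tblout="hits.tbl",
                                 fmt=2, clanin="models.clanin", cpu=4)))
```

       The resulting list could be handed to a process launcher such as Python's subprocess.run, provided
       that <cmdb> has already been calibrated and pressed.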

SEE ALSO

       See  infernal(1)  for  a  master man page with a list of all the individual man pages for programs in the
       Infernal package.

       For complete documentation, see the user guide that came with your Infernal distribution (Userguide.pdf);
       or see the Infernal web page.

       Copyright (C) 2016 Howard Hughes Medical Institute.
       Freely distributed under a BSD open source license.

       For  additional  information  on  copyright and licensing, see the file called COPYRIGHT in your Infernal
       source distribution, or see the Infernal web page.

AUTHOR

       The Eddy/Rivas Laboratory
       Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147 USA
       http://eddylab.org