Provided by: infernal_1.1.4-1_amd64

NAME

       cmscan - search sequence(s) against a covariance model database

SYNOPSIS

       cmscan [options] <cmdb> <seqfile>

DESCRIPTION

       cmscan  is  used  to  search sequences against collections of covariance models.  For each
       sequence in <seqfile>, use that query sequence to search the target  database  of  CMs  in
       <cmdb>,  and  output  ranked  lists  of  the  CMs with the most significant matches to the
       sequence.

       The <seqfile> may contain more than one query sequence. It can  be  in  FASTA  format,  or
       several other common sequence file formats (genbank and embl, among others), or in
       alignment file formats (stockholm, aligned fasta, and others). See  the  --qformat  option
       for a complete list.

       The <cmdb> needs to be press'ed using cmpress before it can be searched with cmscan.  This
       creates four binary files,  suffixed  .i1{fimp}.   Additionally,  <cmdb>  must  have  been
       calibrated for E-values with cmcalibrate before being press'ed with cmpress.

       The  query  <seqfile> may be '-' (a dash character), in which case the query sequences are
       read from a <stdin> pipe instead of from a file.  The <cmdb> cannot be read from a <stdin>
       stream, because it needs to have those four auxiliary binary files generated by cmpress.
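
       As a minimal sketch of the requirement above, the following Python helper lists the four
       auxiliary files that cmpress is expected to leave next to <cmdb> and checks whether they
       are all present. It assumes the ".i1{fimp}" shorthand expands to the suffixes .i1f, .i1i,
       .i1m and .i1p appended to the full <cmdb> file name; the file name "models.cm" and the
       function names are hypothetical.

```python
from pathlib import Path

# Suffixes implied by the ".i1{fimp}" shorthand (assumed to be appended
# to the full <cmdb> file name).
PRESSED_SUFFIXES = (".i1f", ".i1i", ".i1m", ".i1p")


def pressed_files(cmdb):
    """Return the four auxiliary file paths expected next to <cmdb>."""
    return [Path(str(cmdb) + suffix) for suffix in PRESSED_SUFFIXES]


def is_pressed(cmdb):
    """Return True only if all four auxiliary files exist on disk."""
    return all(path.is_file() for path in pressed_files(cmdb))


# "models.cm" is a hypothetical database name used for illustration.
print([path.name for path in pressed_files("models.cm")])
```

       A check like this can be run before launching a long batch of searches, so that a missing
       cmpress step is caught early.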

       The  output  format  is  designed  to  be  human-readable, but is often so voluminous that
       reading it is impractical, and parsing it is a pain. The --tblout option saves output in a
       simple tabular format that is concise and easier to parse. The --fmt 2 option modifies the
       format of the tabular output by adding several fields,  including  markup  of  overlapping
       hits,  as  described  in  section  6  of  the  Infernal  user guide.  The -o option allows
       redirecting the main output, including throwing it away in /dev/null.
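
       The combination of output options described above can be sketched as a small Python
       helper that assembles a cmscan argument list without running it. Only options documented
       on this page are used; the file names ("models.cm", "queries.fa", "hits.tbl") and the
       helper name are hypothetical.

```python
import shlex


def build_cmscan_cmd(cmdb, seqfile, tblout, main_out="/dev/null"):
    """Assemble a cmscan command that keeps only the tabular output.

    The main human-readable output is sent to main_out (discarded by
    default) and the format 2 table is written to tblout.
    """
    return [
        "cmscan",
        "-o", main_out,       # redirect the main output
        "--tblout", tblout,   # save the tabular hit summary
        "--fmt", "2",         # add the extra format 2 fields
        cmdb,
        seqfile,
    ]


cmd = build_cmscan_cmd("models.cm", "queries.fa", "hits.tbl")
print(shlex.join(cmd))  # shell-quoted form, suitable for logging
```

       The list form can be handed to a process launcher such as subprocess.run, which avoids
       shell quoting problems with unusual file names.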

       cmscan reexamines the 5' and 3' termini of target sequences using  specialized  algorithms
       for  detection of truncated hits, in which part of the 5' and/or 3' end of the actual full
       length homologous sequence is missing in the target sequence file.  These  types  of  hits
       will  be  most  common  in  sequence  files consisting of unassembled sequencing reads. By
       default, any 5' truncated hit is required to include  the  first  residue  of  the  target
       sequence it derives from in <seqfile>, and any 3' truncated hit is required to include the
       final residue of the target sequence it derives from. Any 5' and  3'  truncated  hit  must
       include the first and final residue of the target sequence it derives from. The --anytrunc
       option will relax the requirements for hit inclusion of sequence endpoints, and  truncated
       hits  are  allowed  to  start  and stop at any positions of target sequences.  Importantly
       though, with --anytrunc, hit E-values will be less accurate because model calibration does
       not  consider  the  possibility  of truncated hits, so use it with caution.  The --notrunc
       option can be used to turn off truncated hit detection.  --notrunc will reduce the running
       time  of  cmscan,  most  significantly  for target <seqfile> files that include many short
       sequences.  Truncated hit detection is automatically turned off when the  --max,  --nohmm,
       --qdb,  or --nonbanded options are used because it relies on the use of an accelerated HMM
       banded alignment strategy that is turned off by any of those options.
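
       The rules in the paragraph above can be summarized in a short Python sketch that reports
       which truncated hit detection mode a given set of options implies. The mode labels
       ("off", "any", "ends-only") are hypothetical names, and the sketch assumes that an option
       which disables detection takes precedence over --anytrunc; whether cmscan accepts such
       combinations at all is not stated here.

```python
# Options that turn truncated hit detection off, per the text above.
DISABLING_OPTIONS = {"--notrunc", "--max", "--nohmm", "--qdb", "--nonbanded"}


def truncation_mode(options):
    """Return 'off', 'any', or 'ends-only' for an iterable of option names."""
    opts = set(options)
    if opts & DISABLING_OPTIONS:
        return "off"        # detection disabled explicitly or implicitly
    if "--anytrunc" in opts:
        return "any"        # truncated hits may start and stop anywhere
    return "ends-only"      # default: hits must touch a sequence terminus


print(truncation_mode(["--anytrunc"]))
print(truncation_mode(["--nohmm", "--anytrunc"]))
```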

OPTIONS

       -h     Help; print a brief reminder of command line usage and all available options.

       -g     Turn on the glocal alignment algorithm, global with respect to the query model  and
              local  with  respect  to  the  target  database.  By  default,  the local alignment
              algorithm is used which is local with respect to both the target sequence  and  the
              model. In local mode, the alignment is allowed to span two or more subsequences if necessary
              (e.g. if the structures of the query model and target sequence are  only  partially
              shared),  allowing  certain  large  insertions and deletions in the structure to be
              penalized differently than normal indels. Local mode performs better  on  empirical
              benchmarks  and  is  significantly  more  sensitive  for remote homology detection.
              Empirically, glocal searches return many fewer hits than local searches, so  glocal
              may be desired for some applications.

       -Z <x> Calculate E-values as if the search space size was <x> megabases (Mb). Without
              this option, the search space size changes for each query sequence: it is
              defined as the length of the current query sequence times 2 (because both strands
              of the sequence will be searched) times the number of CMs in <cmdb>.

       --devhelp
              Print help, as with -h, but also include expert options that are not displayed
              with -h. These expert options are not expected to be relevant for the vast
              majority of users and so are not described in the manual page.  The only  resources
              for  understanding what they actually do are the brief one-line descriptions output
              when --devhelp is enabled, and the source code.
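
       The default search space size described under -Z can be worked through with a small
       Python sketch. The function name and the example numbers are hypothetical; the strands
       parameter is set to 1 to model a single-strand search (--toponly or --bottomonly), which
       halves Z.

```python
def default_search_space_mb(query_len, n_models, strands=2):
    """Default search space size Z in megabases (1 Mb = 1,000,000 nt).

    Z = query length x number of strands searched x number of CMs in <cmdb>.
    """
    return query_len * strands * n_models / 1_000_000


# A 1,000 nt query against a database of 3,000 CMs (illustrative numbers).
print(default_search_space_mb(1000, 3000))              # both strands
print(default_search_space_mb(1000, 3000, strands=1))   # one strand only
```

       Passing a fixed value with -Z instead makes E-values comparable across queries of
       different lengths.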

OPTIONS FOR CONTROLLING OUTPUT

       -o <f> Direct the main human-readable output to a file <f> instead of the default stdout.

       --tblout <f>
              Save a simple tabular (space-delimited) file summarizing the hits found,  with  one
              data  line  per  hit.   The  format  of  this file is described in section 6 of the
              Infernal user guide.

       --fmt <n>
              Specify that the tabular output file written with --tblout <f> be in format <n>.
              Possible values for <n> are 1 or 2. By default <n> is 1 when --tblout is used
              without --fmt. With --fmt 2, nine additional fields are added to the tabular
              output file, most of which pertain to the annotation of overlapping hits. See
              section 6 of the Infernal user guide for a description of both formats.

       --acc  Use accessions instead of names in the main output, where  available  for  profiles
              and/or sequences.

       --noali
              Omit the alignment section from the main output. This can greatly reduce the output
              volume.

       --notextw
              Unlimit the length of each line in the main output. The default is a limit  of  120
              characters  per line, which helps in displaying the output cleanly on terminals and
              in editors, but can truncate target profile description lines.

       --textw <n>
              Set the main output's line length limit to <n> characters per line. The default  is
              120.

       --verbose
              Include  extra  search  pipeline  statistics  in  the main output, including filter
              survival statistics for truncated hit detection and number of  envelopes  discarded
              due to matrix size overflows.

OPTIONS CONTROLLING REPORTING THRESHOLDS

       Reporting thresholds control which hits are reported in output files (the main output and
       --tblout). Hits are ranked by statistical significance (E-value). By default, all hits
       with an E-value <= 10 are reported.  The following options allow you to change the default
       E-value reporting thresholds, or to use bit score thresholds instead.

       -E <x> In the per-target output, report target sequences with an E-value of <=  <x>.   The
              default is 10.0, meaning that on average, about 10 false positives will be reported
              per query, so you can see the top of the noise and  decide  for  yourself  if  it's
              really noise.

       -T <x> Instead  of  thresholding  per-CM output on E-value, report target sequences with a
              bit score of >= <x>.

OPTIONS FOR INCLUSION THRESHOLDS

       Inclusion thresholds are stricter than reporting thresholds.  Inclusion thresholds control
       which  hits  are  considered to be reliable enough to be included in a possible subsequent
       search round, or marked as significant ("!") as  opposed  to  questionable  ("?")  in  hit
       output.

       --incE <x>
              Use  an  E-value  of  <=  <x> as the hit inclusion threshold.  The default is 0.01,
              meaning that on average, about 1 false positive would  be  expected  in  every  100
              searches with different query sequences.

       --incT <x>
              Instead of using E-values for setting the inclusion threshold, use a bit score
              of >= <x> as the hit inclusion threshold. By default this option is unset.
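
       How the reporting and inclusion E-value thresholds interact can be sketched in Python as
       follows. The function name is hypothetical, the defaults mirror -E 10 and --incE 0.01,
       and the sketch assumes the inclusion threshold is no looser than the reporting threshold.

```python
def classify_hit(evalue, report_e=10.0, include_e=0.01):
    """Return '!' for an included hit, '?' for a reported-only hit,
    or None for a hit that would not be reported at all."""
    if evalue <= include_e:
        return "!"   # significant: passes the inclusion threshold
    if evalue <= report_e:
        return "?"   # questionable: reported but not included
    return None      # above the reporting threshold


for e in (1e-8, 0.5, 25.0):
    print(e, classify_hit(e))
```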

OPTIONS FOR MODEL-SPECIFIC SCORE THRESHOLDING

       Curated CM databases may define specific bit score thresholds for each CM, superseding any
       thresholding based on statistical significance alone.

       To  use  these  options,  the  profile  must  contain  the appropriate (GA, TC, and/or NC)
       optional score threshold annotation; this is picked up by cmbuild  from  Stockholm  format
       alignment  files.  Each thresholding option has a score of <x> bits, and acts as if -T <x>
       --incT <x> has been applied specifically using each model's curated thresholds.

       --cut_ga
              Use the GA (gathering) bit scores in the model to set hit reporting  and  inclusion
              thresholds.  GA  thresholds  are  generally  considered  to be the reliable curated
              thresholds defining family membership;  for  example,  in  Rfam,  these  thresholds
              define  what gets included in Rfam Full alignments based on searches with Rfam Seed
              models.

       --cut_nc
              Use the NC (noise cutoff) bit score thresholds in the model to  set  hit  reporting
              and inclusion thresholds. NC thresholds are generally considered to be the score of
              the highest-scoring known false positive.

       --cut_tc
              Use the TC (trusted cutoff) bit score thresholds in the model to set hit  reporting
              and inclusion thresholds. TC thresholds are generally considered to be the score of
              the lowest-scoring known true positive that is above all known false positives.
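
       Since each of these options acts as if -T <x> --incT <x> were applied with the model's
       own curated value, the effect on a single hit can be sketched in Python. The function
       name, the dictionary layout, and the bit score values are hypothetical and chosen only
       for illustration.

```python
def passes_curated_cutoff(bit_score, model_cutoffs, kind="GA"):
    """Apply a model's curated GA, TC, or NC bit score threshold.

    The same value serves as both the reporting and the inclusion
    threshold, so a hit either passes both or neither.
    """
    if kind not in model_cutoffs:
        # The model lacks the optional annotation, so the option cannot apply.
        raise KeyError(f"model has no {kind} threshold annotation")
    return bit_score >= model_cutoffs[kind]


# Hypothetical curated thresholds for one model.
cutoffs = {"GA": 40.0, "TC": 42.1, "NC": 39.8}
print(passes_curated_cutoff(45.0, cutoffs, "GA"))
print(passes_curated_cutoff(39.9, cutoffs, "GA"))
```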

OPTIONS CONTROLLING THE ACCELERATION PIPELINE

       Infernal searches are accelerated in a six-stage filter pipeline. The  first  five  stages
       use  a profile HMM to define envelopes that are passed to the stage six CM CYK filter. Any
       envelopes that survive all filters are assigned final scores using the CM Inside
       algorithm.

       The profile HMM filter is built by the cmbuild program and is stored in <cmdb>.

       Each  successive  filter  is  slower  than  the  previous  one,  but  better  than  it  at
       discriminating between subsequences that may contain high-scoring CM hits and those that do
       not. The first three HMM filter stages are the same as those used in HMMER3.  Stage 1 (F1)
       is the local HMM SSV filter modified for long sequences. Stage 2 (F2)  is  the  local  HMM
       Viterbi  filter.  Stage  3  (F3)  is the local HMM Forward filter. Each of the first three
       stages uses the profile HMM in local mode, which allows a target subsequence to  align  to
       any  region  of  the  HMM.  Stage  4  (F4) is a glocal HMM filter, which requires a target
       subsequence to align to the full-length profile HMM.  Stage  5  (F5)  is  the  glocal  HMM
       envelope definition filter, which uses HMMER3's domain identification heuristics to define
       envelope boundaries. After each stage from 2 to 5 a bias filter step (F2b, F3b,  F4b,  and
       F5b)  is  used  to  remove  sequences  that appear to have passed the filter due to biased
       composition alone. Any envelopes that survive stages F1 through F5b are then passed to
       the  local  CM  CYK  filter.  The  CYK filter uses constraints (bands) derived from an HMM
       alignment of the envelope to reduce the number of required  calculations  and  save  time.
       Any envelopes that pass CYK are scored with the local CM Inside algorithm, again using HMM
       bands for acceleration.

       The default filter thresholds that define the minimum score required for a subsequence  to
       survive each stage are defined based on the size of the search space (Z), which is defined
       as the length of the current  query  sequence  times  2  (because  both  strands  will  be
       searched)  times  the number of profiles in <cmdb>.  However, if either the -Z <x> or --FZ
       <x> options are used then the search space will be considered to be <x>  for  purposes  of
       defining the filter thresholds.

       For  larger  databases,  the  filters  are  more  strict  leading to more acceleration but
       potentially a greater loss of sensitivity. The rationale is  that  for  larger  databases,
       hits  must  have  higher scores to achieve statistical significance, so stricter filtering
       that removes lower scoring insignificant hits is acceptable.

       The P-value thresholds for all possible search space sizes and all filter stages are
       listed next. (A P-value threshold of 0.01 means that roughly 1% of the highest scoring
       nonhomologous subsequences are expected to pass the filter.) Z is the search space size
       defined above: the length of the current query sequence times 2 (because both strands
       will be searched) times the number of profiles in <cmdb>.

       If Z is less than 2 Mb: F1 is 0.35; F2 and F2b are off; F3, F3b, F4, F4b and F5 are  0.02;
       F6 is 0.0001.

       If  Z  is  between 2 Mb and 20 Mb: F1 is 0.35; F2 and F2b are off; F3, F3b, F4, F4b and F5
       are 0.005; F6 is 0.0001.

       If Z is between 20 Mb and 200 Mb: F1 is 0.35; F2 and F2b are 0.15; F3, F3b, F4, F4b and F5
       are 0.003; F6 is 0.0001.

       If  Z  is  between 200 Mb and 2 Gb: F1 is 0.15; F2 and F2b are 0.15; F3, F3b, F4, F4b, F5,
       and F5b are 0.0008; and F6 is 0.0001.

       If Z is between 2 Gb and 20 Gb: F1 is 0.15; F2 and F2b are 0.15; F3, F3b, F4, F4b, F5, and
       F5b are 0.0002; and F6 is 0.0001.

       If  Z  is  more than 20 Gb: F1 is 0.06; F2 and F2b are 0.02; F3, F3b, F4, F4b, F5, and F5b
       are 0.0002; and F6 is 0.0001.

       These thresholds were chosen based on performance on an internal  benchmark  testing  many
       different possible settings.
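
       The list of default thresholds above can be transcribed into a lookup table, as in the
       following Python sketch. The function and key names are hypothetical. The text does not
       say which row applies when Z falls exactly on a boundary, so the sketch assumes each
       range includes its lower bound. The value None stands for a stage that is off, and the
       "F3-F5" entry holds the shared value the text lists for F3, F3b, F4, F4b and F5 (and for
       F5b in the rows where the text names it).

```python
import math

# Columns: upper bound of Z in Mb (exclusive), F1, F2/F2b,
# shared F3-F5 value, F6. None means the stage is off.
_DEFAULT_ROWS = [
    (2.0,      0.35, None, 0.02,   0.0001),
    (20.0,     0.35, None, 0.005,  0.0001),
    (200.0,    0.35, 0.15, 0.003,  0.0001),
    (2000.0,   0.15, 0.15, 0.0008, 0.0001),   # 2000 Mb = 2 Gb
    (20000.0,  0.15, 0.15, 0.0002, 0.0001),   # 20000 Mb = 20 Gb
    (math.inf, 0.06, 0.02, 0.0002, 0.0001),
]


def default_filter_thresholds(z_mb):
    """Look up default P-value thresholds for a search space of z_mb Mb."""
    if z_mb < 0:
        raise ValueError("search space size must be non-negative")
    for upper, f1, f2, f3_to_f5, f6 in _DEFAULT_ROWS:
        if z_mb < upper:
            return {"F1": f1, "F2/F2b": f2, "F3-F5": f3_to_f5, "F6": f6}
    raise ValueError("search space size must be a finite number")


print(default_filter_thresholds(6.0))
```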

       There are six options for controlling the general filtering level. These options are, in
       order from least strict (slowest but most sensitive) to most strict (fastest but least
       sensitive): --max, --nohmm, --mid, --default (this is the default setting), --rfam, and
       --hmmonly. With --default the filter thresholds will be database-size dependent. See the
       explanation of each of these individual options below for more information.

       Additionally, an expert user can precisely control each filter stage score threshold with
       the --F1, --F1b, --F2, --F2b, --F3, --F3b, --F4, --F4b, --F5, --F5b, and --F6 options, and
       can turn each stage on or off with the --noF1, --doF1b, --noF2, --noF2b, --noF3, --noF3b,
       --noF4, --noF4b, --noF5, and --noF6 options. These options are only displayed if the
       --devhelp option is used, to keep the number of options displayed with -h reasonable and
       because they are only expected to be useful to a small minority of users.

       As a special case, for any models in <cmdb> which have zero basepairs, profile HMM
       searches  are  run  instead  of  CM  searches.  HMM  algorithms are more efficient than CM
       algorithms, and the benefit of  CM  algorithms  is  lost  for  models  with  no  secondary
       structure  (zero basepairs). These profile HMM searches will run significantly faster than
       the CM searches. You can force HMM-only searches  with  the  --hmmonly  option.  For  more
       information on HMM-only searches see the user guide.

       --max  Turn  off  all  filters,  and  run  non-banded  Inside  on every full-length target
              sequence. This increases sensitivity somewhat, at an extremely large cost in speed.

       --nohmm
              Turn off all HMM filter stages (F1 through F5b). The CYK filter, using  QDBs,  will
              be run on every full-length target sequence and will enforce a P-value threshold of
              0.0001. Each subsequence that survives CYK will be passed  to  Inside,  which  will
              also  use  QDBs  (but a looser set). This increases sensitivity somewhat, at a very
              large cost in speed.

       --mid  Turn off the HMM SSV and Viterbi filter stages (F1 through F2b). Set remaining HMM
              filter thresholds (F3 through F5b) to 0.02 by default, but changeable to <x> with
              --Fmid <x>. This may increase sensitivity, at a significant cost in speed.

       --default
              Use the default filtering strategy. This  option  is  on  by  default.  The  filter
              thresholds are determined based on the database size.

       --rfam Use a strict filtering strategy devised for large databases (more than 20 Gb). This
              will accelerate the search at a potential cost to sensitivity.

       --hmmonly
              Only use the filter profile HMM for searches; do not use the CM. Only filter
              stages  F1  through  F3 will be executed, using strict P-value thresholds (0.02 for
              F1, 0.001 for F2 and 0.00001 for F3).  Additionally a bias  composition  filter  is
              used  after  the  F1 stage (with P=0.02 survival threshold).  Any hit that survives
              all stages and has an HMM E-value or bit score that meets the reporting threshold will
              be  output.   The  user  can change the HMM-only filter thresholds and options with
              --hmmF1, --hmmF2, --hmmF3, --hmmnobias, --hmmnonull2, and  --hmmmax.   By  default,
              searches  for  any model with zero basepairs will be run in HMM-only mode. This can
              be turned off, forcing CM searches for these models with the --nohmmonly option.

       --FZ <x>
              Set filter thresholds as the defaults used if the database were <x> megabases (Mb).
              If  used  with  <x>  greater  than 20000 (20 Gb) this option has the same effect as
              --rfam.

       --Fmid <x>
              With the --mid option, set the HMM filter thresholds (F3 through F5b) to <x>. By
              default, <x> is 0.02.

OTHER OPTIONS

       --notrunc
              Turn off truncated hit detection.

       --anytrunc
              Allow  truncated  hits  to  begin  and end at any position in a target sequence. By
              default, 5' truncated hits must include the first residue of their target  sequence
              and 3' truncated hits must include the final residue of their target sequence. With
              this option you may observe fewer full length hits that extend to the beginning and
              end of the query CM.

       --nonull3
              Turn  off the null3 CM score corrections for biased composition. This correction is
              not used during the HMM filter stages.

       --mxsize <x>
              Set the maximum allowable CM DP matrix size to <x> megabytes. By default this  size
              is  128  Mb.   This  should  be  large  enough  for  the vast majority of searches,
              especially with smaller models.  If cmscan encounters an envelope  in  the  CYK  or
              Inside  stage  that  requires a larger matrix, the envelope will be discounted from
              consideration. This behavior is like an additional filter that  prevents  expensive
              (slow)  CM  DP  calculations, but at a potential cost to sensitivity.  Note that if
              cmscan is being run in <n> multiple threads on a multicore machine then each thread
              may have an allocated matrix of up to size <x> Mb at any given time.

       --smxsize <x>
              Set  the  maximum  allowable  CM search DP matrix size to <x> megabytes. By default
              this size is 128 Mb.  This option is only relevant if  the  CM  will  not  use  HMM
              banded  matrices,  i.e.  if  the  --max,  --nohmm,  --qdb,  --fqdb, --nonbanded, or
              --fnonbanded options are also used. Note that if cmscan is being run in <n>
              multiple  threads  on  a  multicore  machine then each thread may have an allocated
              matrix of up to size <x> Mb at any given time.

       --cyk  Use the CYK algorithm, not Inside, to determine the final score of all hits.

       --acyk Use the CYK algorithm to align hits. By default, the Durbin/Holmes optimal accuracy
              algorithm  is  used, which finds the alignment that maximizes the expected accuracy
              of all aligned residues.

       --wcx <x>
              For each CM, set the W parameter, the expected maximum length  of  a  hit,  to  <x>
              times  the  consensus length of the model. By default, the W parameter is read from
              the CM file and was calculated based on the transition probabilities of  the  model
              by cmbuild.  You can find out what the default W is for a model using cmstat.  This
              option should be used with caution as it impacts the filtering pipeline at  several
              different  stages  in  nonobvious  ways.  It  is  only recommended for expert users
              searching for hits that are much longer than any of the homologs used to build  the
              model  in  cmbuild,  e.g.  ones  with  large introns or other large insertions.  It
              cannot be used in combination with the --nohmm, --fqdb or --qdb options because  in
              those cases W is limited by query-dependent bands.

       --toponly
              Only  search the top (Watson) strand of target sequences in <seqfile>.  By default,
              both strands are searched. This will halve the search space size (Z).

       --bottomonly
              Only search the bottom  (Crick)  strand  of  target  sequences  in  <seqfile>.   By
              default, both strands are searched. This will halve the search space size (Z).

       --qformat <s>
              Assert  that  the  query sequence database file is in format <s>.  Accepted formats
              include fasta, embl, genbank, ddbj, stockholm, pfam, a2m, afa, clustal, and phylip.
              The default is to autodetect the format of the file.

       --glist <f>
              Configure a subset of models from <cmdb> in glocal alignment mode, instead of
              local mode, namely the models listed in file <f>. Configure all other models
              (those not listed in <f>) in local mode. This option is incompatible with -g.
              File <f> must list valid names of models from <cmdb>, each separated by any
              whitespace character (e.g. a newline character).

       --clanin <f>
              Read clan information on the models in <cmdb> from file <f>. Not all models in
              <cmdb> need to be a member of a clan. This option must be used in combination
              with --fmt 2 and --tblout because clan annotation is only output in format 2 of the
              tabular output file.  See section 9 of the Infernal user guide  for  specifications
              on the format of the clan input file <f>.

       --oclan
              Only  mark  overlaps  between models in the same clan.  This option must be used in
              combination with --fmt 2, --tblout, and --clanin because clan annotation is only
              output  in  format  2  of the tabular output file, and clan information can only be
              input using the --clanin option.

       --oskip
              Omit any hit h from the tabular output file that satisfies the  following:  another
              hit  h2  overlaps  with  h and the E-value of h2 is lower than that of h, and h2 is
              itself not omitted. Hit h will not appear in the tabular output file,  although  it
              will  still  exist in the standard output.  This option must be used in combination
              with --fmt 2 and --tblout because overlap annotation is only output in format 2 of the
              tabular  output  file.   When  used  in  combination  with --oclan only hits h that
              satisfy the following are omitted: another hit h2 overlaps with h, the  E-value  of
              h2  is  lower  than that of h, and both h and h2 are hits to models that are in the
              same clan.

       --cpu <n>
              Set the number of parallel worker threads to <n>.  By default, Infernal  sets  this
              to  the  number  of  CPU  cores  it  detects in your machine - that is, it tries to
              maximize the use of your available processor cores. Setting  <n>  higher  than  the
              number  of available cores is of little if any value, but you may want to set it to
              something less. You  can  also  control  this  number  by  setting  an  environment
              variable,  INFERNAL_NCPU.   This  option is only available if Infernal was compiled
              with POSIX threads support. This is the default, but it may have been turned off at
              compile-time for your site or machine for some reason.

       --stall
              For  debugging  the  MPI  master/worker  version:  pause after start, to enable the
              developer to attach debuggers to the running master and worker(s)  processes.  Send
              SIGCONT  signal  to  release  the  pause.   (Under gdb: (gdb) signal SIGCONT) (Only
              available if optional MPI support was enabled at compile-time.)

       --mpi  Run in MPI master/worker mode, using  mpirun.   (Only  available  if  optional  MPI
              support was enabled at compile-time.)
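
       The omission rule described under --oskip can be illustrated with a short Python sketch.
       This is not Infernal's implementation. The Hit record, the function names, and the
       example hits are hypothetical, and the sketch assumes that two hits overlap when they lie
       on the same strand of the same sequence and share at least one position. For the --oclan
       variant it additionally assumes that the competing hit must itself be retained, which the
       text states only for the plain --oskip case.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    model: str
    seq: str
    strand: str
    start: int    # 1-based coordinates with start <= end (simplification)
    end: int
    evalue: float


def overlaps(a, b):
    """Assumed overlap test: same sequence and strand, intervals intersect."""
    return (a.seq == b.seq and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)


def oskip(hits, clan_of=None):
    """Return the hits that remain in the tabular output.

    A hit is omitted if a retained hit with a strictly lower E-value
    overlaps it. If clan_of (model name -> clan name) is given, this
    mimics --oclan: only same-clan overlaps cause omission.
    """
    kept = []
    # Visit hits from best to worst E-value, so every potential
    # competitor of a hit has been decided before the hit itself.
    for h in sorted(hits, key=lambda x: x.evalue):
        def beats(k):
            if not (overlaps(k, h) and k.evalue < h.evalue):
                return False
            if clan_of is None:
                return True
            clan = clan_of.get(h.model)
            return clan is not None and clan == clan_of.get(k.model)
        if not any(beats(k) for k in kept):
            kept.append(h)
    return kept


hits = [
    Hit("A", "seq1", "+", 1, 100, 1e-10),
    Hit("B", "seq1", "+", 50, 150, 1e-5),    # overlaps A, worse E-value
    Hit("C", "seq1", "+", 120, 200, 1e-3),   # overlaps only B
]
print([h.model for h in oskip(hits)])
```

       In this example hit B is omitted because of A, while C is retained: the only hit that
       overlaps C is B, and B was itself omitted.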

SEE ALSO

       See  infernal(1)  for  a  master  man page with a list of all the individual man pages for
       programs in the Infernal package.

       For complete documentation, see the user guide that came with your  Infernal  distribution
       (Userguide.pdf); or see the Infernal web page (http://eddylab.org/infernal/).

       Copyright (C) 2020 Howard Hughes Medical Institute.
       Freely distributed under the BSD open source license.

       For  additional  information  on copyright and licensing, see the file called COPYRIGHT in
       your    Infernal    source    distribution,    or    see    the    Infernal    web    page
       (http://eddylab.org/infernal/).

AUTHOR

       http://eddylab.org