NAME

       hmmscan - search sequence(s) against a profile database

SYNOPSIS

       hmmscan [options] hmmdb seqfile

DESCRIPTION

       hmmscan  is  used to search protein sequences against collections of protein profiles. For
       each sequence in seqfile, use that  query  sequence  to  search  the  target  database  of
       profiles  in  hmmdb,  and  output  ranked  lists of the profiles with the most significant
       matches to the sequence.

       The seqfile may contain more than one query  sequence.  Each  will  be  searched  in  turn
       against hmmdb.

       The hmmdb must be prepared with hmmpress before it can be searched with hmmscan. hmmpress
       creates four binary files alongside it, with the suffixes .h3f, .h3i, .h3m, and .h3p.
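
       A minimal Python sketch of a pre-flight check for those four files is shown below. The
       helper name missing_press_files is hypothetical and not part of HMMER; the sketch
       assumes each suffix is appended to the full hmmdb file name.

```python
from pathlib import Path

# Suffixes of the four binary files that hmmpress is described as creating.
PRESS_SUFFIXES = (".h3f", ".h3i", ".h3m", ".h3p")


def missing_press_files(hmmdb):
    """Return the expected auxiliary file names that do not exist for `hmmdb`.

    Hypothetical helper: assumes each suffix is appended to the full hmmdb name.
    """
    return [hmmdb + suffix for suffix in PRESS_SUFFIXES
            if not Path(hmmdb + suffix).exists()]
```

       An empty list means all four files are present; otherwise run hmmpress on the hmmdb
       first.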

       The query seqfile may be '-' (a dash character), in which case the query sequences are
       read from standard input instead of from a file. The hmmdb cannot be read from standard
       input, because it needs the four auxiliary binary files generated by hmmpress.

       The output format is designed to be human-readable, but is often so voluminous that
       reading it is impractical, and parsing it is a pain. The --tblout and --domtblout options
       save output in simple tabular formats that are concise and easier to parse. The -o option
       redirects the main output to a file, or discards it when the file is /dev/null.
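
       As an illustration, the following Python sketch assembles such a command line. The
       function name hmmscan_command and its parameters are hypothetical; only the hmmscan
       options named above (-o, --tblout, --domtblout) and the argument order from the
       SYNOPSIS are taken from this page.

```python
import os


def hmmscan_command(hmmdb, seqfile, tblout=None, domtblout=None, discard_main=False):
    """Build an argument list for hmmscan (hypothetical helper, not part of HMMER)."""
    cmd = ["hmmscan"]
    if discard_main:
        # Send the voluminous main output to the null device.
        cmd += ["-o", os.devnull]
    if tblout is not None:
        cmd += ["--tblout", tblout]        # per-target table
    if domtblout is not None:
        cmd += ["--domtblout", domtblout]  # per-domain table
    # Options come first, then the profile database, then the query sequences.
    cmd += [hmmdb, seqfile]
    return cmd
```

       Where HMMER is installed, the returned list can be passed to subprocess.run(cmd,
       check=True).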

OPTIONS

       -h     Help; print a brief reminder of command line usage and all available options.

OPTIONS FOR CONTROLLING OUTPUT

       -o <f> Direct the main human-readable output to a file <f> instead of the default stdout.

       --tblout <f>
              Save  a  simple  tabular  (space-delimited) file summarizing the per-target output,
              with one data line per homologous target model found.

       --domtblout <f>
              Save a simple tabular (space-delimited) file  summarizing  the  per-domain  output,
              with  one  data  line  per  homologous domain detected in a query sequence for each
              homologous model.

       --pfamtblout <f>
              Save an especially succinct tabular (space-delimited)  file  summarizing  the  per-
              target output, with one data line per homologous target model found.

       --acc  Use  accessions  instead  of names in the main output, where available for profiles
              and/or sequences.

       --noali
              Omit the alignment section from the main output. This can greatly reduce the output
              volume.

       --notextw
              Unlimit  the  length of each line in the main output. The default is a limit of 120
              characters per line, which helps in displaying the output cleanly on terminals  and
              in editors, but can truncate target profile description lines.

       --textw <n>
              Set  the main output's line length limit to <n> characters per line. The default is
              120.
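
       The --tblout file can be read with a few lines of Python. The sketch below is not an
       official parser. It assumes the column layout given in the HMMER User's Guide (which
       this page does not describe): comment lines start with '#', the first 18
       whitespace-separated fields are fixed, and the rest of the line is the free-text
       target description. The name parse_tblout and the dictionary keys are hypothetical.

```python
def parse_tblout(lines):
    """Parse hmmscan --tblout lines into dicts (assumed column layout, see lead-in)."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        # Split off 18 fixed fields; the remainder is the description (may hold spaces).
        fields = line.split(None, 18)
        hits.append({
            "target": fields[0],
            "target_accession": fields[1],
            "query": fields[2],
            "evalue": float(fields[4]),   # assumed: full-sequence E-value
            "score": float(fields[5]),    # assumed: full-sequence bit score
            "description": fields[18].strip() if len(fields) > 18 else "",
        })
    return hits
```

       Typical use would be parse_tblout(open("hits.tbl")), where hits.tbl is whatever file
       name was given to --tblout.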

OPTIONS FOR REPORTING THRESHOLDS

       Reporting thresholds control which hits are reported in output  files  (the  main  output,
       --tblout, and --domtblout).

       -E <x> In  the  per-target  output, report target profiles with an E-value of <= <x>.  The
              default is 10.0, meaning that on average, about 10 false positives will be reported
              per  query,  so  you  can  see the top of the noise and decide for yourself if it's
              really noise.

       -T <x> Instead of thresholding per-profile output on E-value, report target profiles
              with a bit score of >= <x>.

       --domE <x>
              In  the per-domain output, for target profiles that have already satisfied the per-
              profile reporting threshold, report individual domains with a  conditional  E-value
              of  <=  <x>.  The default is 10.0.  A conditional E-value means the expected number
              of additional  false  positive  domains  in  the  smaller  search  space  of  those
              comparisons  that  already  satisfied the per-profile reporting threshold (and thus
              must have at least one homologous domain already).

       --domT <x>
              Instead of thresholding per-domain output on E-value, report domains with a bit
              score of >= <x>.
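
       The per-target reporting rule described under -E and -T can be sketched in Python as
       follows. This only illustrates the decision as stated on this page, not HMMER's
       implementation; the function and parameter names are hypothetical.

```python
def is_reported(evalue, bit_score, max_evalue=10.0, min_score=None):
    """Decide whether a target profile is reported (illustrative sketch only).

    max_evalue plays the role of -E (default 10.0); min_score plays the role of -T.
    When a bit score threshold is given, it replaces the E-value threshold.
    """
    if min_score is not None:
        return bit_score >= min_score
    return evalue <= max_evalue
```

       With the defaults, a target with an E-value of 3.0 is reported and one with an E-value
       of 25.0 is not, unless a bit score threshold that it satisfies is supplied instead.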

OPTIONS FOR INCLUSION THRESHOLDS

       Inclusion thresholds are stricter than reporting thresholds.  Inclusion thresholds control
       which hits are considered to be reliable enough to be included in an output alignment or a
       subsequent search round. hmmscan has neither alignment output (unlike hmmsearch or
       phmmer) nor iterative search steps (unlike jackhmmer), so inclusion thresholds have
       little effect: they only affect which domains get marked as significant (!) or
       questionable (?) in the domain output.

       --incE <x>
              Use an E-value of <= <x> as the per-target inclusion  threshold.   The  default  is
              0.01,  meaning  that  on average, about 1 false positive would be expected in every
              100 searches with different query sequences.

       --incT <x>
              Instead of using E-values for setting the inclusion threshold, use a bit score
              of >= <x> as the per-target inclusion threshold. It would be unusual to use bit
              score thresholds with hmmscan, because you don't expect a single score threshold
              to work for different profiles; different profiles have slightly different
              expected score distributions.

       --incdomE <x>
              Use a conditional E-value of <= <x>  as  the  per-domain  inclusion  threshold,  in
              targets  that  have  already  satisfied the overall per-target inclusion threshold.
              The default is 0.01.

       --incdomT <x>
              Instead of using E-values, use a bit score of >= <x> as the per-domain inclusion
              threshold. As with --incT above, it would be unusual to use a single bit score
              threshold in hmmscan.
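
       How the E-value inclusion thresholds combine into the (!) and (?) marks can be
       sketched in Python as below. This is an interpretation of the descriptions of --incE
       and --incdomE above, with hypothetical names, and should not be read as HMMER's exact
       logic.

```python
def domain_mark(target_evalue, domain_cond_evalue, inc_e=0.01, incdom_e=0.01):
    """Return '!' for a significant domain and '?' for a questionable one (sketch).

    inc_e stands in for --incE and incdom_e for --incdomE (both default to 0.01).
    A domain counts as significant only if its target passed the per-target
    inclusion threshold and the domain itself passed the per-domain one.
    """
    target_included = target_evalue <= inc_e
    domain_included = domain_cond_evalue <= incdom_e
    return "!" if target_included and domain_included else "?"
```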

OPTIONS FOR MODEL-SPECIFIC SCORE THRESHOLDING

       Curated profile databases may define specific  bit  score  thresholds  for  each  profile,
       superseding any thresholding based on statistical significance alone.

       To  use  these  options,  the  profile  must  contain  the appropriate (GA, TC, and/or NC)
       optional score threshold annotation; this is picked up by hmmbuild from  Stockholm  format
       alignment files. Each thresholding option has two scores: the per-sequence threshold <x1>
       and the per-domain threshold <x2>. These act as if -T <x1> --incT <x1> --domT <x2>
       --incdomT <x2> had been applied, using each model's own curated thresholds.
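
       The equivalence can be written out for a single model as a small Python sketch. The
       helper name cutoff_as_options is hypothetical; hmmscan itself applies the curated
       thresholds per model, which a single set of command-line options cannot do.

```python
def cutoff_as_options(x1, x2):
    """Spell out the options that one model's curated cutoffs act like (sketch).

    x1 is the per-sequence threshold and x2 the per-domain threshold.
    """
    return ["-T", str(x1), "--incT", str(x1),
            "--domT", str(x2), "--incdomT", str(x2)]
```

       For a model with made-up cutoffs of 25.0 and 20.0, this gives -T 25.0 --incT 25.0
       --domT 20.0 --incdomT 20.0.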

       --cut_ga
              Use  the  GA (gathering) bit scores in the model to set per-sequence (GA1) and per-
              domain (GA2) reporting  and  inclusion  thresholds.  GA  thresholds  are  generally
              considered  to  be  the reliable curated thresholds defining family membership; for
              example, in  Pfam,  these  thresholds  define  what  gets  included  in  Pfam  Full
              alignments based on searches with Pfam Seed models.

       --cut_nc
              Use  the  NC  (noise  cutoff) bit score thresholds in the model to set per-sequence
              (NC1) and per-domain (NC2) reporting and inclusion thresholds.  NC  thresholds  are
              generally considered to be the score of the highest-scoring known false positive.

       --cut_tc
              Use the TC (trusted cutoff) bit score thresholds in the model to set per-sequence
              (TC1) and per-domain (TC2) reporting and inclusion thresholds.  TC  thresholds  are
              generally considered to be the score of the lowest-scoring known true positive that
              is above all known false positives.

CONTROL OF THE ACCELERATION PIPELINE

       HMMER3 searches are accelerated in a three-step filter pipeline: the MSV filter, the
       Viterbi filter, and the Forward filter. The first filter is the fastest and most
       approximate; the last is the full Forward scoring algorithm. There is also a bias filter
       step between MSV and Viterbi. Targets that pass all the steps in the acceleration pipeline
       are then subjected to postprocessing -- domain identification and scoring using the
       Forward/Backward algorithm.

       Changing filter thresholds only removes or includes targets from consideration; changing
       filter thresholds does not alter bit scores, E-values, or alignments, all of which are
       determined solely in postprocessing.

       --max  Turn off all filters, including the bias filter, and run full Forward/Backward
              postprocessing on every target. This increases sensitivity somewhat, at a large
              cost in speed.

       --F1 <x>
              Set the P-value threshold for the MSV filter step. The default is 0.02, meaning
              that roughly 2% of the highest scoring nonhomologous targets are expected to pass
              the filter.

       --F2 <x>
              Set the P-value threshold for the Viterbi filter step. The default is 0.001.

       --F3 <x>
              Set the P-value threshold for the Forward filter step. The default is 1e-5.

       --nobias
              Turn off the bias filter. This increases sensitivity somewhat, but can come at a
              high cost in speed, especially if the query has biased residue composition (such as
              a repetitive sequence region, or if it is a membrane protein with large regions of
              hydrophobicity). Without the bias filter, too many sequences may pass the filter
              with biased queries, leading to slower than expected performance as the
              computationally intensive Forward/Backward algorithms shoulder an abnormally heavy
              load.
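
       To get a feel for the default filter thresholds, the Python sketch below estimates how
       many nonhomologous targets would be expected to survive through each step. It treats
       each P-value threshold as the approximate fraction of nonhomologous targets still
       remaining after that step, which is a simplification, and it ignores the bias filter.
       The names are hypothetical.

```python
# Default P-value thresholds for the three filter steps (--F1, --F2, --F3).
DEFAULT_THRESHOLDS = (("MSV", 0.02), ("Viterbi", 0.001), ("Forward", 1e-5))


def expected_survivors(n_nonhomologous, thresholds=DEFAULT_THRESHOLDS):
    """Rough expected count of nonhomologous targets remaining after each filter."""
    return {name: n_nonhomologous * p for name, p in thresholds}
```

       For example, against 1000 nonhomologous profiles this gives roughly 20 survivors after
       MSV, 1 after Viterbi, and 0.01 after Forward, so very few reach postprocessing.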

OTHER OPTIONS

       --nonull2
              Turn off the null2 score corrections for biased composition.

       -Z <x> Assert  that  the total number of targets in your searches is <x>, for the purposes
              of per-sequence E-value calculations, rather than  the  actual  number  of  targets
              seen.

       --domZ <x>
              Assert  that  the total number of targets in your searches is <x>, for the purposes
              of per-domain conditional E-value calculations, rather than the number  of  targets
              that passed the reporting thresholds.

       --seed <n>
              Set  the  random  number  seed  to <n>.  Some steps in postprocessing require Monte
              Carlo simulation.  The default is to use a fixed seed (42),  so  that  results  are
              exactly  reproducible.  Any  other  positive  integer will give different (but also
              reproducible) results. A choice of 0 uses an arbitrarily chosen seed.

       --qformat <s>
              Assert that input seqfile is in format <s>, bypassing format autodetection.  Common
              choices for <s> include: fasta, embl, genbank.  Alignment formats also work; common
              choices  include:  stockholm,  a2m,  afa,  psiblast,  clustal,  phylip.   For  more
              information,  and  for  codes for some less common formats, see main documentation.
              The string <s> is case-insensitive (fasta or FASTA both work).

       --cpu <n>
              Set the number of parallel worker threads to <n>.  The default is  0,  meaning  off
              (no thread-level parallelization), because hmmscan is typically I/O bound and the
              extra overhead of our current multithreaded implementation isn't  worthwhile.   You
              can also control this number by setting an environment variable, HMMER_NCPU.  There
              is also a master thread, so the actual number of threads that HMMER  spawns  is  at
              least <n>+1.

              This  option  is  not  available  if  HMMER was compiled with POSIX threads support
              turned off.

       --stall
              For debugging the MPI master/worker version: pause after start, to enable the
              developer to attach debuggers to the running master and worker processes. Send
              a SIGCONT signal to release the pause. (Under gdb: (gdb) signal SIGCONT)

              (Only available if optional MPI support was enabled at compile-time.)

       --mpi  Run under  MPI  control  with  master/worker  parallelization  (using  mpirun,  for
              example,  or  equivalent).  Only  available  if optional MPI support was enabled at
              compile-time.

COPYRIGHT

       Copyright (C) 2023 Howard Hughes Medical Institute.
       Freely distributed under the BSD open source license.

       For additional information on copyright and licensing, see the file  called  COPYRIGHT  in
       your HMMER source distribution, or see the HMMER web page (http://hmmer.org/).

AUTHOR

       http://eddylab.org