NAME

       hmmsearch - search profile(s) against a sequence database

SYNOPSIS

       hmmsearch [options] hmmfile seqdb

DESCRIPTION

       hmmsearch  is  used  to search one or more profiles against a sequence database.  For each
       profile in hmmfile, use that query profile to search the target database of  sequences  in
       seqdb,  and  output ranked lists of the sequences with the most significant matches to the
       profile.  To build profiles from multiple alignments, see hmmbuild.

       Either the query hmmfile or the target seqdb may be '-' (a dash character), in which  case
       the  query profile or target database input will be read from a stdin pipe instead of from
       a file. Only one input source can come through stdin, not both.  An exception is  that  if
       the  hmmfile  contains  more  than  one  profile query, then seqdb cannot come from stdin,
       because we can't rewind the streaming target database to search it with another profile.
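
       As an illustration of the stdin convention above, the sketch below streams a
       gzip-compressed target file into hmmsearch. The helper name search_gz and its file
       arguments are hypothetical. Passing --tformat fasta is a cautious assumption about the
       input, not a documented requirement. Because the targets arrive on a pipe, the profile
       file should hold a single profile.

```shell
# Sketch: stream a compressed target database through stdin.
# search_gz is a hypothetical helper.
#   $1 = HMM file containing a single profile
#   $2 = gzip-compressed FASTA file (assumed format)
search_gz() {
    # '-' in the seqdb position tells hmmsearch to read targets from stdin.
    gzip -dc "$2" | hmmsearch --tformat fasta "$1" -
}

# Usage (hypothetical file names):
#   search_gz query.hmm targets.fasta.gz
```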

       The output format is designed to be  human-readable,  but  is  often  so  voluminous  that
       reading  it is impractical, and parsing it is a pain. The --tblout and --domtblout options
       save output in simple tabular formats that are concise and easier to parse.  The -o option
       allows redirecting the main output, including throwing it away in /dev/null.
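
       A minimal sketch of that workflow: discard the main report and keep only the two tabular
       files. The helper names, the output prefix, and the file names are hypothetical. The
       column positions used by list_hits (column 1 as target name, column 5 as full-sequence
       E-value, '#' for comment lines) are an assumption about the table layout; check the
       header lines of your own file before relying on them.

```shell
# Sketch: keep only the tabular summaries, throw away the human-readable report.
# search_tables is a hypothetical helper.
#   $1 = hmmfile, $2 = seqdb, $3 = prefix for the output tables
search_tables() {
    hmmsearch -o /dev/null --tblout "$3.tbl" --domtblout "$3.domtbl" "$1" "$2"
}

# List target name and E-value from a --tblout file.
# Assumption: comment lines start with '#', column 1 is the target name,
# and column 5 is the full-sequence E-value.
list_hits() {
    awk '!/^#/ { print $1, $5 }' "$1"
}

# Usage (hypothetical file names):
#   search_tables query.hmm targets.fasta run1
#   list_hits run1.tbl
```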

OPTIONS

       -h     Help; print a brief reminder of command line usage and all available options.

OPTIONS FOR CONTROLLING OUTPUT

       -o <f> Direct the main human-readable output to a file <f> instead of the default stdout.

       -A <f> Save  a  multiple  alignment  of  all  significant hits (those satisfying inclusion
              thresholds) to the file <f>.

       --tblout <f>
              Save a simple tabular (space-delimited) file  summarizing  the  per-target  output,
              with one data line per homologous target sequence found.

       --domtblout <f>
              Save  a  simple  tabular  (space-delimited) file summarizing the per-domain output,
              with one data line per homologous domain detected in a target sequence for each
              query profile.

       --acc  Use  accessions  instead  of names in the main output, where available for profiles
              and/or sequences.

       --noali
              Omit the alignment section from the main output. This can greatly reduce the output
              volume.

       --notextw
              Remove the limit on the length of each line in the main output. The default is a
              limit of 120 characters per line, which helps in displaying the output cleanly on
              terminals and in editors, but can truncate target sequence description lines.

       --textw <n>
              Set  the main output's line length limit to <n> characters per line. The default is
              120.

OPTIONS CONTROLLING REPORTING THRESHOLDS

       Reporting thresholds control which hits are reported in output  files  (the  main  output,
       --tblout,  and  --domtblout).   Sequence  hits  and  domain hits are ranked by statistical
       significance (E-value) and output is generated in two sections called per-target and  per-
       domain  output.  In per-target output, by default, all sequence hits with an E-value <= 10
       are reported. In the per-domain  output,  for  each  target  that  has  passed  per-target
       reporting thresholds, all domains satisfying per-domain reporting thresholds are reported.
       By default, these are domains with conditional E-values of <= 10.  The  following  options
       allow  you  to  change  the  default  E-value  reporting  thresholds,  or to use bit score
       thresholds instead.

       -E <x> In the per-target output, report target sequences with an E-value of <=  <x>.   The
              default is 10.0, meaning that on average, about 10 false positives will be reported
              per query, so you can see the top of the noise and  decide  for  yourself  if  it's
              really noise.

       -T <x> Instead of thresholding per-target output on E-value, report target sequences with
              a bit score of >= <x>.

       --domE <x>
              In the per-domain output, for target sequences that have already satisfied the
              per-target reporting threshold, report individual domains with a conditional
              E-value of <= <x>.  The default is 10.0.  A conditional E-value is the expected
              number of additional false positive domains in the smaller search space of those
              comparisons that already satisfied the per-target reporting threshold (and thus
              must have at least one homologous domain already).

       --domT <x>
              Instead of thresholding per-domain output on E-value, report domains with a bit
              score of >= <x>.
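
       The two styles of reporting threshold can be sketched as follows. The helper names are
       hypothetical and the threshold values are arbitrary examples, not recommendations.

```shell
# Sketch: two ways to tighten the reporting thresholds.
# Both helpers are hypothetical; $1 = hmmfile, $2 = seqdb.

# E-value based: report targets with E-value <= 1e-3
# and domains with conditional E-value <= 1e-3.
report_by_evalue() {
    hmmsearch -E 1e-3 --domE 1e-3 "$1" "$2"
}

# Bit-score based: report targets and domains scoring >= 25 bits.
report_by_score() {
    hmmsearch -T 25 --domT 25 "$1" "$2"
}

# Usage (hypothetical file names):
#   report_by_evalue query.hmm targets.fasta
#   report_by_score  query.hmm targets.fasta
```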

OPTIONS FOR INCLUSION THRESHOLDS

       Inclusion thresholds are stricter than reporting thresholds.  Inclusion thresholds control
       which hits are considered to be reliable enough to be included in an output alignment or a
       subsequent search round, or marked as significant ("!") as opposed to  questionable  ("?")
       in domain output.

       --incE <x>
              Use an E-value of <= <x> as the per-target inclusion threshold.  The default is
              0.01, meaning that on average, about 1 false positive would be expected in every
              100 searches with different query profiles.

       --incT <x>
              Instead of using E-values for setting the inclusion threshold, use a bit score of
              >= <x> as the per-target inclusion threshold.  By default this option is unset.

       --incdomE <x>
              Use  a  conditional  E-value  of  <=  <x> as the per-domain inclusion threshold, in
              targets that have already satisfied the  overall  per-target  inclusion  threshold.
              The default is 0.01.

       --incdomT <x>
              Instead  of  using  E-values, use a bit score of >= <x> as the per-domain inclusion
              threshold.
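
       A sketch of combining inclusion thresholds with -A, so that only hits passing the
       stricter cutoffs reach the saved alignment. The helper name, the file arguments, and the
       value 1e-5 are hypothetical choices. Reporting thresholds are left at their defaults, so
       weaker hits can still appear in the main output.

```shell
# Sketch: save an alignment of the hits that pass stricter inclusion thresholds.
# save_included is a hypothetical helper.
#   $1 = hmmfile, $2 = seqdb, $3 = file to receive the alignment
save_included() {
    hmmsearch --incE 1e-5 --incdomE 1e-5 -A "$3" "$1" "$2"
}

# Usage (hypothetical file names):
#   save_included query.hmm targets.fasta included_hits.aln
```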

OPTIONS FOR MODEL-SPECIFIC SCORE THRESHOLDING

       Curated profile databases may define specific  bit  score  thresholds  for  each  profile,
       superseding any thresholding based on statistical significance alone.

       To  use  these  options,  the  profile  must  contain  the appropriate (GA, TC, and/or NC)
       optional score threshold annotation; this is picked up by hmmbuild from  Stockholm  format
       alignment files. Each thresholding option has two scores: the per-sequence threshold <x1>
       and the per-domain threshold <x2>.  These act as if -T <x1> --incT <x1> --domT <x2>
       --incdomT <x2> had been applied specifically using each model's curated thresholds.

       --cut_ga
              Use  the  GA (gathering) bit scores in the model to set per-sequence (GA1) and per-
              domain (GA2) reporting  and  inclusion  thresholds.  GA  thresholds  are  generally
              considered  to  be  the reliable curated thresholds defining family membership; for
              example, in  Pfam,  these  thresholds  define  what  gets  included  in  Pfam  Full
              alignments based on searches with Pfam Seed models.

       --cut_nc
              Use  the  NC  (noise  cutoff) bit score thresholds in the model to set per-sequence
              (NC1) and per-domain (NC2) reporting and inclusion thresholds.  NC  thresholds  are
              generally considered to be the score of the highest-scoring known false positive.

       --cut_tc
              Use  the  TC (trusted cutoff) bit score thresholds in the model to set per-sequence
              (TC1) and per-domain (TC2) reporting and inclusion thresholds.  TC  thresholds  are
              generally considered to be the score of the lowest-scoring known true positive that
              is above all known false positives.
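
       Since these options require threshold annotation in the profile, a wrapper can check for
       it first. The sketch below assumes a text (not binary) profile file in which GA
       annotation appears as a line beginning with "GA "; that layout is an assumption to
       verify against your own files. The helper name search_ga is hypothetical, and for a
       multi-profile file the check only shows that at least one profile carries GA scores.

```shell
# Sketch: use curated gathering thresholds only when the profile carries them.
# search_ga is a hypothetical helper; $1 = hmmfile, $2 = seqdb.
# Assumption: GA annotation is a line starting with "GA " in a text profile file.
search_ga() {
    if grep -q '^GA ' "$1"; then
        hmmsearch --cut_ga "$1" "$2"
    else
        echo "no GA thresholds found in $1; using default E-value thresholds" >&2
        hmmsearch "$1" "$2"
    fi
}

# Usage (hypothetical file names):
#   search_ga family.hmm targets.fasta
```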

OPTIONS CONTROLLING THE ACCELERATION PIPELINE

       HMMER3 searches are accelerated in a three-step filter pipeline: the MSV filter, the
       Viterbi filter, and the Forward filter. The first filter is the fastest and most
       approximate; the last is the full Forward scoring algorithm. There is also a bias filter
       step between MSV and Viterbi. Targets that pass all the steps in the acceleration pipeline
       are then subjected to postprocessing -- domain identification and scoring using the
       Forward/Backward algorithm.

       Changing filter thresholds only removes or includes targets from consideration; changing
       filter thresholds does not alter bit scores, E-values, or alignments, all of which are
       determined solely in postprocessing.

       --max  Turn off all filters, including the bias filter, and run full Forward/Backward
              postprocessing on every target. This increases sensitivity somewhat, at a large
              cost in speed.

       --F1 <x>
              Set the P-value threshold for the MSV filter step. The default is 0.02, meaning
              that roughly 2% of the highest scoring nonhomologous targets are expected to pass
              the filter.

       --F2 <x>
              Set the P-value threshold for the Viterbi filter step. The default is 0.001.

       --F3 <x>
              Set the P-value threshold for the Forward filter step. The default is 1e-5.

       --nobias
              Turn off the bias filter. This increases sensitivity somewhat, but can come at a
              high cost in speed, especially if the query has biased residue composition (such as
              a repetitive sequence region, or if it is a membrane protein with large regions of
              hydrophobicity). Without the bias filter, too many sequences may pass the filter
              with biased queries, leading to slower than expected performance as the
              computationally intensive Forward/Backward algorithms shoulder an abnormally heavy
              load.
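
       To get a feel for what the filter thresholds mean, the sketch below estimates how many
       nonhomologous targets would be expected to pass each step, taken as the number of
       targets times the step's P-value threshold. This is a rough back-of-the-envelope
       estimate that ignores the bias filter; the helper name filter_load and the database
       size of one million targets are hypothetical.

```shell
# Sketch: expected number of nonhomologous targets passing each filter step,
# estimated as (number of targets) x (P-value threshold).
# filter_load is a hypothetical helper.
#   $1 = number of nonhomologous targets
#   $2, $3, $4 = F1, F2, F3 thresholds (default to the documented defaults)
filter_load() {
    awk -v n="$1" -v f1="${2:-0.02}" -v f2="${3:-0.001}" -v f3="${4:-1e-5}" 'BEGIN {
        printf "MSV %.0f\n",     n * f1
        printf "Viterbi %.0f\n", n * f2
        printf "Forward %.0f\n", n * f3
    }'
}

# Hypothetical database of one million nonhomologous targets, default thresholds.
filter_load 1000000
```

       With the default thresholds, this estimate gives about 20000 targets past MSV, 1000
       past Viterbi, and 10 past Forward, which is why the expensive postprocessing runs on
       only a small fraction of the database.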

OTHER OPTIONS

       --nonull2
              Turn off the null2 score corrections for biased composition.

       -Z <x> Assert  that  the total number of targets in your searches is <x>, for the purposes
              of per-sequence E-value calculations, rather than  the  actual  number  of  targets
              seen.

       --domZ <x>
              Assert  that  the total number of targets in your searches is <x>, for the purposes
              of per-domain conditional E-value calculations, rather than the number  of  targets
              that passed the reporting thresholds.
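
       The effect of -Z can be illustrated with the usual scaling of an E-value: the P-value of
       a hit multiplied by the number of targets searched. The sketch below applies that
       relationship to a hypothetical hit; the helper name evalue and the numbers are
       illustrative only.

```shell
# Sketch: E-value = P-value x Z, where Z is the asserted number of targets.
# evalue is a hypothetical helper; $1 = P-value of a hit, $2 = Z.
evalue() {
    awk -v p="$1" -v z="$2" 'BEGIN { printf "%g\n", p * z }'
}

# The same hypothetical hit (P-value 1e-6) under two different values of Z.
evalue 1e-6 1000000
evalue 1e-6 100000000
```

       The same hit becomes 100 times less significant when Z is 100 times larger, which is why
       asserting a fixed Z may help keep E-values comparable across searches of differently
       sized databases.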

       --seed <n>
              Set  the  random  number  seed  to <n>.  Some steps in postprocessing require Monte
              Carlo simulation.  The default is to use a fixed seed (42),  so  that  results  are
              exactly  reproducible.  Any  other  positive  integer will give different (but also
              reproducible) results. A choice of 0 uses a randomly chosen seed.

       --tformat <s>
              Assert that the target sequence file seqdb is in format <s>, bypassing format
              autodetection.  Common choices for <s> include: fasta, embl, genbank.  Alignment
              formats also work; common choices include: stockholm, a2m, afa, psiblast, clustal,
              phylip.  For more information, and for codes for some less common formats, see the
              main documentation.  The string <s> is case-insensitive (fasta or FASTA both work).

       --cpu <n>
              Set the number of parallel worker threads  to  <n>.   On  multicore  machines,  the
              default is 2.  You can also control this number by setting an environment variable,
              HMMER_NCPU.  There is also a master thread, so the actual number  of  threads  that
              HMMER spawns is <n>+1.

              This  option  is  not  available  if  HMMER was compiled with POSIX threads support
              turned off.
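
       The two ways of setting the worker thread count can be sketched as follows. The helper
       names and the value 4 are hypothetical. With four workers plus the master thread, HMMER
       would spawn five threads in total.

```shell
# Sketch: two ways to request 4 worker threads.
# Both helpers are hypothetical; $1 = hmmfile, $2 = seqdb.

# Via the command-line option.
search_cpu_flag() {
    hmmsearch --cpu 4 "$1" "$2"
}

# Via the environment variable, set for this one invocation only.
search_cpu_env() {
    HMMER_NCPU=4 hmmsearch "$1" "$2"
}

# Usage (hypothetical file names):
#   search_cpu_flag query.hmm targets.fasta
#   search_cpu_env  query.hmm targets.fasta
```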

       --stall
              For debugging the MPI master/worker version:  pause  after  start,  to  enable  the
              developer  to  attach debuggers to the running master and worker(s) processes. Send
              SIGCONT signal to release the pause.   (Under  gdb:  (gdb)  signal  SIGCONT)  (Only
              available if optional MPI support was enabled at compile-time.)

       --mpi  Run  under  MPI  control  with  master/worker  parallelization  (using  mpirun, for
              example, or equivalent). Only available if optional  MPI  support  was  enabled  at
              compile-time.

COPYRIGHT

       Copyright (C) 2023 Howard Hughes Medical Institute.
       Freely distributed under the BSD open source license.

       For  additional  information  on copyright and licensing, see the file called COPYRIGHT in
       your HMMER source distribution, or see the HMMER web page (http://hmmer.org/).

AUTHOR

       http://eddylab.org