NAME

       hmmbuild - construct profiles from multiple sequence alignments

SYNOPSIS

       hmmbuild [options] hmmfile msafile

DESCRIPTION

       For each multiple sequence alignment in msafile, build a profile HMM and save it to a new
       file hmmfile.

       msafile may be '-' (dash), which means reading this input from stdin rather than a file.

       hmmfile may not be '-' (stdout), because sending the HMM file  to  stdout  would  conflict
       with the other text output of the program.
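
       As a minimal sketch of the calling convention above, the following Python helper
       assembles an hmmbuild argument list and rejects '-' as hmmfile. The helper name
       hmmbuild_argv and the file names are hypothetical; nothing is executed.

```python
from typing import List, Optional, Sequence


def hmmbuild_argv(hmmfile: str, msafile: str,
                  options: Optional[Sequence[str]] = None) -> List[str]:
    """Assemble 'hmmbuild [options] hmmfile msafile' as an argument list."""
    if hmmfile == "-":
        # The HMM file cannot go to stdout: it would mix with the summary output.
        raise ValueError("hmmfile may not be '-'")
    # msafile may be '-', meaning the alignment is read from stdin.
    return ["hmmbuild", *(options or []), hmmfile, msafile]


# Hypothetical file names, for illustration only.
argv = hmmbuild_argv("family.hmm", "family.sto", ["--amino", "-o", "build.log"])
print(" ".join(argv))
```

       The resulting list could be handed to a process launcher such as subprocess.run on a
       machine where hmmbuild is installed.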

OPTIONS

       -h     Help; print a brief reminder of command line usage and all available options.

       -n <s> Name the new profile <s>. The default is to use the name of the alignment (if one
              is present in the msafile) or, failing that, the name of the hmmfile. If msafile
              contains more than one alignment, -n cannot be used, and every alignment must have
              a name annotated in the msafile (as in Stockholm #=GF ID annotation).

       -o <f> Direct the summary output to file <f>, rather than to stdout.

       -O <f> After each  model  is  constructed,  resave  annotated,  possibly  modified  source
              alignments  to a file <f> in Stockholm format.  The alignments are annotated with a
              reference annotation line indicating which columns were assigned as consensus,  and
              sequences  are  annotated  with  what relative sequence weights were assigned. Some
              residues of the alignment may have been shifted to accommodate restrictions of  the
              Plan7  profile  architecture, which disallows transitions between insert and delete
              states.

OPTIONS FOR SPECIFYING THE ALPHABET

       --amino
              Assert that sequences in msafile are protein, bypassing alphabet autodetection.

       --dna  Assert that sequences in msafile are DNA, bypassing alphabet autodetection.

       --rna  Assert that sequences in msafile are RNA, bypassing alphabet autodetection.

OPTIONS CONTROLLING PROFILE CONSTRUCTION

       These options control how consensus columns are defined in an alignment.

       --fast Define consensus columns as those that have a fraction >= symfrac of residues as
              opposed to gaps. (See below for the --symfrac option.) This is the default.

       --hand Define the consensus columns of the profile using the reference annotation line of
              the multiple alignment. This allows you to define any consensus columns you like.

       --symfrac <x>
              Define the residue fraction threshold necessary to define a consensus column when
              using the --fast option. The default is 0.5. The symbol fraction in each column is
              calculated after taking relative sequence weighting into account, and ignoring gap
              characters corresponding to ends of sequence fragments (as opposed to internal
              insertions/deletions). Setting this to 0.0 means that every alignment column will
              be assigned as consensus, which may be useful in some cases. Setting it to 1.0
              means that only columns that include 0 gaps (internal insertions/deletions) will be
              assigned as consensus.
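
       The --fast rule can be sketched in a few lines of Python. This is a simplified
       illustration, not HMMER's implementation: it treats '-' and '.' as gaps, takes the
       relative weights as given, and does not apply the special handling of terminal gaps in
       sequence fragments. The function name consensus_columns is hypothetical.

```python
def consensus_columns(alignment, weights, symfrac=0.5, gap_chars="-."):
    """Return indices of columns whose weighted residue fraction is >= symfrac."""
    ncols = len(alignment[0])
    chosen = []
    for col in range(ncols):
        residue_weight = 0.0
        total_weight = 0.0
        for seq, w in zip(alignment, weights):
            total_weight += w
            if seq[col] not in gap_chars:
                residue_weight += w
        if total_weight > 0 and residue_weight / total_weight >= symfrac:
            chosen.append(col)
    return chosen


msa = ["AC-GT",
       "AC-G-",
       "A-TGT",
       "AC-G-"]
uniform = [1.0, 1.0, 1.0, 1.0]

print(consensus_columns(msa, uniform, symfrac=0.5))  # [0, 1, 3, 4]
print(consensus_columns(msa, uniform, symfrac=1.0))  # [0, 3]
print(consensus_columns(msa, uniform, symfrac=0.0))  # [0, 1, 2, 3, 4]

# Upweighting the third sequence pulls column 2 up to the 0.5 threshold.
print(consensus_columns(msa, [0.5, 0.5, 2.0, 1.0], symfrac=0.5))  # [0, 1, 2, 3, 4]
```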

       --fragthresh <x>
              We only want to count terminal gaps as deletions if the aligned sequence is known
              to be full-length, not if it is a fragment (for instance, because only part of it
              was sequenced). HMMER uses a simple rule to infer fragments: if the range of a
              sequence in the alignment (the number of alignment columns between the first and
              last positions of the sequence) is less than or equal to a fraction <x> times the
              alignment length in columns, then the sequence is handled as a fragment. The
              default is 0.5. Setting --fragthresh 0 will define no (nonempty) sequence as a
              fragment; you might want to do this if you know you've got a carefully curated
              alignment of full-length sequences. Setting --fragthresh 1 will define all
              sequences as fragments; you might want to do this if you know your alignment is
              entirely composed of fragments, such as translated short reads in metagenomic
              shotgun data.
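
       The fragment rule described for --fragthresh can be written out as a short Python
       sketch. It assumes '-' and '.' are the gap characters, counts the range inclusively
       from the first to the last residue, and gives an empty sequence a range of 0; the
       function name is_fragment is hypothetical.

```python
def is_fragment(aligned_seq, fragthresh=0.5, gap_chars="-."):
    """Apply the rule: range of the sequence <= fragthresh * alignment length."""
    alen = len(aligned_seq)
    positions = [i for i, ch in enumerate(aligned_seq) if ch not in gap_chars]
    # Assumption: an empty sequence has range 0.
    span = positions[-1] - positions[0] + 1 if positions else 0
    return span <= fragthresh * alen


short_read = "----ACGT----"   # range 4 of 12 columns
full_length = "AC------GT--"  # range 10 of 12 columns

print(is_fragment(short_read))                   # True
print(is_fragment(full_length))                  # False
print(is_fragment(short_read, fragthresh=0.0))   # False
print(is_fragment(full_length, fragthresh=1.0))  # True
```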

OPTIONS CONTROLLING RELATIVE WEIGHTS

       HMMER uses an ad hoc sequence weighting algorithm to downweight closely related  sequences
       and  upweight  distantly related ones. This has the effect of making models less biased by
       uneven phylogenetic representation. For example, two identical sequences  would  typically
       each  receive  half  the  weight  that  one  sequence  would.  These options control which
       algorithm gets used.

       --wpb  Use the Henikoff position-based sequence weighting scheme [Henikoff  and  Henikoff,
              J. Mol. Biol. 243:574, 1994].  This is the default.

       --wgsc Use the Gerstein/Sonnhammer/Chothia weighting algorithm [Gerstein et al., J. Mol.
              Biol. 235:1067, 1994].

       --wblosum
              Use the same clustering scheme that was used to weight data in  calculating  BLOSUM
              substitution  matrices  [Henikoff  and  Henikoff,  Proc.  Natl. Acad. Sci 89:10915,
              1992]. Sequences are single-linkage clustered at  an  identity  threshold  (default
              0.62;  see  --wid)  and  within  each  cluster  of  c sequences, each sequence gets
              relative weight 1/c.

       --wnone
              No relative weights. All sequences are assigned uniform weight.

       --wid <x>
              Sets the identity threshold used by single-linkage clustering when using --wblosum.
              Invalid with any other weighting scheme. Default is 0.62.
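
       The --wblosum scheme (single-linkage clustering at an identity threshold, then weight
       1/c within each cluster of c sequences) can be sketched as follows. This is an
       illustration under stated assumptions: pairwise identity is computed over columns where
       both sequences have a residue, a pair is linked when its identity is >= the threshold,
       and no further normalization is applied. HMMER's own identity definition may differ,
       and the function names are hypothetical.

```python
def pairwise_identity(a, b, gap_chars="-."):
    """Fraction of identical residues over columns where both have a residue (assumed)."""
    both = [(x, y) for x, y in zip(a, b)
            if x not in gap_chars and y not in gap_chars]
    if not both:
        return 0.0
    return sum(x == y for x, y in both) / len(both)


def blosum_weights(alignment, wid=0.62):
    """Single-linkage cluster at identity >= wid; each member gets weight 1/c."""
    n = len(alignment)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if pairwise_identity(alignment[i], alignment[j]) >= wid:
                parent[find(i)] = find(j)

    sizes = {}
    for i in range(n):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return [1.0 / sizes[find(i)] for i in range(n)]


msa = ["ACDEFGHIKL",
       "ACDEFGHIKL",
       "ACDEFGHIKV",
       "WWWWWYYYYY"]
print(blosum_weights(msa))               # three sequences share one cluster
print(blosum_weights(["ACGT", "ACGT"]))  # [0.5, 0.5]
```

       The second call reproduces the example given at the start of this section: two
       identical sequences each receive half the weight that one sequence would.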

OPTIONS CONTROLLING EFFECTIVE SEQUENCE NUMBER

       After relative weights are determined, they are normalized to sum to a total effective
       sequence number, eff_nseq. This number may be the actual number of sequences in the
       alignment, but it is almost always smaller than that. The default entropy weighting
       method (--eent) reduces the effective sequence number to reduce the information content
       (relative entropy, or average expected score on true homologs) per consensus position. The
       target relative entropy is controlled by a two-parameter function, where the two
       parameters are settable with --ere and --esigma.
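
       The normalization step described above is a simple rescaling. The sketch below shows
       only that step, with eff_nseq supplied by the caller; how HMMER chooses eff_nseq under
       --eent is not reproduced here. The function name normalize_to_eff_nseq is hypothetical.

```python
def normalize_to_eff_nseq(relative_weights, eff_nseq):
    """Rescale relative weights so that they sum to eff_nseq."""
    total = sum(relative_weights)
    if total <= 0:
        raise ValueError("relative weights must have a positive sum")
    scale = eff_nseq / total
    return [w * scale for w in relative_weights]


# Four sequences whose weights are rescaled to an effective number of 2.0.
weights = normalize_to_eff_nseq([0.5, 0.5, 1.0, 2.0], eff_nseq=2.0)
print(weights)       # [0.25, 0.25, 0.5, 1.0]
print(sum(weights))  # 2.0
```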

       --eent Adjust effective sequence number to achieve a specific relative entropy per
              position (see --ere). This is the default.

       --eclust
              Set effective sequence number to the number of single-linkage clusters at a
              specific identity threshold (see --eid). This option is not recommended; it's for
              experiments evaluating how much better --eent is.

       --enone
              Turn off effective sequence number determination and just use the actual number of
              sequences. One reason you might want to do this is to try to maximize the relative
              entropy/position of your model, which may be useful for short models.

       --eset <x>
              Explicitly set the effective sequence number for all models to <x>.

       --ere <x>
              Set the minimum relative entropy/position target to <x>. Requires --eent. Default
              depends on the sequence alphabet. For protein sequences, it is 0.59 bits/position;
              for nucleotide sequences, it is 0.45 bits/position.

       --esigma <x>
              Sets the minimum relative entropy contributed by an entire model alignment, over
              its whole length. This has the effect of making short models have higher relative
              entropy per position than --ere alone would give. The default is 45.0 bits.

       --eid <x>
              Sets the fractional pairwise identity cutoff used by single linkage clustering with
              the --eclust option. The default is 0.62.

OPTIONS CONTROLLING PRIORS

       By  default,  weighted  counts  are  converted  to  mean  posterior  probability parameter
       estimates using mixture Dirichlet priors.  Default mixture Dirichlet prior parameters  for
       protein  models  and  for  nucleic  acid  (RNA and DNA) models are built in. The following
       options allow you to override the default priors.

       --pnone
              Don't  use  any  priors.  Probability  parameters  will  simply  be  the   observed
              frequencies, after relative sequence weighting.

       --plaplace
              Use a Laplace +1 prior in place of the default mixture Dirichlet prior.
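
       The difference between --pnone and --plaplace can be illustrated for a single
       probability distribution. The sketch assumes the usual meaning of a Laplace +1 prior,
       namely adding one pseudocount to every symbol before normalizing; the function names
       are hypothetical and the counts are made up for illustration.

```python
def observed_frequencies(counts):
    """--pnone style estimate: weighted counts divided by their sum."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must have a positive sum")
    return [c / total for c in counts]


def laplace_estimate(counts):
    """--plaplace style estimate: add one pseudocount per symbol, then normalize."""
    total = sum(counts) + len(counts)
    return [(c + 1.0) / total for c in counts]


# Made-up weighted counts for a 4-letter alphabet in one column.
counts = [3.0, 0.0, 1.0, 0.0]
print(observed_frequencies(counts))  # [0.75, 0.0, 0.25, 0.0]
print(laplace_estimate(counts))      # [0.5, 0.125, 0.25, 0.125]
```

       Without a prior, unobserved symbols get probability zero; with the +1 prior every
       symbol keeps a nonzero probability.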

OPTIONS CONTROLLING SINGLE SEQUENCE SCORING

       By default, if a query is a single sequence from a file in FASTA format, hmmbuild
       constructs a search model from that sequence and a standard 20x20 substitution matrix for
       residue probabilities, along with two additional parameters for position-independent gap
       open and gap extend probabilities. These options allow the default single-sequence scoring
       parameters to be changed, and for single-sequence scoring options to be applied to a
       single sequence coming from an aligned format.

       --singlemx
              If a single sequence query comes from a multiple sequence alignment file, such as
              one in Stockholm format, the search model is by default constructed as is typically
              done for multiple sequence alignments. This option forces hmmbuild to use the
              single-sequence method with a substitution score matrix.

       --mx <s>
              Obtain residue alignment probabilities from the built-in substitution matrix named
              <s>. Several standard matrices are built-in, and do not need to be read from
              files. The matrix name <s> can be PAM30, PAM70, PAM120, PAM240, BLOSUM45,
              BLOSUM50, BLOSUM62, BLOSUM80, BLOSUM90, or DNA1. Only one of the --mx and --mxfile
              options may be used.

       --mxfile <mxfile>
              Obtain residue alignment probabilities from the substitution matrix in file
              <mxfile>. The default score matrix is BLOSUM62 for protein sequences, and DNA1 for
              nucleotide sequences (these matrices are internal to HMMER and do not need to be
              available as a file). The format of a substitution matrix <mxfile> is the standard
              format accepted by BLAST, FASTA, and other sequence analysis software. See
              ftp.ncbi.nlm.nih.gov/blast/matrices/ for example files. (The only exception: we
              require matrices to be square, so for DNA, use files like NCBI's NUC.4.4, not
              NUC.4.2.)

       --popen <x>
              Set the gap open probability for a single sequence query model to <x>. The default
              is 0.02. <x> must be >= 0 and < 0.5.

       --pextend <x>
              Set the gap extend probability for a single sequence query model to <x>. The
              default is 0.4. <x> must be >= 0 and < 1.0.

OPTIONS CONTROLLING E-VALUE CALIBRATION

       The location parameters for the  expected  score  distributions  for  MSV  filter  scores,
       Viterbi filter scores, and Forward scores require three short random sequence simulations.

       --EmL <n>
              Sets the sequence length in simulation that estimates the location parameter mu for
              MSV filter E-values. Default is 200.

       --EmN <n>
              Sets the number of sequences in simulation that estimates the location parameter mu
              for MSV filter E-values. Default is 200.

       --EvL <n>
              Sets the sequence length in simulation that estimates the location parameter mu for
              Viterbi filter E-values. Default is 200.

       --EvN <n>
              Sets the number of sequences in simulation that estimates the location parameter mu
              for Viterbi filter E-values. Default is 200.

       --EfL <n>
              Sets  the  sequence  length in simulation that estimates the location parameter tau
              for Forward E-values. Default is 100.

       --EfN <n>
              Sets the number of sequences in simulation that estimates  the  location  parameter
              tau for Forward E-values. Default is 200.

       --Eft <x>
              Sets the tail mass fraction to fit in the simulation that estimates the location
              parameter tau for Forward E-values. Default is 0.04.

OTHER OPTIONS

       --cpu <n>
              Set the number of parallel worker threads  to  <n>.   On  multicore  machines,  the
              default is 2.  You can also control this number by setting an environment variable,
              HMMER_NCPU.  There is also a master thread, so the actual number  of  threads  that
              HMMER spawns is <n>+1.

              This  option  is  not  available  if  HMMER was compiled with POSIX threads support
              turned off.

       --informat <s>
              Assert that input msafile is in alignment format <s>, bypassing format
              autodetection. Common choices for <s> include: stockholm, a2m, afa, psiblast,
              clustal, phylip. For more information, and for codes for some less common formats,
              see the main HMMER documentation. The string <s> is case-insensitive (a2m or A2M
              both work).

       --seed <n>
              Seed the random number generator with <n>, an integer >= 0.  If <n> is nonzero, any
              stochastic simulations will be reproducible; the same command will  give  the  same
              results.   If  <n>  is  0,  the  random number generator is seeded arbitrarily, and
              stochastic simulations will vary from run to run of the same command.  The  default
              seed is 42.
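
       The seeding behavior described for --seed can be mimicked with Python's standard random
       module. This only illustrates the semantics (a nonzero seed is reproducible, 0 means
       arbitrary seeding); it uses Python's generator rather than HMMER's, and the function
       names are hypothetical.

```python
import random


def make_rng(seed=42):
    """Return a generator seeded the way --seed is described (default 42)."""
    if seed < 0:
        raise ValueError("seed must be an integer >= 0")
    if seed == 0:
        # Arbitrary seeding: Python draws the seed from an OS entropy source or the clock.
        return random.Random()
    return random.Random(seed)


def simulate(seed, n=5):
    """Stand-in for a stochastic simulation: draw n random numbers."""
    rng = make_rng(seed)
    return [rng.random() for _ in range(n)]


# A nonzero seed gives the same draws on every run of the same command.
print(simulate(42) == simulate(42))  # True
```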

       --w_beta <x>
              Window length tail mass.  The upper bound, W, on the length at which nhmmer expects
              to find an instance of the model is set such that the  fraction  of  all  sequences
              generated by the model with length >= W is less than <x>.  The default is 1e-7.

       --w_length <n>
              Override the model instance length upper bound, W, which is otherwise controlled by
              --w_beta.  It should be larger than the model length. The value of W is  used  deep
              in the acceleration pipeline, and modest changes are not expected to impact results
              (though larger values of W do lead to longer run time).

       --mpi  Run as a parallel MPI program. Each alignment is assigned to an MPI worker node for
              construction.  (Therefore,  the maximum parallelization cannot exceed the number of
              alignments in the input msafile.)  This  is  useful  when  building  large  profile
              libraries.  This option is only available if optional MPI capability was enabled at
              compile-time.

       --stall
              For debugging MPI  parallelization:  arrest  program  execution  immediately  after
              start,  and  wait  for  a debugger to attach to the running process and release the
              arrest.

       --maxinsertlen <n>
              Restrict insert length parameterization such that the  expected  insert  length  at
              each position of the model is no more than <n>.

COPYRIGHT

       Copyright (C) 2020 Howard Hughes Medical Institute.
       Freely distributed under the BSD open source license.

       For  additional  information  on copyright and licensing, see the file called COPYRIGHT in
       your HMMER source distribution, or see the HMMER web page (http://hmmer.org/).

AUTHOR

       http://eddylab.org