Ubuntu Manpage: hmm2build - build a profile HMM from an alignment

NAME

       hmm2build - build a profile HMM from an alignment

SYNOPSIS

       hmm2build [options] hmmfile alignfile

DESCRIPTION

       hmm2build  reads  a multiple sequence alignment file alignfile, builds a new profile HMM,
       and saves the HMM in hmmfile.

       alignfile may be in ClustalW, GCG  MSF,  SELEX,  Stockholm,  or  aligned  FASTA  alignment
       format. The format is automatically detected.

       By  default,  the model is configured to find one or more nonoverlapping alignments to the
       complete model: multiple global alignments with respect  to  the  model,  and  local  with
       respect  to the sequence.  This is analogous to the behavior of the hmmls program of HMMER
       1.  To configure the model for multiple local alignments with respect  to  the  model  and
       local  with  respect  to  the  sequence, a la the old program hmmfs, use the -f (fragment)
       option. More rarely, you may want to configure the model for a single global alignment
       (global with respect to both model and sequence) using the -g option, or for a single
       local/local alignment (a la standard Smith/Waterman, or the old hmmsw program) using
       the -s option.
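
       The four configurations described above can be summarized as a small lookup table.
       This is only an illustrative sketch: the names AlignmentMode, MODES, and describe_mode
       are hypothetical and are not part of HMMER.

```python
from typing import NamedTuple, Optional


class AlignmentMode(NamedTuple):
    hits_per_sequence: str  # "multiple" or "single"
    wrt_model: str          # "global" or "local"
    wrt_sequence: str       # "global" or "local"
    hmmer1_analogue: str    # old HMMER 1 program named in this page


# Keyed by the hmm2build option; None stands for the default (no mode option).
MODES = {
    None: AlignmentMode("multiple", "global", "local", "hmmls"),
    "-f": AlignmentMode("multiple", "local", "local", "hmmfs"),
    "-g": AlignmentMode("single", "global", "global", "hmms"),
    "-s": AlignmentMode("single", "local", "local", "hmmsw"),
}


def describe_mode(flag: Optional[str] = None) -> AlignmentMode:
    """Return the alignment configuration selected by a mode option."""
    return MODES[flag]
```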

OPTIONS

-f Configure the model for finding multiple domains per sequence, where each domain
can be a local (fragmentary) alignment. This is analogous to the old hmmfs program
of HMMER 1.

-g Configure the model for finding a single global alignment to a target sequence,
analogous to the old hmms program of HMMER 1.

-h Print brief help; includes version number and summary of all options, including
expert options.

-n <s> Name this HMM <s>. <s> can be any string of non-whitespace characters (i.e. one
"word"). There is no length limit (at least not one imposed by HMMER; your shell
will complain about command line lengths first).

-o <f> Re-save the starting alignment to <f>, in Stockholm format. The columns which were
assigned to match states will be marked with x's in an #=RF annotation line. If
either the --hand or --fast construction options were chosen, the alignment may
have been slightly altered to be compatible with Plan 7 transitions, so saving the
final alignment and comparing to the starting alignment can let you view these
alterations. See the User's Guide for more information on this arcane side effect.

-s Configure the model for finding a single local alignment per target sequence. This
is analogous to the standard Smith/Waterman algorithm or the hmmsw program of HMMER
1.

-A Append this model to an existing hmmfile rather than creating hmmfile. Useful for
building HMM libraries (like Pfam).

-F Force overwriting of an existing hmmfile. Otherwise HMMER will refuse to clobber
your existing HMM files, for safety's sake.
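
For scripted use, the options above can be assembled into an argument list. The helper
below is a hypothetical convenience wrapper, not part of HMMER, and the file names in
the usage note are made up. It only builds the command line; it does not run it.

```python
from typing import List, Optional


def hmm2build_argv(hmmfile: str, alignfile: str, name: Optional[str] = None,
                   mode: Optional[str] = None, append: bool = False,
                   force: bool = False) -> List[str]:
    """Build an hmm2build command line from the options documented above."""
    if mode not in (None, "-f", "-g", "-s"):
        raise ValueError("mode must be None, '-f', '-g', or '-s'")
    argv = ["hmm2build"]
    if mode is not None:
        argv.append(mode)
    if name is not None:
        # -n accepts any string of non-whitespace characters.
        if not name or any(ch.isspace() for ch in name):
            raise ValueError("name must be one non-empty word")
        argv += ["-n", name]
    if append:
        argv.append("-A")
    if force:
        argv.append("-F")
    # Synopsis order: options first, then hmmfile, then alignfile.
    argv += [hmmfile, alignfile]
    return argv
```

For example, hmm2build_argv("globin.hmm", "globins.sto", name="globin", force=True) gives
a list that could be passed to subprocess.run() on a system where hmm2build is installed.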

EXPERT OPTIONS

--amino
Force the sequence alignment to be interpreted as amino acid sequences. Normally
HMMER autodetects whether the alignment is protein or DNA, but sometimes alignments
are so small that autodetection is ambiguous. See --nucleic.

--archpri <x>
Set the "architecture prior" used by MAP architecture construction to <x>, where
<x> is a probability between 0 and 1. This parameter governs a geometric prior
distribution over model lengths. As <x> increases, longer models are favored a
priori. As <x> decreases, it takes more residue conservation in a column to make a
column a "consensus" match column in the model architecture. The 0.85 default has
been chosen empirically as a reasonable setting.

--binary
Write the HMM to hmmfile in HMMER binary format instead of readable ASCII text.

--cfile <f>
Save the observed emission and transition counts to <f> after the architecture has
been determined (i.e. after residues/gaps have been assigned to match, delete, and
insert states). This option is used in HMMER development for generating data files
useful for training new Dirichlet priors. The format of count files is documented
in the User's Guide.

--fast Quickly and heuristically determine the architecture of the model by assigning all
columns with more than a certain fraction of gap characters to insert states. By
default this fraction is 0.5, and it can be changed using the --gapmax option. The
default construction algorithm is a maximum a posteriori (MAP) algorithm, which is
slower.

--gapmax <x>
Controls the --fast model construction algorithm; it has no effect unless --fast is
also in use. If a column has more than a fraction <x> of gap symbols in it, it
gets assigned to an insert column. <x> is a frequency from 0 to 1, and by default
is set to 0.5. Higher values of <x> mean more columns get assigned to consensus,
and models get longer; smaller values of <x> mean fewer columns get assigned to
consensus, and models get shorter.
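
The column rule used by --fast and --gapmax can be sketched as follows. This is not
HMMER's source code: the function name, the set of gap characters, and the use of
unweighted sequence counts are assumptions made for illustration.

```python
from typing import Sequence

GAP_CHARS = set("-._~")  # assumed gap symbols; HMMER's own set may differ


def fast_architecture(alignment: Sequence[str], gapmax: float = 0.5) -> str:
    """Label each column 'M' (match/consensus) or 'I' (insert)."""
    nseq = len(alignment)
    ncols = len(alignment[0])
    if any(len(row) != ncols for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    labels = []
    for col in range(ncols):
        gaps = sum(1 for row in alignment if row[col] in GAP_CHARS)
        # More than a fraction gapmax of gap symbols -> insert column.
        labels.append("I" if gaps / nseq > gapmax else "M")
    return "".join(labels)
```

With the four aligned sequences AC-T, A--T, AG-T, and A-GT, the default gapmax of 0.5
keeps the second column (exactly half gaps) as a match column, while gapmax=0.4 turns
it into an insert column and the model gets shorter.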

--hand Specify the architecture of the model by hand: the alignment file must be in SELEX
or Stockholm format, and the reference annotation line (#=RF in SELEX, #=GC RF in
Stockholm) is used to specify the architecture. Any column marked with a non-gap
symbol (such as an 'x', for instance) is assigned as a consensus (match) column in
the model.
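
As a sketch of how a reference annotation line maps to an architecture under --hand, the
hypothetical helper below returns the indices of the columns marked with a non-gap
symbol. The set of characters treated as gaps is an assumption.

```python
from typing import List


def hand_architecture(rf_line: str, gap_chars: str = "-._~") -> List[int]:
    """Return 0-based indices of columns marked as consensus (match) columns."""
    return [i for i, ch in enumerate(rf_line) if ch not in gap_chars]
```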

--idlevel <x>
Controls both the determination of effective sequence number and the behavior of
the --wblosum weighting option. The sequence alignment is clustered by percent
identity, and the number of clusters at a cutoff threshold of <x> is used to
determine the effective sequence number. Higher values of <x> give more clusters
and higher effective sequence numbers; lower values of <x> give fewer clusters and
lower effective sequence numbers. <x> is a fraction from 0 to 1, and by default is
set to 0.62 (corresponding to the clustering level used in constructing the
BLOSUM62 substitution matrix).
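
The relationship between <x> and the number of clusters can be illustrated with a small
sketch. It assumes single-linkage clustering, a ">=" comparison against the threshold,
and a pairwise identity computed over columns where both sequences have a residue. These
details and all function names are assumptions; the actual HMMER procedure may differ.

```python
from typing import Sequence


def pairwise_identity(a: str, b: str, gap_chars: str = "-.") -> float:
    """Fraction of identical residues over columns where neither has a gap."""
    aligned = same = 0
    for x, y in zip(a, b):
        if x in gap_chars or y in gap_chars:
            continue
        aligned += 1
        if x.upper() == y.upper():
            same += 1
    return same / aligned if aligned else 0.0


def cluster_count(alignment: Sequence[str], idlevel: float = 0.62) -> int:
    """Number of single-linkage clusters at identity threshold idlevel."""
    n = len(alignment)
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if pairwise_identity(alignment[i], alignment[j]) >= idlevel:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})
```

In this sketch, two sequences that are 90% identical fall into one cluster at the
default 0.62 but into separate clusters at 0.95, which matches the statement that
higher values of <x> give more clusters.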

--informat <s>
Assert that the input alignfile is in format <s>; do not run Babelfish format
autodetection. This increases the reliability of the program somewhat, because the
Babelfish can make mistakes; particularly recommended for unattended,
high-throughput runs of HMMER. Valid format strings include FASTA, GENBANK, EMBL, GCG,
PIR, STOCKHOLM, SELEX, MSF, CLUSTAL, and PHYLIP. See the User's Guide for a
complete list.

--noeff
Turn off the effective sequence number calculation, and use the true number of
sequences instead. This will usually reduce the sensitivity of the final model (so
don't do it without good reason!).

--nucleic
Force the alignment to be interpreted as nucleic acid sequence, either RNA or DNA.
Normally HMMER autodetects whether the alignment is protein or DNA, but sometimes
alignments are so small that autodetection is ambiguous. See --amino.

--null <f>
Read a null model from <f>. The default for protein is to use average amino acid
frequencies from Swissprot 34 and p1 = 350/351; for nucleic acid, the default is to
use 0.25 for each base and p1 = 1000/1001. For documentation of the format of the
null model file and further explanation of how the null model is used, see the
User's Guide.
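
To give a feel for the default p1 values, the sketch below assumes that p1 is the
self-transition probability of a null model whose sequence length L follows the
geometric distribution P(L) = p1^L (1 - p1). That interpretation is an assumption made
here; the User's Guide is the authority on how the null model is defined.

```python
from fractions import Fraction


def mean_null_length(p1: Fraction) -> Fraction:
    """Mean of the assumed length distribution P(L) = p1**L * (1 - p1), L >= 0."""
    return p1 / (1 - p1)
```

Under that assumption, the protein default corresponds to a mean length of 350 residues
and the nucleic acid default to a mean length of 1000.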

--pam <f>
Apply a heuristic PAM- (substitution matrix-) based prior on match emission
probabilities instead of the default mixture Dirichlet. The substitution matrix is
read from <f>. See --pamwgt.

The default Dirichlet state transition prior and insert emission prior are
unaffected. Therefore, in principle, you could combine --prior with --pam, but this
isn't recommended, as it hasn't been tested. (--pam itself hasn't been tested
much!)

--pamwgt <x>
Controls the weight on a PAM-based prior. Only has an effect if the --pam option is
also in use. <x> is a positive real number, 20.0 by default. <x> is the number of
"pseudocounts" contributed by the heuristic prior. Very high values of <x> can
force a scoring system that is entirely driven by the substitution matrix, making
HMMER somewhat approximate Gribskov profiles.

--pbswitch <n>
For alignments with a very large number of sequences, the GSC, BLOSUM, and Voronoi
weighting schemes are slow; they're O(N^2) for N sequences. Henikoff position-based
weights (PB weights) are more efficient. At or above a certain threshold sequence
number <n>, hmm2build will switch from GSC, BLOSUM, or Voronoi weights to PB
weights. To disable this switching behavior (at the cost of compute time), set <n>
to be something larger than the number of sequences in your alignment. <n> is a
positive integer; the default is 1000.
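
The switching rule can be written out as a short sketch. The function name and the
lowercase scheme labels are hypothetical; only the three schemes named above are
switched, and any other scheme is passed through unchanged.

```python
def choose_weighting(requested: str, nseq: int, pbswitch: int = 1000) -> str:
    """Return the weighting scheme in effect for an alignment of nseq sequences."""
    slow_schemes = {"gsc", "blosum", "voronoi"}  # the O(N^2) schemes named above
    if requested in slow_schemes and nseq >= pbswitch:
        return "pb"
    return requested
```

Setting pbswitch above the number of sequences, as in
choose_weighting("gsc", 5000, pbswitch=5001), keeps the requested scheme.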

--prior <f>
Read a Dirichlet prior from <f>, replacing the default mixture Dirichlet. The
format of prior files is documented in the User's Guide, and an example is given in
the Demos directory of the HMMER distribution.

--swentry <x>
Controls the total probability that is distributed to local entries into the model,
versus starting at the beginning of the model as in a global alignment. <x> is a
probability from 0 to 1, and by default is set to 0.5. Higher values of <x> mean
that hits that are fragments on their left (N or 5'-terminal) side will be
penalized less, but complete global alignments will be penalized more. Lower
values of <x> mean that fragments on the left will be penalized more, and global
alignments on this side will be favored. This option only affects the
configurations that allow local alignments, i.e. -s and -f; unless one of these
options is also activated, this option has no effect. You have independent control
over local/global alignment behavior for the N/C (5'/3') termini of your target
sequences using --swentry and --swexit.

--swexit <x>
Controls the total probability that is distributed to local exits from the model,
versus ending an alignment at the end of the model as in a global alignment. <x>
is a probability from 0 to 1, and by default is set to 0.5. Higher values of <x>
mean that hits that are fragments on their right (C or 3'-terminal) side will be
penalized less, but complete global alignments will be penalized more. Lower
values of <x> mean that fragments on the right will be penalized more, and global
alignments on this side will be favored. This option only affects the
configurations that allow local alignments, i.e. -s and -f; unless one of these
options is also activated, this option has no effect. You have independent control
over local/global alignment behavior for the N/C (5'/3') termini of your target
sequences using --swentry and --swexit.

--verbose
Print more possibly useful stuff, such as the individual scores for each sequence
in the alignment.

--wblosum
Use the BLOSUM filtering algorithm to weight the sequences, instead of the default.
Cluster the sequences at a given percentage identity (see --idlevel); assign each
cluster a total weight of 1.0, distributed equally amongst the members of that
cluster.
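
Given a cluster assignment for each sequence, the weighting rule described above can be
sketched as follows. The function name and the representation of clusters as a list of
labels are illustrative choices, not HMMER's interface.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def blosum_weights(cluster_ids: Sequence[Hashable]) -> List[float]:
    """Give each cluster a total weight of 1.0, shared equally by its members."""
    sizes = Counter(cluster_ids)
    return [1.0 / sizes[c] for c in cluster_ids]
```

For three sequences where the first two share a cluster, blosum_weights([0, 0, 1])
assigns half a unit to each of the first two and a full unit to the third, so the
weights sum to the number of clusters.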

--wgsc Use the Gerstein/Sonnhammer/Chothia ad hoc sequence weighting algorithm. This is
already the default, so this option has no effect (unless it follows another option
in the --w family, in which case it overrides it).

--wme Use the Krogh/Mitchison maximum entropy algorithm to "weight" the sequences. This
supersedes the Eddy/Mitchison/Durbin maximum discrimination algorithm, which gives
almost identical weights but is less robust. ME weighting seems to give a marginal
increase in sensitivity over the default GSC weights, but takes a fair amount of
time.

--wnone
Turn off all sequence weighting.

--wpb Use the Henikoff position-based weighting scheme.
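
A minimal sketch of Henikoff position-based weighting is shown below: in each column, a
sequence receives 1 / (k * n), where k is the number of distinct residue types in the
column and n is the number of sequences sharing its residue. Skipping gap characters and
rescaling the weights to sum to the number of sequences are assumptions of this sketch,
not statements about HMMER's implementation.

```python
from collections import Counter
from typing import List, Sequence


def position_based_weights(alignment: Sequence[str],
                           gap_chars: str = "-.") -> List[float]:
    """Henikoff-style position-based weights, rescaled to sum to len(alignment)."""
    nseq = len(alignment)
    weights = [0.0] * nseq
    for column in zip(*alignment):
        counts = Counter(ch.upper() for ch in column if ch not in gap_chars)
        if not counts:
            continue  # all-gap column contributes nothing
        ntypes = len(counts)
        for i, ch in enumerate(column):
            if ch not in gap_chars:
                weights[i] += 1.0 / (ntypes * counts[ch.upper()])
    total = sum(weights)
    if total == 0.0:
        return [1.0] * nseq  # no residues at all: fall back to uniform weights
    return [w * nseq / total for w in weights]
```

Each column is visited once and each sequence once per column, so the cost grows
linearly with the number of sequences rather than quadratically.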

--wvoronoi
Use the Sibbald/Argos Voronoi sequence weighting algorithm in place of the default
GSC weighting.

COPYRIGHT

       Copyright (C) 1992-2003 HHMI/Washington University School of Medicine.
       Freely distributed under the GNU General Public License (GPL).
       See the file COPYING in your distribution for details on redistribution conditions.

AUTHOR

       Sean Eddy
       HHMI/Dept. of Genetics
       Washington Univ. School of Medicine
       4566 Scott Ave.
       St Louis, MO 63110 USA
       http://www.genetics.wustl.edu/eddy/