Provided by: hmmer_3.3+dfsg2-1_amd64

NAME

       hmmbuild - construct profiles from multiple sequence alignments

SYNOPSIS

       hmmbuild [options] hmmfile msafile

DESCRIPTION

       For each multiple sequence alignment in msafile, build a profile HMM and save it to a new
       file hmmfile.

       msafile may be '-' (dash), which means reading this input from stdin rather than a file.

       hmmfile may not be '-' (stdout), because sending the HMM file  to  stdout  would  conflict
       with the other text output of the program.

OPTIONS

       -h     Help; print a brief reminder of command line usage and all available options.

       -n <s> Name the new profile <s>.  The default is to use the name of the alignment (if one
              is present in the msafile) or, failing that, the name of the hmmfile.  If msafile
              contains more than one alignment, -n doesn't work, and every alignment must have a
              name annotated in the msafile (as in Stockholm #=GF ID annotation).

       -o <f> Direct the summary output to file <f>, rather than to stdout.

       -O <f> After each  model  is  constructed,  resave  annotated,  possibly  modified  source
              alignments  to a file <f> in Stockholm format.  The alignments are annotated with a
              reference annotation line indicating which columns were assigned as consensus,  and
              sequences  are  annotated  with  what relative sequence weights were assigned. Some
              residues of the alignment may have been shifted to accommodate restrictions of  the
              Plan7  profile  architecture, which disallows transitions between insert and delete
              states.

OPTIONS FOR SPECIFYING THE ALPHABET

       --amino
              Assert that sequences in msafile are protein, bypassing alphabet autodetection.

       --dna  Assert that sequences in msafile are DNA, bypassing alphabet autodetection.

       --rna  Assert that sequences in msafile are RNA, bypassing alphabet autodetection.

OPTIONS CONTROLLING PROFILE CONSTRUCTION

       These options control how consensus columns are defined in an alignment.

       --fast Define consensus columns as those that have a fraction >= symfrac  of  residues  as
              opposed to gaps. (See below for the --symfrac option.) This is the default.

       --hand Define consensus columns in the next profile using the reference annotation of the
              multiple alignment.  This allows you to define any consensus columns you like.

       --symfrac <x>
              Define the residue fraction threshold necessary to define a consensus  column  when
              using  the --fast option. The default is 0.5. The symbol fraction in each column is
              calculated after taking relative sequence weighting into account, and ignoring  gap
              characters  corresponding  to  ends  of  sequence fragments (as opposed to internal
              insertions/deletions).  Setting this to 0.0 means that every alignment column  will
              be  assigned  as  consensus,  which  may be useful in some cases. Setting it to 1.0
              means that only columns that include 0 gaps (internal insertions/deletions) will be
              assigned as consensus.

       --fragthresh <x>
              We  only  want to count terminal gaps as deletions if the aligned sequence is known
              to be full-length, not if it is a fragment (for instance, because only part  of  it
              was  sequenced).  HMMER  uses  a  simple rule to infer fragments: if the range of a
              sequence in the alignment (the number of alignment columns between  the  first  and
              last  positions  of the sequence) is less than or equal to a fraction <x> times the
              alignment length in columns, then the  sequence  is  handled  as  a  fragment.  The
              default  is  0.5.   Setting  --fragthresh 0 will define no (nonempty) sequence as a
              fragment; you might want to do this if you know  you've  got  a  carefully  curated
              alignment  of  full-length  sequences.   Setting  --fragthresh  1  will  define all
              sequences as fragments; you might want to do this if you  know  your  alignment  is
              entirely  composed  of  fragments,  such  as  translated short reads in metagenomic
              shotgun data.
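
       The fragment rule and the --fast consensus rule described above can be sketched in a few
       lines of Python. This is only an illustration under stated assumptions, not HMMER's
       implementation: the function names, the set of gap characters, and the choice to count
       the span inclusively are all hypothetical, and all rows are assumed to have equal length.

```python
GAP_CHARS = set("-.~_")  # assumed gap symbols for this sketch


def residue_positions(row):
    """Column indices at which this aligned row holds a residue."""
    return [i for i, ch in enumerate(row) if ch not in GAP_CHARS]


def is_fragment(row, fragthresh=0.5):
    """Span rule: fragment if span <= fragthresh * alignment length."""
    positions = residue_positions(row)
    if not positions:
        return False  # empty rows are never called fragments here
    span = positions[-1] - positions[0] + 1  # inclusive span (assumption)
    return span <= fragthresh * len(row)


def consensus_columns(rows, weights=None, symfrac=0.5, fragthresh=0.5):
    """Columns whose weighted residue fraction is >= symfrac.

    Gaps outside the span of a fragment are ignored; gaps in
    full-length rows count against the column.
    """
    if weights is None:
        weights = [1.0] * len(rows)
    ncols = len(rows[0])
    residues = [0.0] * ncols  # weighted residue count per column
    totals = [0.0] * ncols    # weighted count of rows that "see" the column
    for row, w in zip(rows, weights):
        positions = residue_positions(row)
        if not positions:
            continue
        if is_fragment(row, fragthresh):
            lo, hi = positions[0], positions[-1]
        else:
            lo, hi = 0, ncols - 1
        for i in range(lo, hi + 1):
            totals[i] += w
            if row[i] not in GAP_CHARS:
                residues[i] += w
    return [i for i in range(ncols)
            if totals[i] > 0 and residues[i] / totals[i] >= symfrac]
```

       In this sketch, a short row such as "---EFG----" in a 10-column alignment is treated as
       a fragment at the default threshold, so its leading and trailing gaps do not lower the
       residue fraction of the columns they occupy.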

OPTIONS CONTROLLING RELATIVE WEIGHTS

       HMMER uses an ad hoc sequence weighting algorithm to downweight closely related  sequences
       and  upweight  distantly related ones. This has the effect of making models less biased by
       uneven phylogenetic representation. For example, two identical sequences  would  typically
       each  receive  half  the  weight  that  one  sequence  would.  These options control which
       algorithm gets used.

       --wpb  Use the Henikoff position-based sequence weighting scheme [Henikoff  and  Henikoff,
              J. Mol. Biol. 243:574, 1994].  This is the default.

       --wgsc Use  the  Gerstein/Sonnhammer/Chothia  weighting algorithm [Gerstein et al, J. Mol.
              Biol. 235:1067, 1994].

       --wblosum
              Use the same clustering scheme that was used to weight data in  calculating  BLOSUM
              substitution matrices [Henikoff and Henikoff, Proc. Natl. Acad. Sci 89:10915, 1992].
              Sequences are single-linkage clustered at an identity threshold (default 0.62;  see
              --wid)  and  within each cluster of c sequences, each sequence gets relative weight
              1/c.

       --wnone
              No relative weights. All sequences are assigned uniform weight.

       --wid <x>
              Sets the identity threshold used by single-linkage clustering when using --wblosum.
              Invalid with any other weighting scheme. Default is 0.62.
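
       As a rough illustration of the --wblosum scheme, the sketch below clusters sequences by
       single linkage at an identity threshold and gives each member of a cluster of c
       sequences the weight 1/c. The function names are hypothetical, and the identity measure
       (identical residues over columns where both rows have a residue) and the use of >= for
       the threshold are assumptions; HMMER's own definitions may differ.

```python
def pairwise_identity(a, b):
    """Fraction of identical residues over columns where both rows have one."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def blosum_weights(rows, wid=0.62):
    """Single-linkage cluster at identity >= wid; weight each row 1/c."""
    n = len(rows)
    parent = list(range(n))  # union-find forest over row indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Single linkage: one sufficiently similar pair joins two clusters.
    for i in range(n):
        for j in range(i + 1, n):
            if pairwise_identity(rows[i], rows[j]) >= wid:
                parent[find(i)] = find(j)

    sizes = {}
    for i in range(n):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return [1.0 / sizes[find(i)] for i in range(n)]
```

       With three near-identical rows and one unrelated row, the three similar rows each get
       weight 1/3 and the unrelated row gets weight 1, so the cluster as a whole counts about
       as much as the single distant sequence.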

OPTIONS CONTROLLING EFFECTIVE SEQUENCE NUMBER

       After  relative  weights  are  determined, they are normalized to sum to a total effective
       sequence number, eff_nseq.  This number may be the  actual  number  of  sequences  in  the
       alignment,  but  it  is  almost  always  smaller than that.  The default entropy weighting
       method (--eent) reduces the effective sequence number to reduce  the  information  content
       (relative entropy, or average expected score on true homologs) per consensus position. The
       target relative  entropy  is  controlled  by  a  two-parameter  function,  where  the  two
       parameters are settable with --ere and --esigma.
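
       The normalization step mentioned above amounts to rescaling the relative weights so
       that they sum to eff_nseq. A minimal sketch, with a hypothetical function name and no
       claim to mirror HMMER's internal code:

```python
def scale_to_eff_nseq(weights, eff_nseq):
    """Rescale relative weights so that they sum to eff_nseq."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("relative weights must have a positive sum")
    return [w * eff_nseq / total for w in weights]
```

       The ratios between weights are unchanged; only their total changes, which controls how
       strongly the observed counts outweigh the prior.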

       --eent Adjust  effective  sequence  number  to  achieve  a  specific  relative entropy per
              position (see --ere).  This is the default.

       --eclust
              Set effective sequence number  to  the  number  of  single-linkage  clusters  at  a
              specific  identity threshold (see --eid).  This option is not recommended; it's for
              experiments evaluating how much better --eent is.

       --enone
              Turn off effective sequence number determination and just use the actual number  of
              sequences.  One reason you might want to do this is to try to maximize the relative
              entropy/position of your model, which may be useful for short models.

       --eset <x>
              Explicitly set the effective sequence number for all models to <x>.

       --ere <x>
              Set the minimum relative entropy/position target to <x>.  Requires --eent.  Default
              depends  on the sequence alphabet. For protein sequences, it is 0.59 bits/position;
              for nucleotide sequences, it is 0.45 bits/position.

       --esigma <x>
              Sets the minimum relative entropy contributed by an entire  model  alignment,  over
              its  whole  length. This has the effect of making short models have higher relative
              entropy per position than --ere alone would give. The default is 45.0 bits.

       --eid <x>
              Sets the fractional pairwise identity cutoff used by single linkage clustering with
              the --eclust option. The default is 0.62.

OPTIONS CONTROLLING PRIORS

       By  default,  weighted  counts  are  converted  to  mean  posterior  probability parameter
       estimates using mixture Dirichlet priors.  Default mixture Dirichlet prior parameters  for
       protein  models  and  for  nucleic  acid  (RNA and DNA) models are built in. The following
       options allow you to override the default priors.

       --pnone
              Don't  use  any  priors.  Probability  parameters  will  simply  be  the   observed
              frequencies, after relative sequence weighting.

       --plaplace
              Use a Laplace +1 prior in place of the default mixture Dirichlet prior.
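
       The difference between --pnone and --plaplace can be illustrated on a single vector of
       weighted counts. The sketch below is not HMMER code; the function names are
       hypothetical, and it assumes the Laplace +1 prior simply adds one pseudocount to every
       symbol before normalizing.

```python
def observed_frequencies(counts):
    """No prior: parameters are the observed (weighted) frequencies."""
    total = sum(counts)
    return [c / total for c in counts]


def laplace_estimates(counts):
    """Laplace +1 prior: add one pseudocount per symbol, then normalize."""
    total = sum(counts) + len(counts)
    return [(c + 1.0) / total for c in counts]
```

       For weighted counts of 3, 1, 0, 0 over a four-letter alphabet, the first function
       leaves the unobserved symbols at probability zero, while the second assigns them 1/8
       each.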

OPTIONS CONTROLLING SINGLE SEQUENCE SCORING

       By default, if a query is a single sequence from a file in FASTA format, hmmbuild
       constructs a search model from that sequence and a standard 20x20 substitution matrix  for
       residue  probabilities,  along with two additional parameters for position-independent gap
       open and gap extend probabilities. These options allow the default single-sequence scoring
       parameters  to  be  changed,  and  for  single-sequence scoring options to be applied to a
       single sequence coming from an aligned format.

       --singlemx
              If a single sequence query comes from a multiple sequence alignment file,  such  as
              in Stockholm format, the search model is by default constructed as is typically
              done for multiple sequence alignments. This  option  forces  hmmbuild  to  use  the
              single-sequence method with substitution score matrix.

       --mx <s>
              Obtain  residue alignment probabilities from the built-in substitution matrix named
              <s>.  Several standard matrices are built-in, and do  not  need  to  be  read  from
              files.   The  matrix  name  <s>  can  be  PAM30,  PAM70,  PAM120, PAM240, BLOSUM45,
              BLOSUM50, BLOSUM62, BLOSUM80, BLOSUM90, or DNA1.  Only one of the --mx and --mxfile
              options may be used.

       --mxfile <mxfile>
              Obtain  residue  alignment  probabilities  from  the  substitution  matrix  in file
              <mxfile>.  The default score matrix is BLOSUM62 for protein sequences, and DNA1 for
              nucleotide  sequences  (these  matrices are internal to HMMER and do not need to be
              available as a file).  The format of a substitution matrix <mxfile> is the standard
              format  accepted  by  BLAST,  FASTA,  and  other  sequence  analysis software.  See
              ftp.ncbi.nlm.nih.gov/blast/matrices/ for example files.  (The  only  exception:  we
              require  matrices  to  be  square,  so  for DNA, use files like NCBI's NUC.4.4, not
              NUC.4.2.)

       --popen <x>
              Set the gap open probability for a single sequence query model to <x>.  The default
              is 0.02.  <x> must be >= 0 and < 0.5.

       --pextend <x>
              Set  the  gap  extend  probability  for  a single sequence query model to <x>.  The
              default is 0.4.  <x> must be >= 0 and < 1.0.

OPTIONS CONTROLLING E-VALUE CALIBRATION

       The location parameters for the  expected  score  distributions  for  MSV  filter  scores,
       Viterbi filter scores, and Forward scores require three short random sequence simulations.

       --EmL <n>
              Sets the sequence length in simulation that estimates the location parameter mu for
              MSV filter E-values. Default is 200.

       --EmN <n>
              Sets the number of sequences in simulation that estimates the location parameter mu
              for MSV filter E-values. Default is 200.

       --EvL <n>
              Sets the sequence length in simulation that estimates the location parameter mu for
              Viterbi filter E-values. Default is 200.

       --EvN <n>
              Sets the number of sequences in simulation that estimates the location parameter mu
              for Viterbi filter E-values. Default is 200.

       --EfL <n>
              Sets  the  sequence  length in simulation that estimates the location parameter tau
              for Forward E-values. Default is 100.

       --EfN <n>
              Sets the number of sequences in simulation that estimates  the  location  parameter
              tau for Forward E-values. Default is 200.

       --Eft <x>
              Sets  the  tail  mass fraction to fit in the simulation that estimates the location
              parameter tau for Forward E-values. Default is 0.04.

OTHER OPTIONS

       --cpu <n>
              Set the number of parallel worker threads  to  <n>.   On  multicore  machines,  the
              default is 2.  You can also control this number by setting an environment variable,
              HMMER_NCPU.  There is also a master thread, so the actual number  of  threads  that
              HMMER spawns is <n>+1.

              This  option  is  not  available  if  HMMER was compiled with POSIX threads support
              turned off.

       --informat <s>
              Assert  that  input  msafile  is  in  alignment  format   <s>,   bypassing   format
              autodetection.   Common  choices  for  <s>  include: stockholm, a2m, afa, psiblast,
              clustal, phylip.  For more information, and for codes for some less common formats,
              see the main documentation.  The string <s> is case-insensitive (a2m or A2M both
              work).

       --seed <n>
              Seed the random number generator with <n>, an integer >= 0.  If <n> is nonzero, any
              stochastic simulations will be reproducible; the same command will  give  the  same
              results.   If  <n>  is  0,  the  random number generator is seeded arbitrarily, and
              stochastic simulations will vary from run to run of the same command.  The  default
              seed is 42.

       --w_beta <x>
              Window length tail mass.  The upper bound, W, on the length at which nhmmer expects
              to find an instance of the model is set such that the  fraction  of  all  sequences
              generated by the model with length >= W is less than <x>.  The default is 1e-7.

       --w_length <n>
              Override the model instance length upper bound, W, which is otherwise controlled by
              --w_beta.  It should be larger than the model length. The value of W is  used  deep
              in the acceleration pipeline, and modest changes are not expected to impact results
              (though larger values of W do lead to longer run time).

       --mpi  Run as a parallel MPI program. Each alignment is assigned to an MPI worker node for
              construction.  (Therefore,  the maximum parallelization cannot exceed the number of
              alignments in the input msafile.)  This  is  useful  when  building  large  profile
              libraries.  This option is only available if optional MPI capability was enabled at
              compile-time.

       --stall
              For debugging MPI  parallelization:  arrest  program  execution  immediately  after
              start,  and  wait  for  a debugger to attach to the running process and release the
              arrest.

       --maxinsertlen <n>
              Restrict insert length parameterization such that the  expected  insert  length  at
              each position of the model is no more than <n>.

SEE ALSO

       See  hmmer(1)  for  a  master  man  page  with  a list of all the individual man pages for
       programs in the HMMER package.

       For complete documentation, see the user guide that  came  with  your  HMMER  distribution
       (Userguide.pdf); or see the HMMER web page (http://hmmer.org/).

COPYRIGHT

       Copyright (C) 2019 Howard Hughes Medical Institute.
       Freely distributed under the BSD open source license.

       For  additional  information  on copyright and licensing, see the file called COPYRIGHT in
       your HMMER source distribution, or see the HMMER web page (http://hmmer.org/).

AUTHOR

       http://eddylab.org