NAME

       alimask - Add mask line to a multiple sequence alignment

SYNOPSIS

       alimask [options] <msafile> <postmsafile>

DESCRIPTION

       alimask is used to apply a mask line to a multiple sequence alignment, based on provided
       alignment or model coordinates. When hmmbuild receives a masked alignment as input, it
       produces a profile model in which the emission probabilities at masked positions are set
       to match the background frequency, rather than being set based on observed frequencies in
       the alignment. Position-specific insertion and deletion rates are not altered, even in
       masked regions. alimask autodetects input format, and produces masked alignments in
       Stockholm format. <msafile> may contain only one sequence alignment.
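
       The effect of a mask on the resulting profile can be pictured with a small sketch. This
       is not hmmbuild's code: the function name apply_mask, the dict-based emission tables,
       and the toy two-letter alphabet are all assumptions made for illustration.

```python
def apply_mask(emissions, background, masked_positions):
    """Return a copy of `emissions` with masked positions reset to `background`.

    emissions        -- list of dicts (residue -> probability), one per model
                        position; list index 0 holds model position 1
    background       -- dict mapping residue -> background frequency
    masked_positions -- set of 1-based model positions to mask
    """
    result = []
    for pos, column in enumerate(emissions, start=1):
        if pos in masked_positions:
            # Masked: ignore the observed frequencies, use the background.
            result.append(dict(background))
        else:
            # Unmasked: keep the observed emission probabilities.
            result.append(dict(column))
    return result


# Hypothetical toy data: three model positions over a two-letter alphabet.
background = {"A": 0.5, "C": 0.5}
emissions = [{"A": 0.9, "C": 0.1}, {"A": 0.2, "C": 0.8}, {"A": 0.7, "C": 0.3}]
masked = apply_mask(emissions, background, {2})
```

       Only the emission table changes in this sketch; insertion and deletion parameters are
       left out entirely, mirroring the statement that they are not altered by masking.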

       A common motivation for masking a region in an alignment is that the region contains a
       simple tandem repeat that is observed to cause an unacceptably high rate of false positive
       hits.

       In the simplest case, a mask range is given in coordinates relative to the input
       alignment, using --alirange <s>. However, it is more often the case that the region to be
       masked has been identified in coordinates relative to the profile model (e.g. based on
       recognizing a simple repeat pattern in false hit alignments or in the HMM logo). Not all
       alignment columns are converted to match state positions in the profile (see the --symfrac
       flag for hmmbuild for discussion), so model positions do not necessarily match up to
       alignment column positions. To remove the burden of converting model positions to
       alignment positions, alimask accepts the mask range input in model coordinates as well,
       using --modelrange <s>. When using this flag, alimask determines which alignment
       positions would be identified by hmmbuild as match states, a process that requires that
       all hmmbuild flags impacting that decision be supplied to alimask. It is for this reason
       that many of the hmmbuild flags are also used by alimask.
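
       The relationship between model coordinates and alignment coordinates can be sketched as
       follows. The function name model_to_alignment and the list-of-booleans representation of
       match columns are assumptions for illustration; alimask's internal data structures are
       not described here.

```python
def model_to_alignment(match_columns, model_start, model_end):
    """Map a 1-based, inclusive model range to 1-based alignment columns.

    match_columns -- list of bool, one per alignment column; True where the
                     column would become a match state in the profile
    """
    # Alignment column (1-based) of each model position, in model order.
    match_cols = [col for col, is_match in enumerate(match_columns, start=1)
                  if is_match]
    if not 1 <= model_start <= model_end <= len(match_cols):
        raise ValueError("model range lies outside the model")
    return match_cols[model_start - 1], match_cols[model_end - 1]


# Hypothetical 6-column alignment in which columns 2 and 5 are not match columns.
flags = [True, False, True, True, False, True]
ali_range = model_to_alignment(flags, 2, 4)
```

       In this toy case the model has four positions, sitting at alignment columns 1, 3, 4, and
       6, so model range 2-4 corresponds to alignment range 3-6.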

OPTIONS

       -h     Help; print a brief reminder of command line usage and all available options.

       -o <f> Direct the summary output to file <f>, rather than to stdout.

OPTIONS FOR SPECIFYING MASK RANGE

       A single mask range is given as a dash-separated pair, like --modelrange 10-20, and
       multiple ranges may be submitted as a comma-separated list, --modelrange 10-20,30-42.
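
       The range syntax described above is simple enough to parse in a few lines. The sketch
       below is an illustration only; parse_ranges is a hypothetical helper, and the validation
       rules (positions start at 1, start must not exceed end) are assumptions rather than
       documented alimask behavior.

```python
def parse_ranges(spec):
    """Parse a string such as '10-20,30-42' into a list of (start, end) pairs."""
    ranges = []
    for part in spec.split(","):
        start_text, dash, end_text = part.partition("-")
        if not dash:
            raise ValueError(f"range {part!r} is not a dash-separated pair")
        start, end = int(start_text), int(end_text)
        # Assumed rule: coordinates are 1-based and ranges run left to right.
        if start < 1 or end < start:
            raise ValueError(f"range {part!r} is not a valid interval")
        ranges.append((start, end))
    return ranges


parsed = parse_ranges("10-20,30-42")
```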

       --modelrange <s>
              Supply the given range(s) in model coordinates.

       --alirange <s>
              Supply the given range(s) in alignment coordinates.

       --apendmask
              Add to the existing mask found with the alignment.  The default is to overwrite any
              existing mask.

       --model2ali <s>
              Rather than actually produce the masked alignment, simply print alignment range(s)
              corresponding to input model range(s).

       --ali2model <s>
              Rather than actually produce the masked alignment, simply print model range(s)
              corresponding to input alignment range(s).
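
       The difference between overwriting and adding to an existing mask can be sketched as
       follows. This is an illustration, not alimask's implementation: the helper name
       build_mask_line is hypothetical, and the use of 'm' for masked columns and '.' for
       unmasked columns is an assumption about how the mask line is written.

```python
def build_mask_line(alen, ranges, existing=None, append=False):
    """Build a per-column mask string for an alignment of `alen` columns.

    ranges   -- list of (start, end) pairs, 1-based and inclusive, in
                alignment coordinates
    existing -- an earlier mask string of length `alen`, or None
    append   -- if True, keep `existing` and add to it; otherwise start clean
    """
    if append and existing is not None:
        if len(existing) != alen:
            raise ValueError("existing mask does not match alignment length")
        mask = list(existing)
    else:
        # Default behavior in this sketch: any existing mask is discarded.
        mask = ["."] * alen
    for start, end in ranges:
        if not 1 <= start <= end <= alen:
            raise ValueError(f"range {start}-{end} lies outside the alignment")
        for col in range(start - 1, end):
            mask[col] = "m"
    return "".join(mask)


old = "mm........"
overwritten = build_mask_line(10, [(4, 6)], existing=old)
appended = build_mask_line(10, [(4, 6)], existing=old, append=True)
```

       Here overwritten masks only columns 4-6, while appended keeps the earlier mask on
       columns 1-2 as well.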

OPTIONS FOR SPECIFYING THE ALPHABET

       The alphabet type (amino, DNA, or RNA) is autodetected  by  default,  by  looking  at  the
       composition  of  the  msafile.  Autodetection is normally quite reliable, but occasionally
       alphabet type may be ambiguous and autodetection can  fail  (for  instance,  on  tiny  toy
       alignments  of just a few residues). To avoid this, or to increase robustness in automated
       analysis pipelines, you may specify the alphabet type of msafile with these options.

       --amino
              Specify that all sequences in msafile are proteins.

       --dna  Specify that all sequences in msafile are DNAs.

       --rna  Specify that all sequences in msafile are RNAs.
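
       To see why composition-based autodetection can fail on very small inputs, consider the
       deliberately naive sketch below. It does not reproduce HMMER's actual heuristic; the
       function name guess_alphabet, the letter sets, and the min_residues cutoff are all
       assumptions chosen for illustration.

```python
def guess_alphabet(sequences, min_residues=10):
    """Naively guess 'dna', 'rna', or 'amino' from residue composition.

    Returns None when there are too few residues to make a guess.
    """
    residues = "".join(sequences).upper().replace("-", "").replace(".", "")
    if len(residues) < min_residues:
        # A handful of letters such as 'ACGT' is also a valid peptide.
        return None
    letters = set(residues)
    if letters <= set("ACGTN"):
        return "dna"
    if letters <= set("ACGUN"):
        return "rna"
    return "amino"


guess_dna = guess_alphabet(["ACGTACGTAC", "ACG-ACGTTT"])
guess_tiny = guess_alphabet(["ACGU"])
guess_protein = guess_alphabet(["MKVLAAGIVG"])
```

       In a pipeline, passing --amino, --dna, or --rna explicitly sidesteps this kind of
       ambiguity altogether.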

OPTIONS CONTROLLING PROFILE CONSTRUCTION

       These options control how consensus columns are defined in an alignment.

       --fast Define consensus columns as those that have a fraction >= symfrac of residues as
              opposed to gaps. (See below for the --symfrac option.) This is the default.

       --hand Define consensus columns using the reference annotation of the multiple alignment
              (for example, the #=GC RF line of a Stockholm file). This allows you to define any
              consensus columns you like.

       --symfrac <x>
              Define the residue fraction threshold necessary to define a consensus column when
              using the --fast option. The default is 0.5. The symbol fraction in each column is
              calculated after taking relative sequence weighting into account, and ignoring gap
              characters corresponding to ends of sequence fragments (as opposed to internal
              insertions/deletions). Setting this to 0.0 means that every alignment column will
              be assigned as consensus, which may be useful in some cases. Setting it to 1.0
              means that only columns that include 0 gaps (internal insertions/deletions) will be
              assigned as consensus.
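
       The --fast rule can be sketched as follows. This is a simplified illustration: it
       treats all sequences as equally weighted and does not give fragment ends any special
       treatment, both of which the real calculation takes into account. The function name
       consensus_columns and the gap characters are assumptions.

```python
def consensus_columns(alignment, symfrac=0.5, gap_chars="-."):
    """Return one bool per column: True if the column counts as consensus.

    alignment -- list of equal-length aligned sequence strings
    """
    nseq = len(alignment)
    alen = len(alignment[0])
    if any(len(seq) != alen for seq in alignment):
        raise ValueError("aligned sequences must all have the same length")
    flags = []
    for col in range(alen):
        # Unweighted count of residues (non-gap characters) in this column.
        residues = sum(1 for seq in alignment if seq[col] not in gap_chars)
        flags.append(residues / nseq >= symfrac)
    return flags


# Hypothetical 4-sequence, 5-column alignment.
alignment = ["AC-GT", "A--GT", "ACTG-", "A--GT"]
default_flags = consensus_columns(alignment)
strict_flags = consensus_columns(alignment, symfrac=1.0)
```

       With the default threshold, only the third column (one residue out of four) falls below
       0.5; with a threshold of 1.0, only the gap-free first and fourth columns remain.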

       --fragthresh <x>
              We only want to count terminal gaps as deletions if the aligned sequence is known
              to be full-length, not if it is a fragment (for instance, because only part of it
              was sequenced). HMMER uses a simple rule to infer fragments: if the sequence length
              L is less than or equal to a fraction <x> times the alignment length in columns,
              then the sequence is handled as a fragment. The default is 0.5. Setting
              --fragthresh 0 will define no (nonempty) sequence as a fragment; you might want to
              do this if you know you've got a carefully curated alignment of full-length
              sequences. Setting --fragthresh 1 will define all sequences as fragments; you might
              want to do this if you know your alignment is entirely composed of fragments, such
              as translated short reads in metagenomic shotgun data.
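
       The fragment rule amounts to a single comparison, sketched below. The helper name
       is_fragment is hypothetical, and the sketch assumes that the sequence length L is the
       number of non-gap characters in the aligned sequence.

```python
def is_fragment(aligned_seq, fragthresh=0.5, gap_chars="-."):
    """Apply the rule L <= fragthresh * alignment length to one aligned sequence."""
    alen = len(aligned_seq)
    # Assumption: L counts residues only, not gap characters.
    length = sum(1 for ch in aligned_seq if ch not in gap_chars)
    return length <= fragthresh * alen


short_is_fragment = is_fragment("--ACG-----")   # L = 3, alignment length = 10
long_is_fragment = is_fragment("ACGTAC--GT")    # L = 8, alignment length = 10
```

       With a threshold of 0 the comparison becomes L <= 0, which no nonempty sequence
       satisfies; with a threshold of 1 it becomes L <= alignment length, which every sequence
       satisfies.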

OPTIONS CONTROLLING RELATIVE WEIGHTS

       HMMER uses an ad hoc sequence weighting algorithm to downweight closely related  sequences
       and  upweight  distantly related ones. This has the effect of making models less biased by
       uneven phylogenetic representation. For example, two identical sequences  would  typically
       each  receive  half  the  weight  that  one  sequence  would.  These options control which
       algorithm gets used.

       --wpb  Use the Henikoff position-based sequence weighting scheme [Henikoff  and  Henikoff,
              J. Mol. Biol. 243:574, 1994].  This is the default.

       --wgsc Use  the  Gerstein/Sonnhammer/Chothia  weighting algorithm [Gerstein et al, J. Mol.
              Biol. 235:1067, 1994].

       --wblosum
              Use the same clustering scheme that was used to weight data in  calculating  BLOSUM
              substitution matrices [Henikoff and Henikoff, Proc. Natl. Acad. Sci. 89:10915, 1992].
              Sequences are single-linkage clustered at an identity threshold (default 0.62;  see
              --wid)  and  within each cluster of c sequences, each sequence gets relative weight
              1/c.

       --wnone
              No relative weights. All sequences are assigned uniform weight.

       --wid <x>
              Sets the identity threshold used by single-linkage clustering when using --wblosum.
              Invalid with any other weighting scheme. Default is 0.62.
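
       The clustering-based scheme described under --wblosum can be sketched as follows. This
       is an illustration under stated assumptions, not HMMER's code: the helper names are
       hypothetical, and fractional identity is taken here to be the number of identical
       aligned residues divided by the length of the shorter sequence.

```python
def fractional_identity(a, b, gap_chars="-."):
    """Identical aligned residues divided by the shorter ungapped length."""
    idents = sum(1 for x, y in zip(a, b) if x == y and x not in gap_chars)
    len_a = sum(1 for x in a if x not in gap_chars)
    len_b = sum(1 for y in b if y not in gap_chars)
    shorter = min(len_a, len_b)
    return idents / shorter if shorter else 0.0


def blosum_weights(alignment, wid=0.62):
    """Single-linkage cluster at identity >= wid; weight each sequence 1/c."""
    n = len(alignment)
    parent = list(range(n))  # union-find forest, one node per sequence

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Single linkage: one sufficiently similar pair is enough to join clusters.
    for i in range(n):
        for j in range(i + 1, n):
            if fractional_identity(alignment[i], alignment[j]) >= wid:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(n)]
    sizes = {root: roots.count(root) for root in set(roots)}
    return [1.0 / sizes[root] for root in roots]


# Hypothetical alignment: two identical sequences and one unrelated sequence.
weights = blosum_weights(["ACDEFGHIKL", "ACDEFGHIKL", "MNPQRSTVWY"])
```

       The two identical sequences fall into one cluster and receive weight 1/2 each, while
       the unrelated sequence forms its own cluster and keeps weight 1.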

OTHER OPTIONS

       --informat <s>
              Declare  that  the input msafile is in format <s>.  Currently the accepted multiple
              alignment sequence file formats include Stockholm,  Aligned  FASTA,  Clustal,  NCBI
              PSI-BLAST,  PHYLIP, Selex, and UCSC SAM A2M. Default is to autodetect the format of
              the file.

       --seed <n>
              Seed the random number generator with <n>, an integer >= 0.  If <n> is nonzero, any
              stochastic  simulations  will  be reproducible; the same command will give the same
              results.  If <n> is 0, the random  number  generator  is  seeded  arbitrarily,  and
              stochastic  simulations will vary from run to run of the same command.  The default
              seed is 42.

COPYRIGHT

       Copyright (C) 2015 Howard Hughes Medical Institute.
       Freely distributed under the GNU General Public License (GPLv3).

       For additional information on copyright and licensing, see the file  called  COPYRIGHT  in
       your HMMER source distribution, or see the HMMER web page.

AUTHOR

       Eddy/Rivas Laboratory
       Janelia Farm Research Campus
       19700 Helix Drive
       Ashburn VA 20147 USA
       http://eddylab.org