Provided by: hmmer_3.3.2+dfsg-1_amd64

NAME

       alimask - calculate and add column mask to a multiple sequence alignment

SYNOPSIS

       alimask [options] msafile postmsafile

DESCRIPTION

       alimask  is  used to apply a mask line to a multiple sequence alignment, based on provided
       alignment or model coordinates.  When hmmbuild receives a masked alignment  as  input,  it
       produces  a  profile model in which the emission probabilities at masked positions are set
       to match the background frequency, rather than being set based on observed frequencies  in
       the  alignment.   Position-specific  insertion and deletion rates are not altered, even in
       masked regions.  alimask autodetects the input format and, by default, produces masked
       alignments in Stockholm format (see --outformat).  msafile may contain only one sequence
       alignment.

       A  common  motivation  for  masking a region in an alignment is that the region contains a
       simple tandem repeat that is observed to cause an unacceptably high rate of false positive
       hits.

       In  the  simplest  case,  a  mask  range  is  given  in  coordinates relative to the input
       alignment, using --alirange <s>.  However, it is more often the case that the region to be
       masked  has  been  identified  in coordinates relative to the profile model (e.g. based on
       recognizing a simple repeat pattern in false hit alignments or in the HMM logo).  Not  all
       alignment columns are converted to match state positions in the profile (see the --symfrac
       flag for hmmbuild for discussion), so model positions  do  not  necessarily  match  up  to
       alignment  column  positions.   To  remove  the  burden  of  converting model positions to
       alignment positions, alimask accepts the mask range input in model  coordinates  as  well,
       using  --modelrange  <s>.   When  using  this  flag,  alimask  determines  which alignment
       positions would be identified by hmmbuild as match states, a process  that  requires  that
       all  hmmbuild flags impacting that decision be supplied to alimask.  It is for this reason
       that many of the hmmbuild flags are also used by alimask.
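
       The relationship between model coordinates and alignment coordinates described above can
       be illustrated with a minimal Python sketch.  It is not alimask's implementation: the
       function name model_to_ali and the is_match list (one flag per alignment column, true
       where the column would become a match state) are hypothetical, and the sketch simply
       reports the first and last alignment columns spanned by a model range.

```python
def model_to_ali(model_range, is_match):
    """Map a 1-based inclusive model range to 1-based alignment columns.

    is_match[i] is True when alignment column i+1 is a consensus (match)
    column. How that flag is derived is outside this sketch.
    """
    # Alignment column number of each model position, in model order.
    match_cols = [col for col, flag in enumerate(is_match, start=1) if flag]
    start, end = model_range
    if not (1 <= start <= end <= len(match_cols)):
        raise ValueError("model range outside 1..%d" % len(match_cols))
    return match_cols[start - 1], match_cols[end - 1]


# Columns 2 and 5 are not consensus columns, so the model has 4 positions
# that correspond to alignment columns 1, 3, 4 and 6.
flags = [True, False, True, True, False, True]
print(model_to_ali((2, 4), flags))  # (3, 6)
```

       In this toy alignment, model positions 2-4 correspond to alignment columns 3-6, which is
       why a range identified on the model cannot be passed unchanged to --alirange.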

OPTIONS

       -h     Help; print a brief reminder of command line usage and all available options.

       -o <f> Direct the summary output to file <f>, rather than to stdout.

OPTIONS FOR SPECIFYING MASK RANGE

       A single mask range is given as a dash-separated pair, like --modelrange 10-20, and
       multiple ranges may be submitted as a comma-separated list, like --modelrange 10-20,30-42.
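
       As an illustration of the range syntax only, the following Python sketch parses such a
       string into (start, end) pairs.  The function name parse_ranges and its validation rules
       (1-based, start not greater than end) are assumptions made for this example; alimask's
       own parser may accept or reject inputs differently.

```python
def parse_ranges(spec):
    """Parse a string like '10-20,30-42' into a list of (start, end) tuples."""
    ranges = []
    for part in spec.split(","):
        start_text, sep, end_text = part.strip().partition("-")
        if not sep:
            raise ValueError("expected start-end, got %r" % part)
        start, end = int(start_text), int(end_text)
        # Assumed convention: 1-based, inclusive, start <= end.
        if start < 1 or end < start:
            raise ValueError("invalid range %r" % part)
        ranges.append((start, end))
    return ranges


print(parse_ranges("10-20"))        # [(10, 20)]
print(parse_ranges("10-20,30-42"))  # [(10, 20), (30, 42)]
```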

       --modelrange <s>
              Supply the given range(s) in model coordinates.

       --alirange <s>
              Supply the given range(s) in alignment coordinates.

       --apendmask
              Add to the existing mask found with the alignment.  The default is to overwrite any
              existing mask.

       --model2ali <s>
              Rather than actually produce the masked alignment, simply print the alignment
              range(s) corresponding to the input model range(s).

       --ali2model <s>
              Rather than actually produce the masked alignment, simply print the model
              range(s) corresponding to the input alignment range(s).
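
       The difference between overwriting and appending to an existing mask can be sketched as
       follows.  This is a hedged illustration, not alimask's code: the function build_mask is
       hypothetical, and the mask representation (one character per alignment column, 'm' for
       masked and '.' for unmasked) is an assumption of this example.

```python
def build_mask(alen, ranges, existing=None):
    """Return a mask string of length alen with 'm' at every masked column.

    ranges holds 1-based inclusive (start, end) alignment coordinates.
    If existing is given, its masked columns are kept (append behaviour);
    otherwise the mask starts out empty (overwrite behaviour).
    """
    if existing is not None and len(existing) != alen:
        raise ValueError("existing mask length does not match alignment")
    mask = list(existing) if existing is not None else ["."] * alen
    for start, end in ranges:
        if not (1 <= start <= end <= alen):
            raise ValueError("range %d-%d outside 1..%d" % (start, end, alen))
        for col in range(start, end + 1):
            mask[col - 1] = "m"
    return "".join(mask)


old_mask = "m........m"
print(build_mask(10, [(3, 5)]))                     # ..mmm.....
print(build_mask(10, [(3, 5)], existing=old_mask))  # m.mmm....m
```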

OPTIONS FOR SPECIFYING THE ALPHABET

       --amino
              Assert that sequences in msafile are protein, bypassing alphabet autodetection.

       --dna  Assert that sequences in msafile are DNA, bypassing alphabet autodetection.

       --rna  Assert that sequences in msafile are RNA, bypassing alphabet autodetection.

OPTIONS CONTROLLING PROFILE CONSTRUCTION

       These options control how consensus columns are defined in an alignment.

       --fast Define consensus columns as those that have a fraction >= symfrac  of  residues  as
              opposed to gaps. (See below for the --symfrac option.) This is the default.

       --hand Define consensus columns using the reference annotation line of the multiple
              alignment.  This allows you to define any consensus columns you like.

       --symfrac <x>
              Define the residue fraction threshold necessary to define a consensus  column  when
              using  the --fast option. The default is 0.5. The symbol fraction in each column is
              calculated after taking relative sequence weighting into account, and ignoring  gap
              characters  corresponding  to  ends  of  sequence fragments (as opposed to internal
              insertions/deletions).  Setting this to 0.0 means that every alignment column  will
              be  assigned  as  consensus,  which  may be useful in some cases. Setting it to 1.0
              means that only columns containing no gaps (internal insertions/deletions) will be
              assigned as consensus.

       --fragthresh <x>
              We  only  want to count terminal gaps as deletions if the aligned sequence is known
              to be full-length, not if it is a fragment (for instance, because only part  of  it
              was sequenced). HMMER uses a simple rule to infer fragments: if the sequence length
              L is less than or equal to a fraction <x> times the alignment  length  in  columns,
              then  the  sequence  is  handled  as  a  fragment.  The  default  is  0.5.  Setting
              --fragthresh 0 will define no (nonempty) sequence as a fragment; you might want  to
              do  this  if  you  know  you've  got  a  carefully curated alignment of full-length
              sequences.  Setting --fragthresh 1 will define  all  sequences  as  fragments;  you
              might want to do this if you know your alignment is entirely composed of fragments,
              such as translated short reads in metagenomic shotgun data.
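
       The interaction of --symfrac and --fragthresh described above can be sketched in Python.
       This is a simplified illustration under stated assumptions, not HMMER's implementation:
       the names is_fragment and consensus_columns are hypothetical, gaps are taken to be '-'
       or '.', sequence weights default to uniform, and a column in which no symbol is counted
       is treated as non-consensus.

```python
GAPS = set("-.")


def is_fragment(aligned_seq, fragthresh):
    """Fragment rule: residue count L <= fragthresh * alignment length."""
    length = sum(1 for ch in aligned_seq if ch not in GAPS)
    return length <= fragthresh * len(aligned_seq)


def consensus_columns(msa, weights=None, symfrac=0.5, fragthresh=0.5):
    """Flag each column whose weighted residue fraction is >= symfrac.

    Gaps outside the first..last residue of a fragment are ignored;
    every gap of a full-length sequence counts against the column.
    """
    alen = len(msa[0])
    if weights is None:
        weights = [1.0] * len(msa)  # assumed uniform weights
    residues = [0.0] * alen
    total = [0.0] * alen
    for seq, wgt in zip(msa, weights):
        cols = [i for i, ch in enumerate(seq) if ch not in GAPS]
        if is_fragment(seq, fragthresh):
            if not cols:
                continue  # an empty sequence contributes nothing
            lo, hi = cols[0], cols[-1]
        else:
            lo, hi = 0, alen - 1
        for i in range(lo, hi + 1):
            total[i] += wgt
            if seq[i] not in GAPS:
                residues[i] += wgt
    return [total[i] > 0 and residues[i] / total[i] >= symfrac
            for i in range(alen)]


msa = ["ACDEFGHI",
       "ACD-FGHI",
       "--DEF---",   # 3 residues <= 0.5 * 8, so handled as a fragment
       "AC--FGH-"]
print(consensus_columns(msa, symfrac=0.8, fragthresh=0.5))
print(consensus_columns(msa, symfrac=0.8, fragthresh=0.0))
```

       With the default fragment threshold, the terminal gaps of the third sequence are ignored
       and five columns pass the 0.8 cutoff.  With --fragthresh 0 the same gaps count against
       their columns, and only the fully occupied fifth column remains consensus.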

OPTIONS CONTROLLING RELATIVE WEIGHTS

       HMMER uses an ad hoc sequence weighting algorithm to downweight closely related  sequences
       and  upweight  distantly related ones. This has the effect of making models less biased by
       uneven phylogenetic representation. For example, two identical sequences  would  typically
       each  receive  half  the  weight  that  one  sequence  would.  These options control which
       algorithm gets used.

       --wpb  Use the Henikoff position-based sequence weighting scheme [Henikoff  and  Henikoff,
              J. Mol. Biol. 243:574, 1994].  This is the default.

       --wgsc Use the Gerstein/Sonnhammer/Chothia weighting algorithm [Gerstein et al., J. Mol.
              Biol. 235:1067, 1994].

       --wblosum
              Use the same clustering scheme that was used to weight data in  calculating  BLOSUM
              substitution matrices [Henikoff and Henikoff, Proc. Natl. Acad. Sci. USA 89:10915,
              1992]. Sequences are single-linkage clustered at  an  identity  threshold  (default
              0.62;  see  --wid)  and  within  each  cluster  of  c sequences, each sequence gets
              relative weight 1/c.

       --wnone
              No relative weights. All sequences are assigned uniform weight.

       --wid <x>
              Sets the identity threshold used by single-linkage clustering when using --wblosum.
              Invalid with any other weighting scheme. Default is 0.62.
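
       The clustering-based scheme selected by --wblosum can be sketched in a few lines of
       Python.  This is an illustration under stated assumptions rather than HMMER's code: the
       names pairwise_identity and blosum_weights are hypothetical, identity is computed over
       columns where both sequences have a residue, and two sequences are linked when their
       identity is greater than or equal to the threshold.

```python
GAPS = set("-.")


def pairwise_identity(a, b):
    """Fraction of identical residues over columns where neither has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x not in GAPS and y not in GAPS]
    if not pairs:
        return 0.0
    return sum(1 for x, y in pairs if x == y) / len(pairs)


def blosum_weights(msa, wid=0.62):
    """Single-linkage cluster at identity >= wid; each member gets 1/c."""
    n = len(msa)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if pairwise_identity(msa[i], msa[j]) >= wid:
                parent[find(i)] = find(j)  # merge the two clusters

    sizes = {}
    for i in range(n):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return [1.0 / sizes[find(i)] for i in range(n)]


# The first and third sequences are only 40% identical, but each is 70%
# identical to the second, so single linkage joins all three.
msa = ["AAAAAAAAAA",
       "AAAAAAACCC",
       "AAAACCCCCC",
       "GGGGGGGGGG"]
print(blosum_weights(msa))
print(blosum_weights(["ACDE", "ACDE"]))  # [0.5, 0.5]
```

       The last call reproduces the example given at the top of this section: two identical
       sequences each receive half the weight that a single sequence would.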

OTHER OPTIONS

       --informat <s>
              Assert   that   input   msafile  is  in  alignment  format  <s>,  bypassing  format
              autodetection.  Common choices for <s>  include:  stockholm,  a2m,  afa,  psiblast,
              clustal, phylip.  For more information, and for codes for some less common formats,
              see the main documentation.  The string <s> is case-insensitive (a2m or A2M both work).

       --outformat <s>
              Write the output postmsafile in alignment  format  <s>.   Common  choices  for  <s>
              include:  stockholm,  a2m, afa, psiblast, clustal, phylip.  The string <s> is case-
              insensitive (a2m or A2M both work).  Default is stockholm.

       --seed <n>
              Seed the random number generator with <n>, an integer >= 0.  If <n> is nonzero, any
              stochastic  simulations  will  be reproducible; the same command will give the same
              results.  If <n> is 0, the random  number  generator  is  seeded  arbitrarily,  and
              stochastic  simulations will vary from run to run of the same command.  The default
              seed is 42.

SEE ALSO

       See hmmer(1) for a master man page with a  list  of  all  the  individual  man  pages  for
       programs in the HMMER package.

       For  complete  documentation,  see  the  user guide that came with your HMMER distribution
       (Userguide.pdf); or see the HMMER web page (http://hmmer.org/).

COPYRIGHT

       Copyright (C) 2020 Howard Hughes Medical Institute.
       Freely distributed under the BSD open source license.

       For additional information on copyright and licensing, see the file  called  COPYRIGHT  in
       your HMMER source distribution, or see the HMMER web page (http://hmmer.org/).

AUTHOR

       http://eddylab.org