Provided by: phast_1.4+dfsg-1_amd64 

NAME
exoniphy - Prediction of evolutionarily conserved protein-coding exons using Required argument
<msa_fname> must be a multiple alignment file, in one of several possible formats (see --msa-format).
DESCRIPTION
Prediction of evolutionarily conserved protein-coding exons using a phylogenetic hidden Markov model
(phylo-HMM). By default, a model definition and model parameters are used that are appropriate for exon
prediction in human DNA, based on human/mouse/rat alignments and a 60-state HMM. Using the --hmm,
--tree-models, and --catmap options, however, it is possible to define alternative phylo-HMMs, e.g., for
different sets of species and different phylogenies, or for prediction of exon pairs or complete gene
structures.
OPTIONS
(Model definition and model parameters)
--hmm, -H <fname>
Name of HMM file defining states and transition probabilities. By default, the 60-state HMM
described in Siepel & Haussler (2004) is used, with transition probabilities appropriate for
mammalian genomes (estimated as described in that paper).
--tree-models, -m <fname_list> List of tree model (*.mod) files, one for each state in the HMM. Order of
models must correspond to order of states in HMM file. By default, a set of models appropriate
for human, mouse, and rat are used (estimated as described in Siepel & Haussler, 2004).
--catmap, -c <fname>|<string>
Mapping of feature types to category numbers.
Can give either
a filename or an "inline" description of a simple category map, e.g., --catmap "NCATS = 3 ; CDS
1-3". By default, a category map is used that is appropriate for the 60-state HMM mentioned
above.
--extrapolate, -e <phylog.nh> | default Extrapolate to a larger set of species based on the given
phylogeny (Newick-format). The trees in the given tree models (*.mod files) must be subtrees of
the larger phylogeny. For each tree model M, a copy will be created of the larger phylogeny, then
scaled such that the total branch length of the subtree corresponding to M's tree equals the total
branch length of M's tree; this new version will then be used in place of M's tree. (Any species
name present in this tree but not in the data will be ignored.) If the string "default" is given
instead of a filename, then a phylogeny for 25 vertebrate species, estimated from sequence data
for Target 1 (CFTR) of the NISC Comparative Sequencing Program (Thomas et al., 2003), will be
assumed.
--data-path, -D <path>
Path to the directory with phast data. Exoniphy default models should be in <path>/exoniphy/.
Default is set at compile time.
(Input and output)
--msa-format, -i FASTA|PHYLIP|MPM|MAF|SS
File format of input alignment.
Default is to guess alignment
format from file contents.
--score, -S Report log-odds scores for predictions, equal to their log total probability under an exon
model minus their log total probability under a background model. The exon model can be altered
using --cds-types and --signal-types and the background model can be altered using --backgd-types
(see below).
--seqname, -s <name>
Use specified string as "seqname" field in GFF output. Default is obtained from input file name
(double filename root, e.g., "chr22" if input file is "chr22.35.ss").
--idpref, -p <name>
Use specified string as prefix of generated ids in GFF output. Can be used to ensure ids are
unique. Default is obtained from input file name (single filename root, e.g., "chr22.35" if input
file is "chr22.35.ss").
--grouptag, -g <tag> Use specified string as the tag denoting groups in GFF output (default is
"transcript_id").
--alias, -A <alias_def>
Alias names in input alignment according to given definition, e.g., "hg17=human; mm5=mouse;
rn3=rat". Useful with default tree models and with --extrapolate. (Default models use generic
common names such as "human", "mouse", and "rat". This option allows a mapping to be established
between the leaves of trees in these files and the sequences of an alignment that uses an
alternative naming convention.)
(Altering the states and transition probabilities of the HMM)
--no-cns, -x
Eliminate the state/category for conserved noncoding sequence from the default HMM and category
map. Ignored if non-default HMM and category map are selected.
--reflect-strand, -U
Given an HMM describing the forward strand, create a larger HMM that allows for features on both
strands by "reflecting" the HMM about all states associated with background categories (see
--backgd-cats). The new HMM will be used for predictions on both strands. If the default HMM is
used, then this option will be used automatically.
--bias, -b <val>
Set "coding bias" equal to the specified value (default -3.33 if default HMM is used, 0
otherwise). The coding bias is added to the log probabilities of transitions from background
states to non-background states (see --backgd-cats), then all transition probabilities are
renormalized. If the coding bias is positive, then more predictions will tend to be made and
sensitivity will tend to improve, at some cost to specificity; if it is negative, then fewer
predictions will tend to be made, and specificity will tend to improve, at some cost to
sensitivity.
--sens-spec, -Y <fname-root> Make predictions for a range of different coding biases (see --bias), and
write results to files with given filename root. This allows the sensitivity/specificity tradeoff
to be examined. The range is fixed at -20 to 10, and 10 different sets of predictions are
produced. (Feature types)
--backgd-types, -B <list>
Feature types to be considered "background" (default value: "background,CNS"). Affects
--reflect-strand, --score, and --bias.
--cds-types, -C <list>
(for use with --score) Feature types that represent protein-coding regions (default value: "CDS").
--signal-types, -L <list> (for use with --score) Types of features to be considered "signals" during
scoring (default value: "start_codon,stop_codon,5'splice,3'splice,prestart,cds5'ss,cds3'ss"). One
score is produced for a CDS feature (as defined by --cds-types) and the adjacent signal features;
the score is then assigned to the CDS feature.
(Indels)
--indels, -I
Use the indel model described in Siepel & Haussler (2004).
--no-gaps, -W <list> Prohibit gaps in sites of the specified categories (gaps result in emission
probabilities of zero). If the default category map is used (see --catmap), then gaps are
prohibited in start and stop codons and at the canonical GT and AG positions of splice sites (with
or without --indels). In all other cases, the default behavior is to treat gaps as missing data,
or to address them with the indel model (--indels).
--require-informative, -N <list>
Require "informative" columns (i.e., columns with more than two non-missing-data characters,
excluding sequences specified by --not-informative) in the given categories (list by name or
number). Non-informative columns will be given emission probabilities of zero. If the default
category map is used (see --catmap), then this option applies automatically to CDSs, start and
stop codons, and the canonical GT and AG positions of splice sites. Note that alignment gaps
*are* considered informative; the way they are handled is defined by --indels and --no-gaps.
--not-informative, -n <list>
Do not consider the specified sequences (listed by name) when deciding whether a column is
informative. This option can be useful when sequences are present that are very close to the
reference sequence and thus do not contribute much in the way of phylogenetic information. E.g.,
one might use "--not-informative chimp" with a human-referenced multiple alignment including chimp
sequence.
(Other)
--quiet, -q
Proceed quietly (without messages to stderr).
--help -h Print this help message.
REFERENCES: A. Siepel and D. Haussler. 2004. Computational identification of evolutionarily conserved
exons. Proc. 8th Annual Int'l Conf.
on Research in Computational Biology (RECOMB '04), pp. 177-186. J. Thomas et al. 2003.
Comparative analyses of multi-species sequences from targeted genomic regions. Nature
424:788-793.
exoniphy 1.4 May 2016 EXONIPHY(1)