Ubuntu Manpage: phyloFit - Fits one or more tree models to a multiple alignment of DNA

NAME

       phyloFit - Fits one or more tree models to a multiple alignment of DNA

DESCRIPTION

       Fits  one  or  more  tree  models  to  a  multiple  alignment  of DNA sequences by maximum
       likelihood, using the specified tree topology and substitution model.   If  categories  of
       sites  are  defined via --features and --catmap (see below), then a separate model will be
       estimated for each category.  A description of each model will be written  to  a  separate
       file,  with  the  suffix  ".mod".   These .mod files minimally include a substitution rate
       matrix, a tree with branch lengths, and estimates of nucleotide  equilibrium  frequencies.
       They may also include information about parameters for modeling rate variation.

SYNOPSIS

       phyloFit [OPTIONS] <msa_fname>

       <msa_fname>  should  be a multiple alignment in FASTA format or one of several alternative
       formats (see --msa-format).  For backward compatibility, this argument may be preceded  by
       '-m'  or  '--msa'.   Note  that  --tree is required in most cases.  By default, all output
       files will have the prefix "phyloFit" (see --out-root).

EXAMPLE

       (If you're like me, you want some basic examples first, and a list of all options later.)

       1. Compute the distance between two aligned sequences (in FASTA file  pair.fa)  under  the
       REV model.

              phyloFit pair.fa

       (output is to phyloFit.mod; distance in substitutions per site appears in the TREE line in
       the output file)

       2. Fit a phylogenetic model to an alignment of human, chimp,  mouse,  and  rat  sequences.
       Use the HKY85 substitution model.  Write output to files with prefix "myfile".

              phyloFit  --tree  "((human,chimp),(mouse,rat))" --subst-mod HKY85 --out-root myfile
              primate-rodent.fa

       3. As above, but use the discrete-gamma model for rate variation, with 4 rate categories.

              phyloFit --tree "((human,chimp),(mouse,rat))" --subst-mod HKY85  --out-root  myfile
              --nrates 4 primate-rodent.fa

       4.  As  above,  but  use  genome-wide  data, stored in the compact "sufficient-statistics"
       format (can be produced with "msa_view -o SS").

              phyloFit --tree "((human,chimp),(mouse,rat))" --subst-mod HKY85  --out-root  myfile
              --nrates 4 --msa-format SS primate-rodent.ss

       5.  Fit  a context-dependent phylogenetic model (U2S) to an alignment of human, mouse, and
       rat sequences.  Use an EM algorithm for parameter optimization and relax  the  convergence
       criteria  a  bit  (recommended  with  context-dependent models).  Write a log file for the
       optimization procedure.  Consider only non-overlapping pairs of sites.

              phyloFit  --tree  "(human,(mouse,rat))"  --subst-mod  U2S  --EM   --precision   MED
              --non-overlapping --log u2s.log --out-root hmr-u2s hmr.fa

       6.  As  above,  but  allow overlapping pairs of sites, and compute likelihoods by assuming
       Markov-dependence of columns (see Siepel & Haussler,  2004).   The  EM  algorithm  can  no
       longer be used (optimization will be much slower).

              phyloFit   --tree  "(human,(mouse,rat))"  --subst-mod  U2S  --precision  MED  --log
              u2s-markov.log --markov hmr.fa

       7. Compute a likelihood using parameter estimates obtained in (5)  and  an  assumption  of
       Markov  dependence.  This provides a lower bound on the likelihood of the Markov-dependent
       model.

              phyloFit --init-model hmr-u2s.mod --lnl --markov hmr.fa

       8. Given an alignment  of  several  mammalian  sequences  (mammals.fa),  a  tree  topology
       (tree.nh),  and a set of gene annotations in GFF (genes.gff), fit separate models to sites
       in 1st, 2nd, and 3rd codon positions.  Use the  REV  substitution  model.   Assume  coding
       regions have feature type 'CDS'.

              phyloFit --tree tree.nh --features genes.gff --out-root mammals-rev --catmap "NCATS
              = 3; CDS 1-3" --do-cats 1,2,3 mammals.fa

       (output    will    be     to     mammals-rev.cds-1.mod,     mammals-rev.cds-2.mod,     and
       mammals-rev.cds-3.mod)

OPTIONS

--tree, -t <tree_fname>|<tree_string>

(Required if more than three species, or more than two species and a non-reversible
substitution model, e.g., UNREST, U2, U3) Name of file or literal string defining
tree topology. Tree must be in Newick format, with the label at each leaf equal to
the index or name of the corresponding sequence in the alignment (indexing begins
with 1). Examples: --tree "(1,(2,3))", --tree "(human,(mouse,rat))". Currently,
the topology must be rooted. When a reversible substitution model is used, the
root is ignored during the optimization procedure.

--subst-mod, -s JC69|F81|HKY85|HKY85+Gap|REV|SSREV|UNREST|R2|R2S|U2|U2S|R3|R3S|U3|U3S

(default REV).
Nucleotide substitution model. JC69, F81, HKY85

REV, and UNREST have the usual meanings (see, e.g., Yang, Goldman, and Friday,
1994). SSREV is a strand-symmetric version of REV. HKY85+Gap is an adaptation of
HKY that treats gaps as a fifth character (courtesy of James Taylor). The others,
all considered "context-dependent", are as defined in Siepel and Haussler, 2004.
The options --EM and --precision MED are recommended with context-dependent models
(see below).

--msa-format, -i FASTA|PHYLIP|MPM|MAF|SS (default is to guess format from file contents)
Alignment format. FASTA is as usual. PHYLIP is compatible with the formats used
in the PHYLIP and PAML packages. MPM is the format used by the MultiPipMaker
aligner and some other of Webb Miller's older tools. MAF ("Multiple Alignment
Format") is used by MULTIZ/TBA and the UCSC Genome Browser. SS is a simple format
describing the sufficient statistics for phylogenetic inference (distinct columns
or tuple of columns and their counts). Note that the program "msa_view" can be
used for file conversion.

--out-root, -o <output_fname_root> (default "phyloFit"). Use specified string as root
filename for all files created.

--min-informative, -I <ninf_sites>

Require at least <ninf_sites> "informative" sites -- i.e., sites at which at least
two non-gap and non-missing-data ('N' or '*') characters are present. Default is
50.

--gaps-as-bases, -G Treat alignment gap characters ('-') like ordinary bases. By

default, they are treated as missing data.

--ignore-branches, -b <branches>

Ignore specified branches in likelihood computations and parameter estimation, and
treat the induced subtrees as independent. Can be useful for likelihood ratio
tests. The argument <branches> should be a comma-separated list of nodes in the
tree, indicating the branches above these nodes, e.g., human-chimp,cow-dog. (See
tree_doctor --name-ancestors regarding names for ancestral nodes.) This option
does not currently work with --EM.

--quiet, -q Proceed quietly.

--help, -h

Print this help message. (Options for controlling and monitoring the optimization
procedure)

--lnl, -L

(for use with --init-model) Simply evaluate the log likelihood of the specified
tree model, without performing any further optimization. Can be used with
--post-probs, --expected-subs, and --expected-total-subs.

--EM, -E

Fit model(s) using EM rather than the BFGS quasi-Newton algorithm (the default).

--precision, -p HIGH|MED|LOW

(default HIGH) Level of precision to use in estimating model parameters. Affects
convergence criteria for iterative algorithms: higher precision means more
iterations and longer execution time.

--log, -l <log_fname>

Write log to <log_fname> describing details of the optimization procedure.

--init-model, -M <mod_fname> Initialize with specified tree model. By choosing good

starting values for parameters, it is possible to reduce execution time
dramatically. If this option is chosen, --tree is not allowed. The substitution
model used in the given model will be used unless --subst-mod is also specified.
Note: currently only one mod_fname may be specified; it will be used for all
categories.

--init-random, -r Initialize parameters randomly. Can be used multiple times to test

whether the m.l.e. is real.

--seed, -D <seed>

Provide a random number seed for choosing initial parameter values (usually with
--init-random, though random values are used in some other cases as well). Should
be an integer >=1. If not provided, seed is chosen based on current time.

--init-parsimony, -y Initialize branch lengths using parsimony counts for given data.
Only currently implemented for models with single character state (ie, not di- or
tri-nucleotides). Other --init options such as --init-random or --init-model can
be used in conjunction to initialize substitution matrix parameters.

--print-parsimony, -Y <filename>

Print parsimony score to given file, and quit.
(Does not optimize

or report likelihoods).

--clock, -z Assume a molecular clock in estimation. Causes the distances to all
descendant leaves to be equal for each ancestral node and cuts the number of free
branch-length parameters roughly in half.

--scale-only, -B

(for use with --init-model) Estimate only the scale of the tree, rather than
individual branch lengths (branch proportions fixed). Equilibrium frequencies and
rate-matrix parameters will still be estimated unless --no-freqs and --no-rates are
used.

--scale-subtree, -S <node_name> (for use with --scale-only) Estimate separate scale
factors for subtree beneath identified node and rest of tree. The branch leading
to the subtree is included with the subtree. If ":loss" or ":gain" is appended to
<node_name>, subtree scale is constrained to be greater than or less than
(respectively) scale for rest of tree.

--estimate-freqs, -F

Estimate equilibrium frequencies by maximum likelihood, rather than approximating
them by the relative frequencies in the data. If using the SSREV model, this
option implies --sym-freqs.

--sym-freqs, -W

Estimate equilibrium frequencies, assuming freq(A)=freq(T) and freq(C)=freq(G).
This only works for an alphabet ACGT (and possibly gap). This option implies
--estimate-freqs.

--no-freqs, -f

(for use with --init-model) Do not estimate equilibrium frequencies; just use the
ones from the given tree model.

--no-rates, -n

(for use with --init-model) Do not estimate rate-matrix parameters; just use the
ones from the given tree model.

--ancestor, -A <seqname> Treat specified sequence as the root of the tree. The tree
topology must define this sequence to be a child of the root (in practice, the
branch from the root to the specified sequence will be retained, but will be
constrained to have length zero).

--error, -e <filename>

For each parameter, report estimate, variance, and 95%% confidence interval,
printed to given filename, one parameter per line.

--no-opt, -O <param_list> Hold parameters listed in comma-separated param_list constant at
initial values. This applies only to the "main" model, and not to any models
defined with the --alt-mod option. Param list can contain values such as
"branches" to hold branch lengths constant, "ratematrix", "backgd", or "ratevar" to
hold entire rate matrix, equilibrium frequencies, or rate variation parameters
constant (respectively). There are also substitution model-specific parameters
such as "kappa" (transition/transversion rate ratio). Note: to hold certain
branches constant, but optimize others, put an exclamation point in the
newick-formatted tree after the branch lengths that should be held constant. This
can be useful for enforcing a star-phylogeny. However, note that the two branches
coming from root of tree are treated as one. So they should both be held constant,
or not held constant. This option does *not* work with --scale-only or --clock.

--bound <param_name[lower_bound,upper_bound]> Set boundaries for parameter. lower_bound
or upper_bound may be empty string to keep default. For example --bound
gc_param[1,] will

set the lower bound for gc_param to 1 (keeping upper bound at infinity), for a GC
model. Only applies to parameters for model in the "main" tree, but similar syntax
can be used within the --alt-mod arguments. Can be used multiple times to set
boundaries for different parameters.

--selection <selection_param>

Use selection in the model (is also implied if --init-model is used and contains
selection parameter). Selection scales rate matrix entries by
selection_param/(1-exp(-selection-param)); this is done after rate matrix is scaled
to set expected number of substitutions per unit time to 1. If using codon models
selection acts only on nonysynonymous mutations. (Options for modeling rate
variation)

--nrates, -k <nratecats> (default 1). Number of rate categories to use. Specifying a

value of greater than one causes the discrete gamma model for rate variation to be
used (Yang, 1994).

--alpha, -a <alpha> (for use with --nrates). Initial value for alpha, the shape parameter
of the gamma distribution. Default is 1.

--rate-constants, -K <rate_consts>

Use a non-parameteric mixture model for rates, instead of assuming a gamma
distribution. The argument <rate_consts> must be a comma-delimited list explicitly
defining the rate constants to be used. The "weight" (mixing proportion)
associated with each rate constant will be estimated by EM (this option implies
--EM). If --alpha is used with this option, then the mixing proportions will be
initialized to reflect a gamma distribution with the specified shape parameter.
(Options for separate handling of sites in different annotation categories)

--features, -g <fname>

Annotations file (GFF or BED format) describing features on one or more sequences
in the alignment. Together with a category map (see --catmap), will be taken to
define site categories, and a separate model will be estimated for each category.
If no category map is specified, a category will be assumed for each type of
feature, and they will be numbered in the order of appearance of the features.
Features are assumed to use the coordinate frame of the first sequence in the
alignment and should be non-overlapping (see 'refeature --unique').

--catmap, -c <fname>|<string>

(optionally use with --features) Mapping of feature types to category numbers. Can
either give a filename or an "inline" description of a simple category map, e.g.,
--catmap "NCATS = 3 ; CDS 1-3" or --catmap "NCATS = 1 ; UTR 1". Note that category
0 is reserved for "background" (everything that is not described by a defined
feature type).

--do-cats, -C <cat_list>

(optionally use with --features) Estimate models for only the specified categories
(comma-delimited list categories, by name or numbera). Default is to fit a model
for every category.

--reverse-groups, -R <tag>

(optionally use with --features) Group features by <tag> (e.g., "transcript_id" or
"exon_id") and reverse complement segments of the alignment corresponding to groups
on the reverse strand. Groups must be non-overlapping (see refeature --unique).
Useful with categories corresponding to strand-specific phenomena (e.g., codon
positions). (Options for context-dependent substitution models)

--markov, -N

(for use with context-dependent substitutions models and not available with --EM.)
Assume Markov dependence of alignment columns, and compute the conditional
probability of each column given its N-1 predecessors using the two-pass algorithm
described by Siepel and Haussler (2004). (Here, N is the "order" of the model, as
defined by --subst-mod; e.g., N=1 for REV, N=2 for U2S, N=3 for U3S.) The
alternative (the default) is simply to work with joint probabilities of tuples of
columns. (You can ensure that these tuples are non-overlapping with the
--non-overlapping option.) The use of joint probabilities during parameter
estimation allows the use of the --EM option and can be much faster; in addition,
it appears to produce nearly equivalent estimates. If desired, parameters can be
estimated without --markov, and then the likelihood can be evaluated using --lnl
and --markov together. This gives a lower bound on the likelihood of the
Markov-dependent model.

--non-overlapping, -V (for use with context-dependent substitution models; not compatible
with --markov, --features, or --msa-format SS) Avoid using overlapping tuples of
sites in parameter estimation. If a dinucleotide model is selected, every other
tuple will be considered, and if a nucleotide triplet model is selected, every
third tuple will be considered. This option cannot be used with an alignment
represented only by unordered sufficient statistics. (Option for lineage-specific
models)

--label-branches branch1,branch2,branch3...:label

Create a group of branches by giving a set of branches a single label. The label
should be a word which does not contain special characters (in particular, no
spaces, brackets, parentheses, pound signs, commas, or colons). The label is for
use with --alt-model option below, so that an alternate model can be defined for a
set of branches. A branch is specified by the name of the node which is a
descendant of that branch.

For example, --label-branches hg18,chimp,hg18-chimp:HC will apply the label "HC" to
the hg18,chimp,and hg18-chimp branches in the following tree:
(((hg18,chimp)hg18-chimp, (mouse,rat)mouse-rat)

The same label could be defined without using --label-branches by specifying the
tree (either on the command-line or within init-model) as follows: (((hg18 # HC,
chimp #HC)#HC, (mouse,rat))

--label-subtree node[+]:label Similar to label-branches, except labels the entire subtree
of the named node. If the node name is followed by a "+" sign, then includes the
branch leading up to the node in the subtree.

--alt-model, -d <label:(model|param_list)> Create a lineage-specific substitution model on
a group of branches. The group is defined by a label, which can be specified
within the tree string (using the # sign notation), or by using the
--label-branches or --label-subtree arguments. If the alt-model applies to only a
single branch, labelling is not necessary and the name of the node descending from
the branch can be used instead. See --label-branches above for more details on
labelling groups of branches. If a name of a substitution model (HKY85, REV,
UNREST, etc) is given after the colon, then this model will be used for the group
of branches, and parameters relevant to the model will be estimated separately in
this group. This model may be different (or the same) as the model used in the
rest of the tree, but it must have the same number of states and be of the same
order as the model used for the rest of the tree.

Alternately, a list of parameter names can be given after the colon. In this case,
the same substitution model will be used for the entire tree, but the parameters
listed will be estimated separately in the specified group of branches.

The parameter names are model-specific, and include "kappa" for HKY models, "alpha"
for GC models, "ratematrix" to specify all rate-matrix parameters in general
models, and "backgd" for the equilibrium background frequencies. The parameter
names may optionally be followed by boundaries in square brackets to specify
parameter bounds, as described in --bound option.

The --alt-model argument may be used multiple times, if one wishes to (for example)
optimize a parameter independently on several different groups of branches.
Example: phyloFit align.fa --subst-mod HKY85 \ --tree "(human, (mouse#MR,
rat#MR)#MR, cow)"\ --alt-model "MR:kappa[0, 1]"

will estimate the HKY85 parameter kappa separately on the mouse/rat subtree, and
constrain kappa between 0 and 1. The quotes are often necessary to prevent the
square brakcets from being parsed by the shell. The same model could be achieved
with:

phyloFit align.fa --subst-mod HKY85 \ --tree "(human, (mouse,rat)mouse-rat, cow)"\
--label-branches mouse,rat,mouse-rat:MR \ --alt-model "MR:kappa[0,1]" or

phyloFit align.fa --subst-mod HKY85 \ --tree "(human, (mouse,rat)mouse-rat, cow)" \
--label-subtree "mouse-rat+:MR" \ --alt-model "MR:kappa[0,1]"

(Options for posterior probabilities)

--post-probs, -P

Output posterior probabilities of all bases at all ancestral nodes. Output will be
to auxiliary file(s) with suffix ".postprobs".

--expected-subs, -X

Output posterior expected number of substitutions on each branch at each site,
summed across all types of substitutions. Output will be to auxiliary file(s) with
suffix ".expsub".

--expected-subs-col, -J

Output posterior expected number of substitutions of each type on each branch, for
each site. Output will be to auxiliary file(s) with suffix .expcolsub

--expected-total-subs, -Z

Output posterior expected number of substitutions of each type on each branch,
summed across all sites. Output will be to auxiliary file(s) with suffix
".exptotsub".

--column-probs, -U

(for use with -init-model; implies --lnl)
Output a separate log

probability for each type of column in the input.
Output will

be to a file with suffix ".colprobs".
Values are log base 2.

(Options for estimation in sliding window)

--windows, -w <size,shift> Apply a sliding window to the alignment, and fit a separate
tree to each window. Arguments specify size of window and amount by which to shift
it on each iteration, both in bases of the first sequence in the alignment (assumed
to be the reference sequence). Separate versions of all output files will be
created for each window.

--windows-explicit, -v <window_coord_list> Like --windows, except that all start and end
coordinates must be explicitly specified. Each successive pair of numbers is
interpreted as defining the start and end of a window. Can be used with a
two-column file and the '*' operator, e.g., --windows-explicit '*mycoords'.

NAME

DESCRIPTION

SYNOPSIS

EXAMPLE

OPTIONS

SEE ALSO