lunar (1) phyloFit.1.gz

Provided by: phast_1.6+dfsg-3_amd64 bug

NAME

       phyloFit - Fits one or more tree models to a multiple alignment of DNA

DESCRIPTION

       Fits  one  or  more  tree  models  to  a  multiple  alignment  of DNA sequences by maximum
       likelihood, using the specified tree topology and substitution model.   If  categories  of
       sites  are  defined via --features and --catmap (see below), then a separate model will be
       estimated for each category.  A description of each model will be written  to  a  separate
       file,  with  the  suffix  ".mod".   These .mod files minimally include a substitution rate
       matrix, a tree with branch lengths, and estimates of nucleotide  equilibrium  frequencies.
       They may also include information about parameters for modeling rate variation.

SYNOPSIS

       phyloFit [OPTIONS] <msa_fname>

       <msa_fname>  should  be a multiple alignment in FASTA format or one of several alternative
       formats (see --msa-format).  For backward compatibility, this argument may be preceded  by
       '-m'  or  '--msa'.   Note  that  --tree is required in most cases.  By default, all output
       files will have the prefix "phyloFit" (see --out-root).

EXAMPLE

       (If you're like me, you want some basic examples first, and a list of all options later.)

       1. Compute the distance between two aligned sequences (in FASTA file  pair.fa)  under  the
       REV model.

              phyloFit pair.fa

       (output is to phyloFit.mod; distance in substitutions per site appears in the TREE line in
       the output file)

       2. Fit a phylogenetic model to an alignment of human, chimp,  mouse,  and  rat  sequences.
       Use the HKY85 substitution model.  Write output to files with prefix "myfile".

              phyloFit  --tree  "((human,chimp),(mouse,rat))" --subst-mod HKY85 --out-root myfile
              primate-rodent.fa

       3. As above, but use the discrete-gamma model for rate variation, with 4 rate categories.

              phyloFit --tree "((human,chimp),(mouse,rat))" --subst-mod HKY85  --out-root  myfile
              --nrates 4 primate-rodent.fa

       4.  As  above,  but  use  genome-wide  data, stored in the compact "sufficient-statistics"
       format (can be produced with "msa_view -o SS").

              phyloFit --tree "((human,chimp),(mouse,rat))" --subst-mod HKY85  --out-root  myfile
              --nrates 4 --msa-format SS primate-rodent.ss

       5.  Fit  a context-dependent phylogenetic model (U2S) to an alignment of human, mouse, and
       rat sequences.  Use an EM algorithm for parameter optimization and relax  the  convergence
       criteria  a  bit  (recommended  with  context-dependent models).  Write a log file for the
       optimization procedure.  Consider only non-overlapping pairs of sites.

              phyloFit  --tree  "(human,(mouse,rat))"  --subst-mod  U2S  --EM   --precision   MED
              --non-overlapping --log u2s.log --out-root hmr-u2s hmr.fa

       6.  As  above,  but  allow overlapping pairs of sites, and compute likelihoods by assuming
       Markov-dependence of columns (see Siepel & Haussler,  2004).   The  EM  algorithm  can  no
       longer be used (optimization will be much slower).

              phyloFit   --tree  "(human,(mouse,rat))"  --subst-mod  U2S  --precision  MED  --log
              u2s-markov.log --markov hmr.fa

       7. Compute a likelihood using parameter estimates obtained in (5)  and  an  assumption  of
       Markov  dependence.  This provides a lower bound on the likelihood of the Markov-dependent
       model.

              phyloFit --init-model hmr-u2s.mod --lnl --markov hmr.fa

       8. Given an alignment  of  several  mammalian  sequences  (mammals.fa),  a  tree  topology
       (tree.nh),  and a set of gene annotations in GFF (genes.gff), fit separate models to sites
       in 1st, 2nd, and 3rd codon positions.  Use the  REV  substitution  model.   Assume  coding
       regions have feature type 'CDS'.

              phyloFit --tree tree.nh --features genes.gff --out-root mammals-rev --catmap "NCATS
              = 3; CDS 1-3" --do-cats 1,2,3 mammals.fa

       (output    will    be     to     mammals-rev.cds-1.mod,     mammals-rev.cds-2.mod,     and
       mammals-rev.cds-3.mod)

OPTIONS

       --tree, -t <tree_fname>|<tree_string>

              (Required if more than three species, or more than two species and a non-reversible
              substitution model, e.g., UNREST, U2, U3) Name of file or literal  string  defining
              tree topology.  Tree must be in Newick format, with the label at each leaf equal to
              the index or name of the corresponding sequence in the alignment  (indexing  begins
              with  1).   Examples: --tree "(1,(2,3))", --tree "(human,(mouse,rat))".  Currently,
              the topology must be rooted.  When a reversible substitution  model  is  used,  the
              root is ignored during the optimization procedure.

       --subst-mod, -s JC69|F81|HKY85|HKY85+Gap|REV|SSREV|UNREST|R2|R2S|U2|U2S|R3|R3S|U3|U3S

       (default REV).
              Nucleotide substitution model.  JC69, F81, HKY85

              REV,  and  UNREST  have  the  usual meanings (see, e.g., Yang, Goldman, and Friday,
              1994).  SSREV is a strand-symmetric version of REV.  HKY85+Gap is an adaptation  of
              HKY  that treats gaps as a fifth character (courtesy of James Taylor).  The others,
              all considered "context-dependent", are as defined in Siepel  and  Haussler,  2004.
              The  options --EM and --precision MED are recommended with context-dependent models
              (see below).

       --msa-format, -i FASTA|PHYLIP|MPM|MAF|SS (default is to guess format from  file  contents)
              Alignment  format.   FASTA is as usual.  PHYLIP is compatible with the formats used
              in the PHYLIP and PAML packages.  MPM is  the  format  used  by  the  MultiPipMaker
              aligner  and  some  other  of  Webb Miller's older tools.  MAF ("Multiple Alignment
              Format") is used by MULTIZ/TBA and the UCSC Genome Browser.  SS is a simple  format
              describing  the  sufficient statistics for phylogenetic inference (distinct columns
              or tuple of columns and their counts).  Note that the  program  "msa_view"  can  be
              used for file conversion.

       --out-root,  -o  <output_fname_root>  (default  "phyloFit").  Use specified string as root
              filename for all files created.

       --min-informative, -I <ninf_sites>

              Require at least <ninf_sites> "informative" sites -- i.e., sites at which at  least
              two  non-gap  and non-missing-data ('N' or '*') characters are present.  Default is
              50.

       --gaps-as-bases, -G Treat alignment gap characters ('-') like ordinary bases.  By

              default, they are treated as missing data.

       --ignore-branches, -b <branches>

              Ignore specified branches in likelihood computations and parameter estimation,  and
              treat  the  induced  subtrees  as  independent.  Can be useful for likelihood ratio
              tests.  The argument <branches> should be a comma-separated list of  nodes  in  the
              tree,  indicating  the branches above these nodes, e.g., human-chimp,cow-dog.  (See
              tree_doctor --name-ancestors regarding names for  ancestral  nodes.)   This  option
              does not currently work with --EM.

       --quiet, -q Proceed quietly.

       --help, -h

              Print  this help message.  (Options for controlling and monitoring the optimization
              procedure)

       --lnl, -L

              (for use with --init-model) Simply evaluate the log  likelihood  of  the  specified
              tree  model,  without  performing  any  further  optimization.   Can  be  used with
              --post-probs, --expected-subs, and --expected-total-subs.

       --EM, -E

              Fit model(s) using EM rather than the BFGS quasi-Newton algorithm (the default).

       --precision, -p HIGH|MED|LOW

              (default HIGH) Level of precision to use in estimating model  parameters.   Affects
              convergence   criteria  for  iterative  algorithms:  higher  precision  means  more
              iterations and longer execution time.

       --log, -l <log_fname>

              Write log to <log_fname> describing details of the optimization procedure.

       --init-model, -M <mod_fname> Initialize with specified tree model.  By choosing good

              starting  values  for  parameters,  it  is  possible  to  reduce   execution   time
              dramatically.   If  this option is chosen, --tree is not allowed.  The substitution
              model used in the given model will be used unless --subst-mod  is  also  specified.
              Note:  currently  only  one  mod_fname  may  be  specified; it will be used for all
              categories.

       --init-random, -r Initialize parameters randomly.  Can be used multiple times to test

              whether the m.l.e. is real.

       --seed, -D <seed>

              Provide a random number seed for choosing initial parameter  values  (usually  with
              --init-random,  though random values are used in some other cases as well).  Should
              be an integer >=1.  If not provided, seed is chosen based on current time.

       --init-parsimony, -y Initialize branch lengths using  parsimony  counts  for  given  data.
              Only  currently  implemented for models with single character state (ie, not di- or
              tri-nucleotides).  Other --init options such as --init-random or  --init-model  can
              be used in conjunction to initialize substitution matrix parameters.

       --print-parsimony, -Y <filename>

       Print parsimony score to given file, and quit.
              (Does not optimize

              or report likelihoods).

       --clock,  -z  Assume  a  molecular  clock  in  estimation.   Causes  the  distances to all
              descendant leaves to be equal for each ancestral node and cuts the number  of  free
              branch-length parameters roughly in half.

       --scale-only, -B

              (for  use  with  --init-model)  Estimate  only  the  scale of the tree, rather than
              individual branch lengths (branch proportions fixed).  Equilibrium frequencies  and
              rate-matrix parameters will still be estimated unless --no-freqs and --no-rates are
              used.

       --scale-subtree, -S <node_name>  (for  use  with  --scale-only)  Estimate  separate  scale
              factors  for  subtree beneath identified node and rest of tree.  The branch leading
              to the subtree is included with the subtree.  If ":loss" or ":gain" is appended  to
              <node_name>,  subtree  scale  is  constrained  to  be  greater  than  or  less than
              (respectively) scale for rest of tree.

       --estimate-freqs, -F

              Estimate equilibrium frequencies by maximum likelihood, rather  than  approximating
              them  by  the  relative  frequencies  in  the data.  If using the SSREV model, this
              option implies --sym-freqs.

       --sym-freqs, -W

              Estimate equilibrium frequencies,  assuming  freq(A)=freq(T)  and  freq(C)=freq(G).
              This  only  works  for  an  alphabet  ACGT (and possibly gap).  This option implies
              --estimate-freqs.

       --no-freqs, -f

              (for use with --init-model) Do not estimate equilibrium frequencies; just  use  the
              ones from the given tree model.

       --no-rates, -n

              (for  use  with  --init-model) Do not estimate rate-matrix parameters; just use the
              ones from the given tree model.

       --ancestor, -A <seqname> Treat specified sequence as the  root  of  the  tree.   The  tree
              topology  must  define  this  sequence  to be a child of the root (in practice, the
              branch from the root to the specified  sequence  will  be  retained,  but  will  be
              constrained to have length zero).

       --error, -e <filename>

              For  each  parameter,  report  estimate,  variance,  and  95%% confidence interval,
              printed to given filename, one parameter per line.

       --no-opt, -O <param_list> Hold parameters listed in comma-separated param_list constant at
              initial  values.   This  applies  only  to  the "main" model, and not to any models
              defined with  the  --alt-mod  option.   Param  list  can  contain  values  such  as
              "branches" to hold branch lengths constant, "ratematrix", "backgd", or "ratevar" to
              hold entire rate matrix, equilibrium  frequencies,  or  rate  variation  parameters
              constant  (respectively).   There  are  also substitution model-specific parameters
              such as "kappa"  (transition/transversion  rate  ratio).   Note:  to  hold  certain
              branches   constant,   but  optimize  others,  put  an  exclamation  point  in  the
              newick-formatted tree after the branch lengths that should be held constant.   This
              can  be useful for enforcing a star-phylogeny.  However, note that the two branches
              coming from root of tree are treated as one.  So they should both be held constant,
              or not held constant.  This option does *not* work with --scale-only or --clock.

       --bound  <param_name[lower_bound,upper_bound]>  Set boundaries for parameter.  lower_bound
              or  upper_bound  may  be  empty  string  to  keep  default.   For  example  --bound
              gc_param[1,] will

              set  the  lower bound for gc_param to 1 (keeping upper bound at infinity), for a GC
              model.  Only applies to parameters for model in the "main" tree, but similar syntax
              can  be  used  within  the  --alt-mod arguments.  Can be used multiple times to set
              boundaries for different parameters.

       --selection <selection_param>

              Use selection in the model (is also implied if --init-model is  used  and  contains
              selection    parameter).     Selection    scales    rate    matrix    entries    by
              selection_param/(1-exp(-selection-param)); this is done after rate matrix is scaled
              to  set expected number of substitutions per unit time to 1.  If using codon models
              selection acts only  on  nonysynonymous  mutations.   (Options  for  modeling  rate
              variation)

       --nrates, -k <nratecats> (default 1).  Number of rate categories to use.  Specifying a

              value  of greater than one causes the discrete gamma model for rate variation to be
              used (Yang, 1994).

       --alpha, -a <alpha> (for use with --nrates).  Initial value for alpha, the shape parameter
              of the gamma distribution.  Default is 1.

       --rate-constants, -K <rate_consts>

              Use  a  non-parameteric  mixture  model  for  rates,  instead  of  assuming a gamma
              distribution.  The argument <rate_consts> must be a comma-delimited list explicitly
              defining  the  rate  constants  to  be  used.   The  "weight"  (mixing  proportion)
              associated with each rate constant will be estimated by  EM  (this  option  implies
              --EM).   If  --alpha  is used with this option, then the mixing proportions will be
              initialized to reflect a gamma distribution with  the  specified  shape  parameter.
              (Options for separate handling of sites in different annotation categories)

       --features, -g <fname>

              Annotations  file  (GFF or BED format) describing features on one or more sequences
              in the alignment.  Together with a category map (see --catmap), will  be  taken  to
              define  site  categories, and a separate model will be estimated for each category.
              If no category map is specified, a category  will  be  assumed  for  each  type  of
              feature,  and  they  will  be  numbered in the order of appearance of the features.
              Features are assumed to use the coordinate frame  of  the  first  sequence  in  the
              alignment and should be non-overlapping (see 'refeature --unique').

       --catmap, -c <fname>|<string>

              (optionally use with --features) Mapping of feature types to category numbers.  Can
              either give a filename or an "inline" description of a simple category  map,  e.g.,
              --catmap "NCATS = 3 ; CDS 1-3" or --catmap "NCATS = 1 ; UTR 1".  Note that category
              0 is reserved for "background" (everything that  is  not  described  by  a  defined
              feature type).

       --do-cats, -C <cat_list>

              (optionally  use with --features) Estimate models for only the specified categories
              (comma-delimited list categories, by name or numbera).  Default is to fit  a  model
              for every category.

       --reverse-groups, -R <tag>

              (optionally  use with --features) Group features by <tag> (e.g., "transcript_id" or
              "exon_id") and reverse complement segments of the alignment corresponding to groups
              on  the  reverse  strand.  Groups must be non-overlapping (see refeature --unique).
              Useful with categories corresponding  to  strand-specific  phenomena  (e.g.,  codon
              positions).  (Options for context-dependent substitution models)

       --markov, -N

              (for  use with context-dependent substitutions models and not available with --EM.)
              Assume  Markov  dependence  of  alignment  columns,  and  compute  the  conditional
              probability  of each column given its N-1 predecessors using the two-pass algorithm
              described by Siepel and Haussler (2004).  (Here, N is the "order" of the model,  as
              defined  by  --subst-mod;  e.g.,  N=1  for  REV,  N=2  for  U2S,  N=3 for U3S.) The
              alternative (the default) is simply to work with joint probabilities of  tuples  of
              columns.    (You  can  ensure  that  these  tuples  are  non-overlapping  with  the
              --non-overlapping  option.)   The  use  of  joint  probabilities  during  parameter
              estimation  allows  the use of the --EM option and can be much faster; in addition,
              it appears to produce nearly equivalent estimates.  If desired, parameters  can  be
              estimated  without  --markov,  and then the likelihood can be evaluated using --lnl
              and --markov together.   This  gives  a  lower  bound  on  the  likelihood  of  the
              Markov-dependent model.

       --non-overlapping,  -V (for use with context-dependent substitution models; not compatible
              with --markov, --features, or --msa-format SS) Avoid using  overlapping  tuples  of
              sites  in  parameter  estimation.  If a dinucleotide model is selected, every other
              tuple will be considered, and if a nucleotide  triplet  model  is  selected,  every
              third  tuple  will  be  considered.   This  option cannot be used with an alignment
              represented only by unordered sufficient statistics.  (Option for  lineage-specific
              models)

       --label-branches branch1,branch2,branch3...:label

              Create  a  group of branches by giving a set of branches a single label.  The label
              should be a word which does not  contain  special  characters  (in  particular,  no
              spaces,  brackets,  parentheses, pound signs, commas, or colons).  The label is for
              use with --alt-model option below, so that an alternate model can be defined for  a
              set  of  branches.   A  branch  is  specified  by  the  name of the node which is a
              descendant of that branch.

              For example, --label-branches hg18,chimp,hg18-chimp:HC will apply the label "HC" to
              the     hg18,chimp,and    hg18-chimp    branches    in    the    following    tree:
              (((hg18,chimp)hg18-chimp, (mouse,rat)mouse-rat)

              The same label could be defined without using --label-branches  by  specifying  the
              tree  (either  on  the command-line or within init-model) as follows: (((hg18 # HC,
              chimp #HC)#HC, (mouse,rat))

       --label-subtree node[+]:label Similar to label-branches, except labels the entire  subtree
              of  the  named node.  If the node name is followed by a "+" sign, then includes the
              branch leading up to the node in the subtree.

       --alt-model, -d <label:(model|param_list)> Create a lineage-specific substitution model on
              a  group  of  branches.   The  group  is defined by a label, which can be specified
              within  the  tree  string  (using  the  #  sign  notation),   or   by   using   the
              --label-branches  or --label-subtree arguments.  If the alt-model applies to only a
              single branch, labelling is not necessary and the name of the node descending  from
              the  branch  can  be  used instead.  See --label-branches above for more details on
              labelling groups of branches.  If a name  of  a  substitution  model  (HKY85,  REV,
              UNREST,  etc)  is given after the colon, then this model will be used for the group
              of branches, and parameters relevant to the model will be estimated  separately  in
              this  group.   This  model  may be different (or the same) as the model used in the
              rest of the tree, but it must have the same number of states and  be  of  the  same
              order as the model used for the rest of the tree.

              Alternately, a list of parameter names can be given after the colon.  In this case,
              the same substitution model will be used for the entire tree,  but  the  parameters
              listed will be estimated separately in the specified group of branches.

              The parameter names are model-specific, and include "kappa" for HKY models, "alpha"
              for GC models, "ratematrix"  to  specify  all  rate-matrix  parameters  in  general
              models,  and  "backgd"  for  the equilibrium background frequencies.  The parameter
              names may optionally be followed  by  boundaries  in  square  brackets  to  specify
              parameter bounds, as described in --bound option.

              The --alt-model argument may be used multiple times, if one wishes to (for example)
              optimize a  parameter  independently  on  several  different  groups  of  branches.
              Example:   phyloFit   align.fa  --subst-mod  HKY85  \  --tree  "(human,  (mouse#MR,
              rat#MR)#MR, cow)"\ --alt-model "MR:kappa[0, 1]"

              will estimate the HKY85 parameter kappa separately on the  mouse/rat  subtree,  and
              constrain  kappa  between  0  and 1.  The quotes are often necessary to prevent the
              square brakcets from being parsed by the shell.  The same model could  be  achieved
              with:

              phyloFit  align.fa --subst-mod HKY85 \ --tree "(human, (mouse,rat)mouse-rat, cow)"\
              --label-branches mouse,rat,mouse-rat:MR \ --alt-model "MR:kappa[0,1]" or

              phyloFit align.fa --subst-mod HKY85 \ --tree "(human, (mouse,rat)mouse-rat, cow)" \
              --label-subtree "mouse-rat+:MR" \ --alt-model "MR:kappa[0,1]"

              (Options for posterior probabilities)

       --post-probs, -P

              Output posterior probabilities of all bases at all ancestral nodes.  Output will be
              to auxiliary file(s) with suffix ".postprobs".

       --expected-subs, -X

              Output posterior expected number of substitutions on  each  branch  at  each  site,
              summed across all types of substitutions.  Output will be to auxiliary file(s) with
              suffix ".expsub".

       --expected-subs-col, -J

              Output posterior expected number of substitutions of each type on each branch,  for
              each site.  Output will be to auxiliary file(s) with suffix .expcolsub

       --expected-total-subs, -Z

              Output  posterior  expected  number  of  substitutions of each type on each branch,
              summed across  all  sites.   Output  will  be  to  auxiliary  file(s)  with  suffix
              ".exptotsub".

       --column-probs, -U

       (for use with -init-model; implies --lnl)
              Output a separate log

       probability for each type of column in the input.
              Output will

       be to a file with suffix ".colprobs".
              Values are log base 2.

              (Options for estimation in sliding window)

       --windows,  -w  <size,shift>  Apply  a sliding window to the alignment, and fit a separate
              tree to each window.  Arguments specify size of window and amount by which to shift
              it on each iteration, both in bases of the first sequence in the alignment (assumed
              to be the reference sequence).  Separate versions  of  all  output  files  will  be
              created for each window.

       --windows-explicit,  -v  <window_coord_list> Like --windows, except that all start and end
              coordinates must be explicitly specified.   Each  successive  pair  of  numbers  is
              interpreted  as  defining  the  start  and  end  of  a  window.  Can be used with a
              two-column file and the '*' operator, e.g., --windows-explicit '*mycoords'.

SEE ALSO

       A. Siepel and D. Haussler.
              2004.  Phylogenetic estimation of context-dependent substitution rates  by  maximum
              likelihood.  Mol. Biol. Evol., 21:468-488.

       Z. Yang, N. Goldman, and A. Friday.
              1994.  Comparison  of  models  for nucleotide substution used in maximum likelihood
              phylogenetic estimation. Mol. Biol. Evol., 11:316-324.

       Z. Yang. 1994. Maximum likelihood phylogenetic estimation from
              DNA sequences with variable rates over sites: approximate methods. J.  Mol.  Evol.,
              39:306-314.