Provided by: phast_1.6+dfsg-3_amd64

NAME

       phyloP - Compute conservation or acceleration p-values based on an alignment and a model
       of neutral evolution.  The phylogenetic model must be in the .mod format produced by the
       phyloFit program.  The alignment file can be in any of several file formats (see
       --msa-format).  No alignment is required with the --null option.

DESCRIPTION

       Compute conservation or acceleration p-values based on an alignment and a model of neutral
       evolution.   Will  also  compute p-values of conservation/acceleration in a subtree and in
       its complementary supertree given  the  whole  tree  (see  --subtree).   P-values  can  be
       produced  for  entire  input  alignments  (the default), pre-specified intervals within an
       alignment (see --features), or individual sites (see --wig-scores and --base-by-base).

       The default  behavior  is  to  compute  a  null  distribution  for  the  total  number  of
       substitutions  from  the  tree model, an estimate of the number of substitutions that have
       actually occurred, and the p-value of this estimate with respect to the null distribution.
       These computations are performed as described by Siepel, Pollard, and Haussler (2006).  In
       addition to the SPH method,  phyloP  can  compute  p-values  or  conservation/acceleration
       scores  using a likelihood ratio test (--method LRT), a score-based test (--method SCORE),
       or a procedure similar to that used by GERP (Cooper et al., 2005) (--method GERP).   These
       alternative  methods  are  currently  supported only with --base-by-base, --wig-scores, or
       --features.

       The main advantage of the SPH  method  is  that  it  can  provide  a  complete  and  exact
       description   of   distributions  over  numbers  of  substitutions.   However,  simulation
       experiments suggest that the LRT and SCORE methods have somewhat better power than SPH for
       identifying  selection,  especially  when  the  expected  number of substitutions is small
       (e.g., with short branch lengths and/or  short  intervals/individual  sites).   These  two
       methods  are also faster.  They are generally similar to one another in power, but in many
       cases SCORE is considerably faster than LRT.  On the other hand,  SCORE  appears  to  have
       slightly  less  power  than  LRT  at  low false positive rates, i.e., for cases of extreme
       selection.   Thus,  when  using  --base-by-base,  --wig-scores,  or  --features,  LRT   is
       recommended for most purposes, but SCORE is a good alternative if speed is an issue.  When
       computing p-values with the SPH method, the default  is  to  use  the  posterior  expected
       number  of  substitutions  as  an  estimate  of the actual number.  This is a conservative
       estimate, because it is biased toward the mean of the  null  distribution  by  the  prior.
       These  p-values  can be made less conservative with --fit-model and more conservative with
       --confidence-interval (see below).

EXAMPLE

       1. Using the SPH method, compute and report p-values of conservation and acceleration  for
       a  given  alignment  with  respect  to a neutral model of evolution.  Estimated numbers of
       substitutions are also reported.

              phyloP neutral.mod alignment.fa > report.txt

       The file neutral.mod could be produced by running phyloFit on data from ancestral  repeats
       or fourfold degenerate sites with an appropriate tree topology and substitution model.

       2.  Compute  and report p-values of conservation and acceleration for a particular subtree
       of interest (using SPH).

              phyloP --subtree human-mouse_lemur neutral.mod alignment.fa > report.txt

       Here human-mouse_lemur denotes the most recent common ancestor of human and mouse_lemur,
       which  is  the  node  that  defines  the primate clade in this phylogeny.  The tree_doctor
       program with the --name-ancestors option can be used to assign names to ancestral nodes of
       the tree.

       3.  Describe  the  complete  null distribution over the number of substitutions for a 10bp
       alignment given the specified neutral model (using SPH).

              phyloP --null 10 neutral.mod > null.txt

       A two-column table is produced with numbers of substitutions and their  probabilities,  up
       to an appropriate upper limit.

       4.  Describe  the  complete  posterior  distribution over the number of substitutions in a
       given alignment (using SPH).

              phyloP --posterior neutral.mod alignment.fa > posterior.txt

       5. Compute conservation scores (-log10 p-values) for each site in an alignment and output
       them in the fixed-step wig format (see http://genome.ucsc.edu/goldenPath/help/wiggle.html).
       Use the likelihood ratio test (LRT) method.

              phyloP --wig-scores --method LRT neutral.mod alignment.fa > scores.wig

       The  --mode  option  can  be  used instead to produce acceleration scores (ACC), scores of
       nonneutrality (NNEUT), or scores that summarize conservation  and  acceleration  (CONACC).
       The  --base-by-base  option  can  be  used  to  output  additional  statistics of interest
       (estimated scale factors, log10 likelihood ratios, etc.).   As  discussed  above,  several
       arguments to --method are possible.

       6. Similarly, compute scores describing lineage-specific conservation in primates.

              phyloP --wig-scores --method LRT --subtree human-mouse_lemur \
                     neutral.mod alignment.fa > scores.wig

       7. Compute conservation p-values and associated statistics for each element in a BED file.
       This  time  use  a score test and allow for acceleration as well as conservation, flagging
       elements under acceleration by making their p-values negative (CONACC mode).

              phyloP --features elements.bed --method SCORE --mode CONACC \
                     neutral.mod alignment.fa > element-scores.txt

       The --features option can also be used with --subtree.  The --gff-scores option can be used to
       output the original features in GFF format with scores equal to -log10 p.  Note  that  the
       input file can be in GFF instead of BED format.

OPTIONS

       --msa-format, -i FASTA|PHYLIP|MPM|MAF|SS

              Alignment format (default is to guess format from file contents).

       --method, -m SPH|LRT|SCORE|GERP

              Method  used to compute p-values or conservation/acceleration scores (Default SPH).
              The likelihood ratio test (LRT) and score test (SCORE) compare an alternative model
              having  a  free  scale  parameter with the given neutral model, or, if --subtree is
              used, an alternative model having free  scale  parameters  for  the  supertree  and
              subtree  with  a  null  model  having  a single free scale parameter.  P-values are
              computed  by  comparing   test   statistics   with   asymptotic   chi-square   null
              distributions.   The  GERP-like  method  (GERP)  estimates  the number of "rejected
              substitutions" per base by comparing the  (per-site)  maximum  likelihood  expected
              number  of  substitutions  with  the  expected  number  under  the  neutral  model.
              Currently LRT, SCORE, and GERP can be used only with --base-by-base,  --wig-scores,
              or --features.
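
              As a rough illustration of the chi-square comparison described above, the following
              Python sketch converts a test statistic into an upper-tail p-value.  It assumes one
              degree of freedom (a single free scale parameter tested against the fixed neutral
              model).  The function name chi2_sf_1df is hypothetical and is not part of phyloP,
              and phyloP's own handling of the one-sided modes is not reproduced here.

```python
import math


def chi2_sf_1df(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    Assumption: 1 degree of freedom, i.e. one free scale parameter.
    """
    if stat < 0:
        raise ValueError("test statistic must be non-negative")
    # For 1 df, P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


# A statistic near 3.84 corresponds to p of about 0.05 for 1 df.
print(chi2_sf_1df(3.841458820694124))
```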

       --wig-scores, -w

              Compute  separate  p-values  per  site, and then compute site-specific conservation
              (acceleration) scores as -log10(p).  Output base-by-base scores in  fixed-step  wig
              format,  using  the coordinate system of the reference sequence (see --refidx).  In
              GERP mode, outputs rejected substitutions per site instead of -log10 p-values.
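
              The fixed-step wig output can be read back with a few lines of Python.  This is a
              minimal sketch that assumes the standard UCSC fixedStep layout (a declaration line
              with chrom, start, and optional step fields, followed by one value per line).  The
              function name and the sample lines are hypothetical.  Outside GERP mode the values
              are -log10 p-values, so a score of 2 corresponds to p = 0.01.

```python
def parse_fixed_step_wig(lines):
    """Yield (chrom, position, value) tuples from fixedStep wig lines."""
    chrom, pos, step = None, None, 1
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments, and optional track/browser lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        if line.startswith("fixedStep"):
            fields = dict(f.split("=", 1) for f in line.split()[1:])
            chrom = fields["chrom"]
            pos = int(fields["start"])
            step = int(fields.get("step", 1))
            continue
        if chrom is None:
            raise ValueError("data line before fixedStep declaration")
        yield chrom, pos, float(line)
        pos += step


# Hypothetical sample, not real phyloP output.
sample = ["fixedStep chrom=chr1 start=101 step=1", "0.52", "1.30", "2.00"]
for chrom, pos, score in parse_fixed_step_wig(sample):
    if score >= 2.0:  # keep sites with p <= 0.01
        print(chrom, pos, score)
```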

       --base-by-base, -b

              Like --wig-scores, but outputs multiple values per site, in a method-dependent way.
              With SPH, output includes the mean and variance of the posterior distribution; with
              LRT and SCORE, it includes the estimated scale factor(s) and test statistics; and
              with GERP, it includes the estimated numbers of neutral, observed, and rejected
              substitutions, along with the number of species available at each site.

       --refidx, -r <refseq_idx>

              (for use with --wig-scores or --base-by-base) Use  coordinate  frame  of  specified
              sequence  in  output.  Default value is 1, first sequence in alignment; 0 indicates
              coordinate frame of entire multiple alignment.
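
              The difference between the two coordinate frames can be illustrated with a short
              Python sketch.  It maps each alignment column to a 1-based position in a chosen
              sequence and skips columns where that sequence has a gap.  The helper name and the
              gap character "-" are assumptions, and how phyloP itself reports gap columns is not
              shown.

```python
def reference_coordinates(ref_row, gap="-"):
    """Map each alignment column to a 1-based reference position.

    Columns where the reference has a gap map to None.
    """
    coords = []
    pos = 0
    for ch in ref_row:
        if ch == gap:
            coords.append(None)
        else:
            pos += 1
            coords.append(pos)
    return coords


# Hypothetical reference row: alignment columns 3 and 4 are gaps in the
# reference, so they have no coordinate in its frame.
print(reference_coordinates("AC--GT"))
```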

       --mode, -o CON|ACC|NNEUT|CONACC

              (For use with --wig-scores,  --base-by-base,  or  --features)  Whether  to  compute
              one-sided   p-values  so  that  small  p  (large  -log10  p)  indicates  unexpected
              conservation (CON; the default) or acceleration (ACC); or two-sided  p-values  such
              that small p indicates an unexpected departure from neutrality (NNEUT).  The fourth
              option (CONACC) uses positive values (p-values or scores) to indicate  conservation
              and  negative  values  to indicate acceleration.  In GERP mode, CON and CONACC both
              report the number of rejected substitutions R (which may be  negative),  while  ACC
              reports -R, and NNEUT reports abs(R).
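
              The sign conventions above can be summarized in a small Python sketch.  The
              function names are hypothetical.  The first function applies the CONACC convention
              to a score of the form -log10(p), and the second applies the GERP-mode rules to a
              number of rejected substitutions R.

```python
import math


def conacc_score(p, accelerated=False):
    """Return -log10(p), negated when the site or element is accelerated."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    score = -math.log10(p)
    return -score if accelerated else score


def gerp_value(rejected, mode):
    """Value reported for R rejected substitutions under each --mode setting."""
    if mode in ("CON", "CONACC"):
        return rejected
    if mode == "ACC":
        return -rejected
    if mode == "NNEUT":
        return abs(rejected)
    raise ValueError("unknown mode: " + mode)


print(conacc_score(0.001))                    # conserved
print(conacc_score(0.001, accelerated=True))  # accelerated
print(gerp_value(-1.5, "ACC"))
```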

       --features, -f <file>

              Read  features  from  <file> (GFF or BED format) and output a table of p-values and
              related statistics with one row per feature.  The features are assumed to  use  the
              coordinate  frame  of the first sequence in the alignment.  Not for use with --null
              or --posterior.  See also --gff-scores.

       --gff-scores, -g

              (For use with --features) Instead of a table, output a GFF and assign each feature a
              score equal to its -log10 p-value.

       --subtree, -s <node-name>

              (Not  available  in GERP mode) Partition the tree into the subtree beneath the node
              whose   name   is   given   and   the   complementary   supertree,   and   consider
              conservation/acceleration in the subtree given the supertree.  The branch above the
              specified  node  is  included   with   the   subtree.    Thus,   given   the   tree
              "((human,chimp)primate,(mouse,rat)rodent)",  the  option  "--subtree  primate" will
              create one partition consisting of human, chimp, and the branch  leading  to  them,
              and  another  partition  consisting of the rest of the tree; "--subtree human" will
              create one partition consisting only of human and the  branch  leading  to  it  and
              another  partition consisting of the rest of the tree.  In 'SPH' mode, a reversible
              substitution model is assumed.
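
              The partition described above can be sketched in Python for the example tree.  Each
              branch is named by its child node, the tree is given as a child-to-parent map, and
              the root label "root" is a hypothetical name added for illustration.  This is not
              phyloP's implementation, only a restatement of the rule that the branch above the
              named node goes with the subtree.

```python
def partition_branches(parent, node):
    """Split branches (named by child node) into subtree and supertree sets."""
    if node not in parent:
        raise ValueError("node has no branch above it: " + node)
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)
    subtree = set()
    stack = [node]
    while stack:
        current = stack.pop()
        subtree.add(current)  # the branch above `current`
        stack.extend(children.get(current, []))
    supertree = set(parent) - subtree
    return subtree, supertree


# ((human,chimp)primate,(mouse,rat)rodent) as a child -> parent map.
parent = {
    "human": "primate", "chimp": "primate",
    "mouse": "rodent", "rat": "rodent",
    "primate": "root", "rodent": "root",
}
print(partition_branches(parent, "primate"))
print(partition_branches(parent, "human"))
```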

       --branch, -B <node-name(s)>

              (Not available in GERP or SPH mode) Like --subtree, but partitions the tree into
              the set of named branches (each named by its child node) and all the remaining
              branches, then tests for conservation/acceleration in the set of named branches
              relative to the others.  The argument is a comma-delimited list of child nodes.

       --chrom, -N <name>

              (Optionally  use  with  --wig-scores  or  --base-by-base)  Chromosome  name for wig
              output.  Default is root of multiple alignment filename.

       --log, -l <fname>

              Write log to <fname> describing details  of  parameter  optimization.   Useful  for
              debugging.  (Warning: may produce large file.)

       --seed, -d <seed>

              Provide a random number seed, which should be an integer >= 1.  Random numbers are
              used in some cases to generate starting values for optimization.  If no seed is
              specified, one based on the current time is used.

       --no-prune, -P

              Do not prune species from the tree that are not in the alignment.  Rather, treat
              these species as having missing data in the alignment.  Missing data does have an
              effect on the results when --method SPH is used.

       --help, -h

              Produce this help message.

   Options for SPH mode only
       --null, -n <nsites>

              Compute just the null (prior) distribution of the number of substitutions, as
              defined by the tree model and the given number of sites, and output as a table.
              The 'alignment' argument will be ignored.  If used with --subtree, the joint
              distribution over the number of substitutions in the specified supertree and
              subtree will be output instead.
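
              A table of substitution counts and their null probabilities can be turned into tail
              probabilities.  The Python sketch below assumes that a p-value of conservation is
              the lower tail P(N <= observed) and a p-value of acceleration is the upper tail
              P(N >= observed).  The function name and the sample distribution are hypothetical,
              and parsing of the actual phyloP table is not shown.

```python
def tail_p_values(null_table, observed):
    """Return (p_conservation, p_acceleration) for an observed count.

    null_table maps a substitution count to its null probability.
    Assumption: conservation uses the lower tail, acceleration the upper tail.
    """
    p_cons = sum(p for n, p in null_table.items() if n <= observed)
    p_acc = sum(p for n, p in null_table.items() if n >= observed)
    return p_cons, p_acc


# Hypothetical null distribution, not real phyloP output.
null_table = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.2, 4: 0.1}
p_cons, p_acc = tail_p_values(null_table, 1)
print(f"conservation p = {p_cons:.3f}, acceleration p = {p_acc:.3f}")
```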

       --posterior, -p

              Compute just the posterior distribution of the number of substitutions, given the
              alignment and the model, and output as a table.  If used with --subtree, the joint
              distribution over the number of substitutions in the specified supertree and
              subtree will be output instead.

       --fit-model, -F

              Fit model to data before computing posterior distribution, by  estimating  a  scale
              factor  for  the  whole  tree  or  (if  --subtree)  separate  scale factors for the
              specified subtree and supertree.  Makes p-values less  conservative.   This  option
              has  no effect with --null and currently cannot be used with --features.  It can be
              used with --wig-scores and --base-by-base.

       --epsilon, -e <val>

              (Default 1e-10, or 1e-6 with --wig-scores or --base-by-base) Threshold used in
              truncating  tails  of  distributions;  tail  probabilities less than this value are
              discarded.  To get accurate p-values smaller than 1e-10, this option will  need  to
              be  used,  at some cost in speed.  Note that truncation affects only *right* tails,
              not left tails, so it should be an issue only with p-values of acceleration.

       --confidence-interval, -c <val>

              Allow for uncertainty in the estimate of the  actual  number  of  substitutions  by
              using a (central) confidence interval about the mean of the specified size (0 < val
              < 1).  To be conservative, the maximum of this interval is used  when  computing  a
              p-value  of  conservation,  and  the  minimum  is  used when computing a p-value of
              acceleration.   The  variance  of  the  posterior  is  computed  exactly,  but  the
              confidence  interval is based on the assumption that the combined distribution will
              be approximately normal (true for large numbers of sites by central limit theorem).
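
              The normal approximation described above can be sketched in Python.  Given a
              posterior mean and variance, the sketch returns the endpoints of a central interval
              of the requested size.  The function name and the sample numbers are hypothetical,
              and any further adjustment phyloP may apply to the interval is not shown.

```python
from statistics import NormalDist


def central_interval(mean, variance, size):
    """Central interval of the given size under a normal approximation."""
    if not 0.0 < size < 1.0:
        raise ValueError("size must satisfy 0 < size < 1")
    # Quantile that leaves (1 - size) / 2 in each tail.
    z = NormalDist().inv_cdf(0.5 + size / 2.0)
    half_width = z * variance ** 0.5
    return mean - half_width, mean + half_width


# Hypothetical posterior mean and variance of the substitution count.
low, high = central_interval(10.0, 4.0, 0.95)
# Conservative choices: the maximum for a p-value of conservation,
# the minimum for a p-value of acceleration.
print(f"use {high:.3f} for conservation, {low:.3f} for acceleration")
```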

       --quantiles, -q

              (For use with --null or --posterior) Report quantiles of distribution  rather  than
              whole distribution.

SEE ALSO

       Cooper  GM, Stone EA, Asimenos G, NISC Comparative Sequencing Program, Green ED, Batzoglou
       S, Sidow A. Distribution and  intensity  of  constraint  in  mammalian  genomic  sequence.
       Genome Res. 2005 15(7):901-13.

       Siepel A, Pollard KS, Haussler D. New methods for detecting lineage-specific selection.
       In Proceedings of the 10th International Conference on Research in Computational
       Molecular Biology (RECOMB 2006), pp. 190-205.