Ubuntu Manpage: phastCons - Identify conserved elements or produce conservation scores, given

NAME

       phastCons - Identify conserved elements or produce conservation scores, given

SYNOPSIS

       The  alignment  file  can  be  in  any  of  several  file formats (see --msa-format).  The
       phylogenetic models must be in the .mod format produced by the phyloFit program.

DESCRIPTION

Identify conserved elements or produce conservation scores, given a multiple alignment and
a phylo-HMM. By default, a phylo-HMM consisting of two states is assumed: a "conserved"
state and a "non-conserved" state. Separate phylogenetic models can be specified for
these two states, e.g.,

phastCons myfile.ss cons.mod,noncons.mod > scores.wig

or a single model can be given for the non-conserved state, e.g.,

phastCons myfile.ss --rho 0.5 noncons.mod > scores.wig

in which case the model for the conserved state will be obtained by multiplying all branch
lengths by the scaling parameter rho (0 < rho < 1). If the --rho option is not used, rho
will be set to its default value of 0.3.

By default, the phylogenetic models will be left unaltered, but if the --estimate-trees
option is used, e.g.,

phastCons myfile.ss init.mod --estimate-trees newtree > scores.wig

then the phylogenetic models for the two states will be estimated from the data, and the
given tree model (there must be only one in this case) will be used for initialization
only. It is also possible to estimate only the scale factor --rho, using the
--estimate-rho option. The transition probabilities for the HMM can either be specified
at the command line or estimated from the data using an EM algorithm. To specify them at
the command line, use either the --transitions option or the --target-coverage and
--expected-length options. The recommended method is to use --target-coverage and
--expected-length, e.g.,

phastCons --target-coverage 0.25 --expected-length 12 myfile.ss
cons.mod,noncons.mod > scores.wig

The program produces two main types of output.
The primary output, sent to stdout in fixed-step WIG format
(http://genome.ucsc.edu/goldenPath/help/wiggle.html), is a set of base-by-base
conservation scores. The score at each base is equal to the posterior probability that
that base was "generated" by the conserved state of the phylo-HMM. The scores are
reported in the coordinate frame of a designated reference sequence (see --refidx), which
is by default the first sequence in the alignment. They can be suppressed with the
--no-post-probs option. The secondary type of output, activated with the --most-conserved
(aka --viterbi) option, is a set of discrete conserved elements. These elements are
output in either BED or GFF format, also in the coordinate system of the reference
sequence (see --most-conserved). They can be assigned log-odds scores using the --score
option.

Other uses are also supported, but will not be described in detail here. For example, it
is possible to produce conservation scores and conserved elements using a k-state
phylo-HMM of the kind described by Felsenstein and Churchill (1996) (see --FC), and it is
possible to produce a "coding potential" score instead of a conservation score (see
--coding-potential). It is also possible to give the program a custom HMM and to specify
any subset of its states to use for prediction (see --hmm and --states).

See the phastCons HOWTO for additional details.

EXAMPLE

1. Given phylogenetic models for conserved and nonconserved regions and HMM transition
parameters, compute a set of conservation scores.

phastCons --transitions 0.01,0.01 mydata.ss cons.mod,noncons.mod > scores.wig

2. Similar to (1), but define the conserved model as a scaled version of the nonconserved
model, with rho=0.4 as the scaling parameter. Also predict conserved elements as well as
conservation scores, and assign log-odds scores to predictions.

phastCons --transitions 0.01,0.01 --most-conserved mostcons.bed --score --rho 0.4
mydata.ss noncons.mod > scores.wig

(if output file were "mostcons.gff," then output would be in GFF instead of BED format)

3. This time, estimate the parameter rho from the data. Suppress both the scores and the
conserved elements. Specify the transition probabilities using --target-coverage and
--expected-length instead of --transitions.

phastCons --target-coverage 0.25 --expected-length 12 --estimate-rho newtree
--no-post-probs mydata.ss noncons.mod

4. This time estimate all free parameters of the tree models.

phastCons --target-coverage 0.25 --expected-length 12 --estimate-trees newtree
--no-post-probs mydata.ss noncons.mod

5. Estimate the state-transition parameters but not the tree models. Output the
conservation scores but not the conserved elements.

phastCons mydata.ss cons.mod,noncons.mod > scores.wig

6. Estimate just the expected-length parameter and also estimate rho.

phastCons --target-coverage 0.25 --estimate-rho newtree mydata.ss noncons.mod >
scores.wig

OPTIONS

   Tree models
       --rho, -R <rho>

              Set the *scale* (overall evolutionary rate) of the model for the conserved state to
              be <rho> times that of the model for the  non-conserved  state  (0  <  <rho>  <  1;
              default 0.3).  If used with --estimate-trees or --estimate-rho, the specified value
              will be used for initialization only (rho  will  be  estimated).   This  option  is
              ignored if two tree models are given.

       --estimate-trees,  -T  <fname_root>  Estimate free parameters of tree models and write new
              models to <fname_root>.cons.mod and <fname_root>.noncons.mod.

       --estimate-rho, -O <fname_root>

              Like --estimate-trees, but estimate only the parameter rho.

       --gc,  -G  <val>  (Optionally  use  with  --estimate-trees  or  --estimate-rho)  Assume  a
              background nucleotide distribution consistent with the given average G+C content (0
              < <val> < 1) when estimating tree models.  (The frequencies of G and C will be  set
              to <val>/2 and the frequencies of A and T will be set to (1-<val>)/2.)  This option
              overrides the default behavior of estimating the background distribution  from  the
              data   (if   --estimate-trees)   or   obtaining  them  from  the  input  model  (if
              --estimate-rho).

       --nrates, -k <nrates> |  <nrates_conserved,nrates_nonconserved>  (Optionally  use  with  a
              discrete-gamma  model  and  --estimate-trees)  Assume  the specified number of rate
              categories, instead of the number given in the *.mod  file.   The  shape  parameter
              'alpha'  will  be as given in the *.mod file.  In the case of the default two-state
              HMM, two values can be specified, for the numbers of rates for  the  conserved  and
              the nonconserved states, resp.

   State-transition parameters
       --transitions, -t [~]<mu>,<nu>

              Fix  the  transition  probabilities  of the two-state HMM as specified, rather than
              estimating them by  maximum  likelihood.   Alternatively,  if  first  character  of
              argument  is  '~',  estimate  parameters,  but initialize to specified values.  The
              argument <mu> is the  probability  of  transitioning  from  the  conserved  to  the
              non-conserved  state,  and  <nu> is the probability of the reverse transition.  The
              probabilities of self transitions are thus  1-<mu>  and  1-<nu>  and  the  expected
              lengths of conserved and nonconserved elements are 1/<mu> and 1/<nu>, respectively.

       --target-coverage, -C <gamma>

              (Alternative  to  --transitions)  Constrain  transition  parameters  such  that the
              expected fraction of sites in conserved elements is <gamma>  (0  <  <gamma>  <  1).
              This  is  a *prior* rather than *posterior* expectation and assumes stationarity of
              the state-transition process.  Adding this constraint causes the ratio mu/nu to  be
              fixed  at  (1-<gamma>)/<gamma>.   If  used  with  --expected-length, the transition
              probabilities will be completely fixed;  otherwise  the  expected-length  parameter
              <omega>  will be estimated by maximum likelihood.  --expected-length, -E [~]<omega>
              {--expected-lengths also allowed, for backward compatibility}

              (For use with  --target-coverage,  alternative  to  --transitions)  Set  transition
              probabilities  such  that  the  expected  length of a conserved element is <omega>.
              Specifically, the parameter mu is set to 1/<omega>.  If preceded  by  '~',  <omega>
              will be estimated, but will be initialized to the specified value.

   Input/output
       --msa-format, -i PHYLIP|FASTA|MPM|SS|MAF

       Alignment file format.
              Default is to guess format based on

       file contents.
              Note that the msa_view program can be used to

              convert between formats.

       --viterbi [alternatively --most-conserved], -V <fname> Predict discrete elements using the
              Viterbi algorithm and write to specified file.  Output is  in  BED  format,  unless
              <fname> has suffix ".gff", in which case output is in GFF.

       --score, -s (Optionally use with --viterbi) Assign a log-odds score to each prediction.

       --lnl, -L <fname>

              Compute  total  log  likelihood  using the forward algorithm and write to specified
              file.

       --no-post-probs, -n Suppress output of posterior probabilities.  Useful if  only  discrete
              elements or likelihood is of interest.

       --log, -g <log_fname>

              (Optionally  use  when  estimating  free  parameters)  Write  log  of  optimization
              procedure to specified file.

       --refidx, -r <refseq_idx> Use coordinate frame of specified sequence in output.  Default

              value is 1, first sequence in alignment; 0 indicates  coordinate  frame  of  entire
              multiple alignment.

       --seqname,  -N  <name>  (Optionally use with --viterbi) Use specified string for 'seqname'
              (GFF) or 'chrom' field in output file.  Default is obtained from  input  file  name
              (double filename root, e.g., "chr22" if input file is "chr22.35.ss").

       --idpref, -P <name>

              (Optionally  use with --viterbi) Use specified string as prefix of generated ids in
              output file.  Can be used to ensure ids are unique.  Default is obtained from input
              file name (single filename root, e.g., "chr22.35" if input file is "chr22.35.ss").

       --quiet, -q Proceed quietly (without updates to stderr).

       --help, -h

              Print this help message.  (Indels) [experimental]

       --indels, -I

              Expand HMM state space to model indels as described in Siepel & Haussler (2004).

       --max-micro-indel,  -Y  <length>  (Optionally  use  with  --indels)  Maximum  length of an
              alignment gap to be considered a "micro-indel" and therefore addressed by the indel
              model.   Gaps  longer than this threshold will be treated as missing data.  Default
              value is 20.

       --indel-params, -D [~]<alpha_0,beta_0,tau_0,alpha_1,beta_1,tau_1>

              (Optionally use with --indels and default two-state HMM) Fix the  indel  parameters
              at (alpha_0, beta_0, tau_0) for the conserved state and at (alpha_1, beta_1, tau_1)
              for the non-conserved state, rather than estimating  them  by  maximum  likelihood.
              Alternatively,  if  first  character  of  argument is '~', estimate parameters, but
              initialize with specified values.  Alpha_j is the  rate  of  insertion  events  per
              substitution  per site in state j (typically ~0.05), beta_j is the rate of deletion
              events per substitution per site  in  state  j  (typically  ~0.05),  and  tau_j  is
              approximately  the  inverse  of  the  expected  indel  length in state j (typically
              0.2-0.5).

       --indels-only, -J Like --indels but force the use of  a  single-state  HMM.   This  option
              allows  the  effect  of  the  indel  model  in  isolation  to be observed.  Implies
              --no-post-probs.  Use with --lnl.  (Felsenstein/Churchill model) [rarely used]

       --FC, -X

              (Alternative to --hmm; specify only one *.mod file with this  option)  Use  an  HMM
              with  a  state  for  every  rate  category  in  the  given  phylogenetic model, and
              transition  probabilities  defined  by  an  autocorrelation  parameter  lambda  (as
              described  by  Felsenstein  and  Churchill,  1996).  A rate constant for each state
              (rate category) will be multiplied by the branch lengths of the phylogenetic model,
              to  create  a  "scaled"  version  of the model for that state.  If the phylogenetic
              model was estimated using Yang's discrete gamma method  (-k  option  to  phyloFit),
              then  the rate constants will be defined according to the estimated shape parameter
              'alpha', as described by Yang (1994).  Otherwise, a nonparameteric  model  of  rate
              variation  must have been used (-K option to phyloFit), and the rate constants will
              be as defined (explicitly) in the *.mod file.  By  default,  the  parameter  lambda
              will be estimated by maximum likelihood (see --lambda).

       --lambda, -l [~]<lambda>

              (Optionally use with --FC) Fix lambda at the specified value rather than estimating
              it by maximum likelihood.  Alternatively, if first character is '~',  estimate  but
              initialize  at  specified  value.  Allowable range is 0-1.  With k rate categories,
              the transition probability between state i and state j will be lambda * I(i == j) +
              (1-lambda)/k,  where  I  is  the  indicator  function.  Thus, lambda = 0 implies no
              autocorrelation and lambda = 1 implies perfect autocorrelation.  (Coding potential)
              [experimental]

       --coding-potential, -p

              Use  parameter settings that cause output to be interpretable as a coding potential
              score.  By default, a simplified version of exoniphy's phylo-HMM is  used,  with  a
              noncoding  (background)  state,  a conserved non-coding (CNS) state, and states for
              the three codon positions.  This option implies --catmap "NCATS=4; CNS 1; CDS  2-4"
              --hmm  <default-HMM-file> --states CDS --reflect-strand background,CNS and a set of
              default *.mod files (all of which can be overridden).  This option can be used with
              or without --indels.

       --extrapolate, -e <phylog.nh> | default

              Extrapolate   to   a   larger   set   of  species  based  on  the  given  phylogeny
              (Newick-format).  The trees in the given tree models (*.mod files) must be subtrees
              of  the  larger  phylogeny.   For  each tree model M, a copy will be created of the
              larger phylogeny, then scaled such that the total  branch  length  of  the  subtree
              corresponding  to  M's  tree  equals  the total branch length of M's tree; this new
              version will then be used in place of M's tree.  (Any species name present in  this
              tree  but  not  in  the  data  will  be ignored.)  If the string "default" is given
              instead of a filename, then a phylogeny for 25 vertebrate species,  estimated  from
              sequence  data  for  Target  1  (CFTR)  of  the NISC Comparative Sequencing Program
              (Thomas et al., 2003), will be assumed.

       --alias, -A <alias_def>

              Alias names in input alignment according to given  definition,  e.g.,  "hg17=human;
              mm5=mouse;    rn3=rat".     Useful   with   default   *.mod   files,   e.g.,   with
              --coding-potential.  (Default models use generic  common  names  such  as  "human",
              "mouse",  and  "rat".   This  option allows a mapping to be established between the
              leaves of trees in these files and the sequences  of  an  alignment  that  uses  an
              alternative naming convention.)

   Custom HMMs [rarely used]
       --hmm, -H <hmm_fname>

              Name  of  HMM  file explicitly defining the probabilities of all state transitions.
              States in the file must correspond in number and order to  phylogenetic  models  in
              <mod_fname_list>.  Expected file format is as produced by 'hmm_train.'

       --catmap,  -c  <fname>|<string>  (Optionally  use  with --hmm) Mapping of feature types to
              category numbers.  Can give either a filename  or  an  "inline"  description  of  a
              simple category map, e.g., --catmap "NCATS = 3 ; CDS 1-3".

       --states, -S <state_list>

              States  of interest in the phylo-HMM, specified by number (indexing starts with 0),
              or if --catmap, by category name.  Default value is 1.  Choosing  --states  "0,1,2"
              will cause output of the sum of the posterior probabilities for states 0, 1, and 2,
              and/or of regions in which the Viterbi path coincides with (any of) states 0, 1, or
              2 (see --viterbi).

       --reflect-strand, -U <pivot_states>

              (Optionally  use  with  --hmm) Given an HMM describing the forward strand, create a
              larger HMM that allows for features on both strands by  "reflecting"  the  original
              HMM about the specified "pivot" states.  The new HMM will be used for prediction on
              both strands.  States can be specified by number (indexing starts with  0),  or  if
              --catmap, by category name.

   Missing data [rarely used]
       --require-informative,  -M <states> Require "informative" columns (i.e., columns with more
              than  two   non-missing-data   characters,   excluding   sequences   specified   by
              --not-informative)  in  specified  HMM  states,  to  help  eliminate false positive
              predictions.  States can be specified by number (indexing starts  with  0)  or,  if
              --catmap is used, by category name.  Non-informative columns will be given emission
              probabilities of zero.  By default, this option is active, with <states>  equal  to
              the  set  of  states  of  interest  for prediction (as specified by --states).  Use
              "none" to disable completely.

       --not-informative, -F <list>

              Do not consider the specified sequences (listed by name) when  deciding  whether  a
              column  is  informative.  This option may be useful when sequences are present that
              are very close to the reference sequence and thus do not contribute much in the way
              of  phylogenetic information.  E.g., one might use "--not-informative chimp" with a
              human-referenced  multiple   alignment   including   chimp   sequence,   to   avoid
              false-positive  predictions based only on human/chimp alignments (can be a problem,
              e.g., with --coding-potential).

       --ignore-missing, -z

              (For use when estimating transition probabilities) Ignore regions of  missing  data
              in  all  sequences  but  the  reference  sequence (excluding sequences specified by
              --not-informative)  when  estimating  transition  probabilities.   Can  help  avoid
              too-low  estimates  of  <mu>  and <nu> or too-high estimates of <lambda>.  Warning:
              this option should not be used with --viterbi because coordinates in output will be
              unrecognizable.

NAME

SYNOPSIS

DESCRIPTION

EXAMPLE

OPTIONS

SEE ALSO