Ubuntu Manpage: exoniphy - Prediction of evolutionarily conserved protein-coding exons using Required

NAME

       exoniphy  -  Prediction  of  evolutionarily  conserved protein-coding exons using Required
       argument <msa_fname> must be a multiple alignment file, in one of several possible formats
       (see --msa-format).

DESCRIPTION

       Prediction  of  evolutionarily  conserved protein-coding exons using a phylogenetic hidden
       Markov model (phylo-HMM).  By default, a model definition and model  parameters  are  used
       that are appropriate for exon prediction in human DNA, based on human/mouse/rat alignments
       and a 60-state HMM.  Using the --hmm, --tree-models, and --catmap options, however, it  is
       possible  to  define  alternative  phylo-HMMs,  e.g.,  for  different  sets of species and
       different phylogenies, or for prediction of exon pairs or complete gene structures.

OPTIONS

              (Model definition and model parameters)

       --hmm, -H <fname>

              Name of HMM file defining states and transition  probabilities.   By  default,  the
              60-state  HMM  described  in  Siepel  &  Haussler  (2004)  is used, with transition
              probabilities appropriate for mammalian genomes (estimated  as  described  in  that
              paper).

       --tree-models, -m <fname_list> List of tree model (*.mod) files, one for each state in the
              HMM.  Order of models must correspond to order of states in HMM file.  By  default,
              a  set  of  models  appropriate  for  human,  mouse, and rat are used (estimated as
              described in Siepel & Haussler, 2004).

       --catmap, -c <fname>|<string>

       Mapping of feature types to category numbers.
              Can give either

              a filename or an "inline" description of a  simple  category  map,  e.g.,  --catmap
              "NCATS  = 3 ; CDS 1-3".  By default, a category map is used that is appropriate for
              the 60-state HMM mentioned above.

       --extrapolate, -e <phylog.nh> | default Extrapolate to a larger set of  species  based  on
              the  given  phylogeny  (Newick-format).   The trees in the given tree models (*.mod
              files) must be subtrees of the larger phylogeny.  For each tree  model  M,  a  copy
              will  be  created  of  the larger phylogeny, then scaled such that the total branch
              length of the subtree corresponding to M's tree equals the total branch  length  of
              M's  tree;  this  new version will then be used in place of M's tree.  (Any species
              name present in this tree but not in the data will  be  ignored.)   If  the  string
              "default"  is  given  instead  of  a  filename,  then a phylogeny for 25 vertebrate
              species, estimated from sequence data for Target 1 (CFTR) of the  NISC  Comparative
              Sequencing Program (Thomas et al., 2003), will be assumed.

       --data-path, -D <path>

              Path  to  the  directory  with  phast  data.  Exoniphy  default models should be in
              <path>/exoniphy/. Default is set at compile time.

              (Input and output)

       --msa-format, -i FASTA|PHYLIP|MPM|MAF|SS

       File format of input alignment.
              Default is to guess alignment

              format from file contents.

       --score, -S Report log-odds scores for predictions, equal to their log  total  probability
              under  an  exon  model  minus their log total probability under a background model.
              The exon model  can  be  altered  using  --cds-types  and  --signal-types  and  the
              background model can be altered using --backgd-types (see below).

       --seqname, -s <name>

              Use  specified  string  as "seqname" field in GFF output.  Default is obtained from
              input  file  name  (double  filename  root,  e.g.,  "chr22"  if   input   file   is
              "chr22.35.ss").

       --idpref, -p <name>

              Use  specified  string  as  prefix  of generated ids in GFF output.  Can be used to
              ensure ids are unique.  Default is obtained from input file name  (single  filename
              root, e.g., "chr22.35" if input file is "chr22.35.ss").

       --grouptag,  -g  <tag>  Use  specified  string  as  the  tag denoting groups in GFF output
              (default is "transcript_id").

       --alias, -A <alias_def>

              Alias names in input alignment according to given  definition,  e.g.,  "hg17=human;
              mm5=mouse;  rn3=rat".   Useful  with  default  tree  models and with --extrapolate.
              (Default models use generic common names such as "human", "mouse", and "rat".  This
              option  allows  a  mapping  to  be established between the leaves of trees in these
              files  and  the  sequences  of  an  alignment  that  uses  an  alternative   naming
              convention.)

              (Altering the states and transition probabilities of the HMM)

       --no-cns, -x

              Eliminate  the state/category for conserved noncoding sequence from the default HMM
              and category map.  Ignored if non-default HMM and category map are selected.

       --reflect-strand, -U

              Given an HMM describing the forward strand, create a larger  HMM  that  allows  for
              features  on  both strands by "reflecting" the HMM about all states associated with
              background  categories  (see  --backgd-cats).   The  new  HMM  will  be  used   for
              predictions  on both strands.  If the default HMM is used, then this option will be
              used automatically.

       --bias, -b <val>

              Set "coding bias" equal to the specified value (default -3.33  if  default  HMM  is
              used,  0  otherwise).   The  coding  bias  is  added  to  the  log probabilities of
              transitions from background states to non-background  states  (see  --backgd-cats),
              then  all  transition  probabilities  are  renormalized.   If  the  coding  bias is
              positive, then more predictions will tend to be made and sensitivity will  tend  to
              improve,  at  some  cost  to specificity; if it is negative, then fewer predictions
              will tend to be made, and specificity  will  tend  to  improve,  at  some  cost  to
              sensitivity.

       --sens-spec,  -Y <fname-root> Make predictions for a range of different coding biases (see
              --bias), and write results to files with given  filename  root.   This  allows  the
              sensitivity/specificity  tradeoff to be examined.  The range is fixed at -20 to 10,
              and 10 different sets of predictions are produced.  (Feature types)

       --backgd-types, -B <list>

              Feature types to be  considered  "background"  (default  value:  "background,CNS").
              Affects --reflect-strand, --score, and --bias.

       --cds-types, -C <list>

              (for use with --score) Feature types that represent protein-coding regions (default
              value: "CDS").

       --signal-types, -L <list> (for use with  --score)  Types  of  features  to  be  considered
              "signals"            during            scoring            (default           value:
              "start_codon,stop_codon,5'splice,3'splice,prestart,cds5'ss,cds3'ss").  One score is
              produced  for  a  CDS  feature  (as defined by --cds-types) and the adjacent signal
              features; the score is then assigned to the CDS feature.

              (Indels)

       --indels, -I

              Use the indel model described in Siepel & Haussler (2004).

       --no-gaps, -W <list> Prohibit gaps in sites of the specified categories  (gaps  result  in
              emission  probabilities  of  zero).   If  the  default  category  map  is used (see
              --catmap), then gaps are prohibited in start and stop codons and at  the  canonical
              GT  and  AG  positions  of  splice  sites (with or without --indels).  In all other
              cases, the default behavior is to treat gaps as missing data, or  to  address  them
              with the indel model (--indels).

       --require-informative, -N <list>

              Require  "informative"  columns  (i.e., columns with more than two non-missing-data
              characters, excluding  sequences  specified  by  --not-informative)  in  the  given
              categories  (list  by  name  or  number).   Non-informative  columns  will be given
              emission probabilities  of  zero.   If  the  default  category  map  is  used  (see
              --catmap),  then  this option applies automatically to CDSs, start and stop codons,
              and the canonical GT and AG positions of splice sites.  Note  that  alignment  gaps
              *are*  considered  informative; the way they are handled is defined by --indels and
              --no-gaps.

       --not-informative, -n <list>

              Do not consider the specified sequences (listed by name) when  deciding  whether  a
              column  is  informative.  This option can be useful when sequences are present that
              are very close to the reference sequence and thus do not contribute much in the way
              of  phylogenetic information.  E.g., one might use "--not-informative chimp" with a
              human-referenced multiple alignment including chimp sequence.

              (Other)

       --quiet, -q

              Proceed quietly (without messages to stderr).

       --help -h Print this help message.

       REFERENCES:  A.  Siepel  and  D.  Haussler.   2004.    Computational   identification   of
       evolutionarily conserved exons.  Proc. 8th Annual Int'l Conf.

              on  Research  in Computational Biology (RECOMB '04), pp. 177-186.  J. Thomas et al.
              2003.  Comparative  analyses  of  multi-species  sequences  from  targeted  genomic
              regions.  Nature 424:788-793.