Provided by: phast_1.6+dfsg-3_amd64

NAME

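       clean_genes - Given a GFF describing a set of genes and a corresponding multiple alignment,
       output a new GFF with only those genes that meet certain "cleanliness" criteria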

DESCRIPTION

       Given a GFF describing a set of genes and a corresponding multiple alignment, output a new
       GFF with only those genes that meet certain "cleanliness" criteria. The coordinates in the
       GFF are assumed to correspond to the reference sequence in the alignment, which is assumed
       to be the first one listed.  Default behavior is simply  to  require  that  all  annotated
       start/stop  codons and splice sites are valid in the reference sequence (GT-AG, GC-AG, and
       AT-AC splice sites are allowed).  This can be used with an  "alignment"  consisting  of  a
       single  sequence  to  filter  out  incorrect annotations.  Options are available to impose
       additional criteria as well, mostly having to do with conservation across species (see the
       '--conserved' option in particular).
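
       The default splice-site criterion described above can be pictured with a small Python
       sketch.  This is an illustration only, not the program's implementation: the helper name
       intron_is_valid, the toy sequence, and the 0-based half-open coordinates are all
       assumptions made for the example.

```python
# Hypothetical helper; illustrates the default splice-site criterion only.
ALLOWED_INTRON_ENDS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def intron_is_valid(ref_seq, intron_start, intron_end):
    """Check the first and last two bases of an intron in the reference.

    Coordinates are 0-based, half-open (an assumption of this sketch;
    GFF itself uses 1-based, inclusive coordinates).
    """
    donor = ref_seq[intron_start:intron_start + 2].upper()
    acceptor = ref_seq[intron_end - 2:intron_end].upper()
    return (donor, acceptor) in ALLOWED_INTRON_ENDS


# Toy reference: two exons around a GT...AG intron.
exon1, intron, exon2 = "ATGGCC", "GTAAGTTTTCAG", "GGTTAA"
ref = exon1 + intron + exon2
start = len(exon1)
end = start + len(intron)
print(intron_is_valid(ref, start, end))
```

       Because only the reference sequence is inspected, the same kind of check applies to an
       "alignment" that consists of a single sequence.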

SYNOPSIS

       clean_genes [options] <gff_fname> <msa_fname>

OPTIONS

       --start, -s

              Require conserved start codons (all species)

       --stop, -t

              Require conserved stop codons (all species)

       --splice, -l

              Require conserved splice sites (all species).  By default, only GT-AG, GC-AG, and
              AT-AC splice sites are allowed (see also --splice-strict).

       --fshift, -f

              Require that no frame-shift gap is present in any species.  Frame shifts are
              evaluated with respect to the reference sequence.  Gaps that have
              non-multiple-of-three lengths are allowed if compensatory gaps occur nearby (see
              source code for details).

       --nonsense, -n

              Require that no premature stop codon is present in any species.

       --conserved, -c

              Implies --start, --stop, --splice, --fshift, and  --nonsense.   Recommended  option
              for cross-species analysis.

       --N-limit, -N <f>

              Maximum  fraction  of bases aligned to CDSs that are Ns in any species (<f> must be
              between 0 and 1).  Default is 0.05.  Set to 1 to allow any number of Ns.

       --clean-gaps, -e

              Require all CDS gaps to be multiples of three in length.  Can be used with
              --conserved.

       --indel-strict, -I

              Implies --clean-gaps, usually used with --conserved.  Prohibits overlapping CDS
              gaps in different sequences, gaps near CDS boundaries, and gaps in the reference
              sequence within and between flanking features (splice sites, etc.; see code for
              details).  Designed for use in training a phylo-HMM with an indel model.

       --splice-strict, -C

              Implies --splice.  Allow only GT-AG canonical splice sites.  Useful when training a
              gene finder with a simple model for splice sites.

       --groupby, -g <tag>

              Group  features  according  to  specified  tag  (default  "transcript_id").  If any
              feature within a group fails, the entire group will be discarded.  By  choosing  to
              group  features  according  to different criteria, you can make the program "clean"
              the data set at different levels.  For example, to clean at the level of individual
              exons,  add  a  tag like "exon_id" to indicate exons (see the program "refeature"),
              and then invoke clean_genes with "--groupby exon_id".

       --msa-format, -i FASTA|PHYLIP|MPM|MAF|SS

              Alignment file format.  Default is to guess format from file contents.

       --refseq, -r <seqfile.fa>

              (Required with --msa-format MAF)  Complete reference sequence for alignment (FASTA
              format).

       --offset5, -o <n>

              (Default 0)  Offset of canonical "GT" with respect to boundary on *intron side* of
              annotated 5' splice sites.  Useful with annotations that describe a window around
              the canonical splice site.

       --offset3, -p <n>

              (Default 0)  Offset of canonical "AG" with respect to boundary on intron side of
              annotated 3' splice sites.

       --log, -L <fname>

              Write human-readable log to specified file.

       --machine-log, -M <fname>

              Like --log, but produces more concise, machine-readable log.

       --stats, -S <fname>

              Write statistics on retained and discarded features to specified file.

       --discards, -d <fname>

              Write discarded features to specified file.

       --no-output, -x

              Suppress output of "cleaned" features to stdout.  Useful if only log file and/or
              stats are of interest.

       --help, -h

              Print this help message.
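
       As a rough illustration of how some of the criteria above combine, the following Python
       sketch applies an N-fraction limit (as in --N-limit) and a multiple-of-three gap rule (as
       in --clean-gaps) to each feature, then discards whole groups (as in --groupby).  It is not
       the program's code.  The function names, the dictionary layout of a feature, the use of
       non-gap bases as the denominator of the N fraction, and the "<=" comparison are all
       assumptions made for the example.

```python
import re
from collections import defaultdict


def n_fraction(aligned_row):
    """Fraction of non-gap bases in one aligned row that are N (assumed definition)."""
    bases = [b for b in aligned_row.upper() if b != "-"]
    return bases.count("N") / len(bases) if bases else 0.0


def gaps_are_clean(aligned_row):
    """True if every run of '-' has a length that is a multiple of three."""
    return all(len(m.group()) % 3 == 0 for m in re.finditer(r"-+", aligned_row))


def feature_passes(rows, n_limit=0.05):
    """A feature passes only if every species' row meets both criteria."""
    return all(n_fraction(r) <= n_limit and gaps_are_clean(r) for r in rows)


def clean_groups(features, tag="transcript_id", n_limit=0.05):
    """Group features by tag; drop a whole group if any member fails."""
    groups = defaultdict(list)
    for feat in features:
        groups[feat[tag]].append(feat)
    return {
        gid: feats
        for gid, feats in groups.items()
        if all(feature_passes(f["rows"], n_limit) for f in feats)
    }


# Hypothetical data: each feature carries its aligned rows (reference first).
features = [
    {"transcript_id": "tx1", "rows": ["ATGGCC", "ATG---"]},
    {"transcript_id": "tx1", "rows": ["GGTTAA", "GG-TAA"]},  # gap of length 1
    {"transcript_id": "tx2", "rows": ["ATGAAA", "ATGAAA"]},
]
print(sorted(clean_groups(features)))
```

       In this toy data the second feature of "tx1" contains a gap of length one, so the whole
       "tx1" group is dropped even though its first feature is clean.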

       NOTES:  Feature types are defined as follows.

              coding exon     <-> "CDS"
              start codon     <-> "start_codon"
              stop codon      <-> "stop_codon"
              5' splice site  <-> "5'splice"
              3' splice site  <-> "3'splice"

              In addition, splice sites in UTR can be separately designated as "5'splice_utr" and
              "3'splice_utr".   Errors  in  these sites will be given a different code in the log
              files, which can be useful for tracking purposes.

              If evaluation is done at the level of individual exons (see --groupby), then splice
              sites  are  considered  independently  rather than in the context of introns.  As a
              result, the exons  flanking  a  GT-AC  or  AT-AG  intron  might  (misleadingly)  be
              considered "clean".
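
              The effect can be seen in a short Python sketch that contrasts a pairwise
              (intron-level) check with independent (site-level) checks.  The function names are
              hypothetical and the logic is a simplification for illustration, not the program's
              implementation.

```python
# Intron-level rule from the DESCRIPTION: only these donor/acceptor pairs.
ALLOWED_PAIRS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}

# Site-level view: each end is judged on its own (assumed simplification).
ALLOWED_DONORS = {donor for donor, _ in ALLOWED_PAIRS}
ALLOWED_ACCEPTORS = {acceptor for _, acceptor in ALLOWED_PAIRS}


def valid_as_pair(donor, acceptor):
    """Both ends are checked together, in the context of one intron."""
    return (donor, acceptor) in ALLOWED_PAIRS


def valid_independently(donor, acceptor):
    """Each end is checked alone, as when grouping by individual exons."""
    return donor in ALLOWED_DONORS and acceptor in ALLOWED_ACCEPTORS


for donor, acceptor in [("GT", "AG"), ("GT", "AC"), ("AT", "AG")]:
    print(donor, acceptor, valid_as_pair(donor, acceptor),
          valid_independently(donor, acceptor))
```

              GT-AC and AT-AG fail the pairwise check but pass when each site is judged alone,
              which is the misleading case described above.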

              With  --fshift and --nonsense, it is possible for entries to pass through that have
              stop codons in the frame of the *reference* sequence, although they do not have any
              in  their  own  frame.   Use  --clean-gaps instead to guarantee that no stop codons
              occur in any sequence in the frame of the reference sequence.
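
              A toy Python example of this situation follows.  It assumes an ungapped reference
              (so alignment columns coincide with reference coordinates) and a species row with
              two nearby gaps of lengths one and two that restore the frame.  Whether the program
              treats these particular gaps as compensatory depends on rules in the source code,
              so the sequences and function names here are purely illustrative.

```python
STOPS = {"TAA", "TAG", "TGA"}


def codons(seq):
    """Split a sequence into consecutive triplets, ignoring a trailing remainder."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def premature_stops_own_frame(aligned_row):
    """Stops found when the row is read in its own frame (gaps removed)."""
    ungapped = aligned_row.replace("-", "")
    return [c for c in codons(ungapped)[:-1] if c in STOPS]


def premature_stops_ref_frame(aligned_row):
    """Stops found when the row is read in alignment columns (reference frame)."""
    return [c for c in codons(aligned_row)[:-1] if c in STOPS]


reference = "ATGGCTCGAGGCTAA"  # ATG GCT CGA GGC TAA
species = "ATG-CTTGA--CTAA"    # gaps of length 1 and 2 restore the frame

print(premature_stops_own_frame(species))  # read as ATG CTT GAC TAA
print(premature_stops_ref_frame(species))  # read as ATG -CT TGA --C TAA
```

              The species row has no premature stop in its own frame, yet the columns of the
              third reference codon spell TGA.  Its gap lengths are not multiples of three, so a
              multiple-of-three rule such as --clean-gaps would reject it.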