Provided by: tigr-glimmer_3.02-3_amd64

NAME

       long-orfs - Find/Score potential genes in genome-file using the probability model in
       icm-file

SYNOPSIS

       tigr-long-orfs [genome-file options]

DESCRIPTION

       Program long-orfs takes a sequence file (in FASTA format) and outputs a list of  all  long
       "potential  genes"  in it that do not overlap by too much.  By "potential gene" I mean the
       portion of an orf from the first start codon to the stop codon at the end.

       The first few lines of output specify the settings of various parameters in the program:

       Minimum gene length is the length of the smallest fragment considered to be a  gene.   The
       length  is  measured  from the first base of the start codon to the last base *before* the
       stop codon.  This value can be specified when running the program with  the   -g   option.
       By  default,  the  program  now  (April  2003)  will  compute  an  optimal length for this
       parameter, where "optimal" is the value that produces the greatest number  of  long  ORFs,
       thereby increasing the amount of data used for training.

       Minimum overlap length is a lower bound on the number of bases of overlap between two
       genes that is considered a problem.  Overlaps shorter than this are ignored.

       Minimum overlap percent is another lower bound on the overlap that is considered a
       problem, expressed as a percentage of gene length.  Overlaps shorter than this
       percentage of *both* genes are ignored.
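
       The two thresholds above can be read as a single test.  The following is a minimal
       sketch of that reading, not code from the Glimmer source: the function name
       overlap_is_problem and its parameters are hypothetical, and it assumes that an
       overlap is a problem only when it reaches both lower bounds.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper, not taken from the Glimmer source.
   Returns true when an overlap of olap_len bases between two genes of
   lengths len_a and len_b counts as a problem.  Assumption: both the
   length threshold and the percentage threshold must be reached. */
bool overlap_is_problem(int olap_len, int len_a, int len_b,
                        int min_olap, double min_olap_percent)
{
    assert(len_a > 0 && len_b > 0);

    /* Overlaps shorter than the minimum overlap length are ignored. */
    if (olap_len < min_olap)
        return false;

    /* Overlaps shorter than the given percentage of *both* genes
       are ignored. */
    double pct_a = 100.0 * olap_len / len_a;
    double pct_b = 100.0 * olap_len / len_b;
    if (pct_a < min_olap_percent && pct_b < min_olap_percent)
        return false;

    return true;
}
```

       Under this reading, a 50-base overlap between genes of 1000 and 2000 bases is ignored
       at a 10% threshold, while the same overlap involving a 400-base gene is not.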

       The next portion of the output is a list of potential genes:

       Column  1  is  an  ID number for reference purposes.  It is assigned sequentially starting
       with  1  to all long potential genes.  If overlapping genes are eliminated,  gaps  in  the
       numbers will occur.  The ID prefix is specified in the constant ID_PREFIX.

       Column 2 is the position of the first base of the first start codon in the orf.  Currently
       I use atg and gtg as start codons.  This is easily changed in the function Is_Start().

       Column 3 is the position of the last base *before* the stop codon.  Stop codons are taa,
       tag, and tga.  Note that orfs in the reverse reading frames have their start position
       higher than the end position.  Orfs are listed in increasing order of
       Max {OrfStart, End}, i.e., the highest numbered position in the orf, except for orfs
       that "wrap around" the end of the sequence.
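
       As an illustration of how Columns 2 and 3 can be interpreted, here is a small
       sketch.  The struct and function names are hypothetical (they do not come from the
       Glimmer source), and the code assumes the orf does not wrap around the end of the
       sequence.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one output line; field names are illustrative. */
struct orf_line {
    long start; /* Column 2: first base of the first start codon */
    long end;   /* Column 3: last base before the stop codon */
};

/* 1 if the orf is in a reverse reading frame (start higher than end). */
int orf_is_reverse(const struct orf_line *o)
{
    assert(o != NULL);
    return o->start > o->end;
}

/* Number of bases from Column 2 to Column 3, inclusive.
   Assumes no wrap-around. */
long orf_length(const struct orf_line *o)
{
    assert(o != NULL);
    return (o->start > o->end ? o->start - o->end
                              : o->end - o->start) + 1;
}

/* Listing key: the highest numbered position in the orf. */
long orf_sort_key(const struct orf_line *o)
{
    assert(o != NULL);
    return o->start > o->end ? o->start : o->end;
}
```

       A forward orf with columns 101 and 400 and a reverse orf with columns 400 and 101
       both span 300 bases and share the listing key 400.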

       When two genes overlap by at least a sufficient amount (as determined by Min_Olap and
       Min_Olap_Percent), they are eliminated and do not appear in the output.

       The  final  output  of the program (sent to the standard error file so it does not show up
       when output is redirected to a file) is the length of the longest orf found.

       Specifying Different Start and Stop Codons:

       To specify different sets of start and stop codons, modify the file gene.h.
       Specifically, the functions:

       Is_Forward_Start  Is_Reverse_Start  Is_Start
       Is_Forward_Stop   Is_Reverse_Stop   Is_Stop

       are used to determine what is used for start and stop codons.

       Is_Start  and  Is_Stop  do simple string comparisons to specify which patterns  are  used.
       To add a new pattern, just add the comparison for it.  To remove a pattern, comment out or
       delete the comparison for it.
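
       The string-comparison style described above might look roughly like the sketch
       below.  It is not the actual contents of gene.h: the names is_start_sketch and
       is_stop_sketch are hypothetical, and the codon sets are the ones listed earlier in
       this page (atg and gtg for starts; taa, tag, and tga for stops).

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for Is_Start: one comparison per start pattern.
   To add a pattern, add another strncmp line; to remove one, delete
   or comment out its line. */
int is_start_sketch(const char *s)
{
    assert(s != NULL);
    return strncmp(s, "atg", 3) == 0
        || strncmp(s, "gtg", 3) == 0;
}

/* Hypothetical stand-in for Is_Stop, using the same approach. */
int is_stop_sketch(const char *s)
{
    assert(s != NULL);
    return strncmp(s, "taa", 3) == 0
        || strncmp(s, "tag", 3) == 0
        || strncmp(s, "tga", 3) == 0;
}
```

       Only the first three characters are compared, so a pointer into the middle of a
       longer lowercase sequence can be passed directly.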

       The other four functions use a bit comparison to determine start and stop patterns.   They
       represent  a  codon  as  a  12-bit  pattern,  with  4 bits for each base, one bit for each
       possible value of the bases, T, G,  C  or  A.   Thus  the  bit  pattern   0010  0101  1100
       represents  the  base pattern  [C] [A or G] [G or T].  By doing bit operations (& | ~) and
       comparisons,  more  complicated  patterns  involving  ambiguous  reads   can   be   tested
       efficiently.  Simple patterns can be tested as in the current code.
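
       To make the 12-bit encoding concrete, here is a minimal sketch.  The helper names
       (base_bits, codon_bits, codon_matches) are hypothetical and are not taken from
       gene.h; treating an unrecognized character as "any base" is also an assumption
       made for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* One bit per base value, in the order T, G, C, A (T is the high bit). */
enum { BASE_A = 0x1, BASE_C = 0x2, BASE_G = 0x4, BASE_T = 0x8 };

/* 4-bit pattern for a single base character (case-insensitive). */
unsigned base_bits(char ch)
{
    switch (tolower((unsigned char) ch)) {
    case 'a': return BASE_A;
    case 'c': return BASE_C;
    case 'g': return BASE_G;
    case 't': return BASE_T;
    default:  return 0xF; /* assumption: unknown means any base */
    }
}

/* 12-bit pattern for the three bases starting at s. */
unsigned codon_bits(const char *s)
{
    assert(s != NULL && s[0] != '\0' && s[1] != '\0' && s[2] != '\0');
    return (base_bits(s[0]) << 8) | (base_bits(s[1]) << 4) | base_bits(s[2]);
}

/* The test form used in this page: every bit set in the codon
   must also be set in the mask. */
int codon_matches(unsigned codon, unsigned mask)
{
    return (codon & mask) == codon;
}
```

       With this encoding, the pattern 0010 0101 1100 is the mask 0x25C.  The codon cag
       encodes to 0x214 and matches it, while cct encodes to 0x228 and does not.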

       For example, inserting an additional start codon of CAT requires 3 changes:

       1. The line || (Codon & 0x218) == Codon should be inserted into Is_Forward_Start,
          since 0x218 = 0010 0001 1000 represents CAT.

       2. The line || (Codon & 0x184) == Codon should be inserted into Is_Reverse_Start,
          since 0x184 = 0001 1000 0100 represents ATG, which is the reverse-complement of
          CAT.  Alternately, the #define constant ATG_MASK could be used.

       3. The line || strncmp (S, "cat", 3) == 0 should be inserted into Is_Start.
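
       The relationship between the two masks can be checked mechanically.  Because the
       bits are ordered T, G, C, A, complementing a base reverses its 4 bits (T and A
       swap, G and C swap), so the reverse-complement of a codon is the 12-bit pattern
       read backwards.  The sketch below illustrates this; the function name
       reverse_complement_bits is hypothetical and not part of gene.h.

```c
#include <assert.h>

/* Reverse the low 12 bits of a codon pattern.  With the T, G, C, A bit
   order this both reverses the base order and complements each base. */
unsigned reverse_complement_bits(unsigned codon)
{
    unsigned out = 0;
    int i;

    assert(codon <= 0xFFF);
    for (i = 0; i < 12; i++) {
        out = (out << 1) | (codon & 1u); /* move lowest bit to the other end */
        codon >>= 1;
    }
    return out;
}
```

       Applied to 0x218 (CAT) this yields 0x184 (ATG), and applying it twice returns the
       original pattern.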

OPTIONS

       -g n      Set minimum  gene  length  to  n.   Default  is  to  compute  an  optimal  value
                 automatically.  Don't change this unless you know what you're doing.

       -l        Regard the genome as linear (not circular), i.e., do not allow genes to "wrap
                 around" the end of the genome.  This option works on both glimmer and
                 long-orfs.  The default behavior is to regard the genome as circular.

       -o n      Set  maximum  overlap  length  to  n.  Overlaps shorter than this are permitted.
                 (Default is 0 bp.)

       -p n      Set maximum overlap percentage to n%.  Overlaps shorter than this percentage  of
                 *both* strings are ignored.  (Default is 10%.)

SEE ALSO

       tigr-glimmer3(1), tigr-adjust(1), tigr-anomaly(1), tigr-build-icm(1), tigr-check(1),
       tigr-codon-usage(1), tigr-compare-lists(1), tigr-extract(1), tigr-generate(1),
       tigr-get-len(1), tigr-get-putative(1)

       http://www.tigr.org/software/glimmer/

       Please see the readme in /usr/share/doc/tigr-glimmer for a description of how to use
       Glimmer3.

AUTHOR

       This manual page was  quickly  copied  from  the  glimmer  web  site  by  Steffen  Moeller
       moeller@debian.org for the Debian system.

                                                                                     LONG-ORFS(1)