Provided by: tigr-glimmer_3.02b-2build1_amd64

NAME

       long-orfs — Find/Score potential genes in genome-file using the probability model in icm-file

SYNOPSIS

       tigr-long-orfs [genome-file options]

DESCRIPTION

       Program  long-orfs  takes  a  sequence  file  (in FASTA format) and outputs a list of all long "potential
       genes" in it that do not overlap by too much.  By "potential gene" I mean the portion of an orf from  the
       first start codon to the stop codon at the end.
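
       The definition above can be sketched for a single forward reading frame as follows.  This is only an
       illustration, not the program's actual code: the names potential_gene, is_start, is_stop and
       longest_in_frame are hypothetical, the sequence is assumed to be lowercase and linear, and the start
       and stop codons are the ones this page lists (atg, gtg; taa, tag, tga).

```c
#include <string.h>

/* Hypothetical record: 1-based start and end of a potential gene. */
struct potential_gene {
    int start;  /* first base of the first start codon */
    int end;    /* last base before the stop codon */
};

static int is_start(const char *s)
{
    return strncmp(s, "atg", 3) == 0 || strncmp(s, "gtg", 3) == 0;
}

static int is_stop(const char *s)
{
    return strncmp(s, "taa", 3) == 0 || strncmp(s, "tag", 3) == 0
        || strncmp(s, "tga", 3) == 0;
}

/* Scan one forward frame (0, 1 or 2) of a lowercase sequence.
   Returns the length of the longest potential gene in that frame,
   measured from the first base of the start codon to the last base
   before the stop codon, and stores its coordinates in *out.
   Returns 0 (and leaves *out untouched) if none is found. */
static int longest_in_frame(const char *seq, int frame,
                            struct potential_gene *out)
{
    int n = (int) strlen(seq);
    int first_start = -1;   /* 0-based index of first start since last stop */
    int best = 0;

    for (int i = frame; i + 3 <= n; i += 3) {
        if (is_stop(seq + i)) {
            if (first_start >= 0 && i - first_start > best) {
                best = i - first_start;
                out->start = first_start + 1;
                out->end = i;   /* 0-based i - 1, reported 1-based */
            }
            first_start = -1;
        } else if (first_start < 0 && is_start(seq + i)) {
            first_start = i;
        }
    }
    return best;
}
```

       Calling longest_in_frame("atgaaagtgccctaa", 0, &g) reports the gene from the first atg, not from the
       later gtg, because only the first start codon in the orf is used.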

       The first few lines of output specify the settings of various parameters in the program:

       Minimum  gene  length  is  the  length  of  the smallest fragment considered to be a gene.  The length is
       measured from the first base of the start codon to the last base *before* the stop codon.  This value can
       be  specified  when  running  the program with the  -g  option.  By default, the program now (April 2003)
       will compute an optimal length for this parameter,  where  "optimal"  is  the  value  that  produces  the
       greatest number of long ORFs, thereby increasing the amount of data used for training.

       Minimum overlap length is a lower bound on the number of bases by which two genes must overlap for the
       overlap to be considered a problem.  Overlaps shorter than this are ignored.

       Minimum overlap percent is another lower bound on the size of an overlap that is considered a
       problem.  Overlaps shorter than this percentage of *both* genes are ignored.
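
       One way to read the two overlap bounds together is sketched below.  It assumes an overlap counts as a
       problem only when it reaches the minimum length and also reaches the minimum percentage of at least one
       of the two genes; the function names and the closed-interval, no-wrap-around coordinates are
       illustrative assumptions, not taken from the program's source.

```c
#include <stdbool.h>

/* Hypothetical helper: number of bases shared by the closed intervals
   [lo1, hi1] and [lo2, hi2] on a linear sequence. */
static int overlap_bases(int lo1, int hi1, int lo2, int hi2)
{
    int lo = lo1 > lo2 ? lo1 : lo2;
    int hi = hi1 < hi2 ? hi1 : hi2;
    return hi >= lo ? hi - lo + 1 : 0;
}

/* Returns true when the overlap is large enough to be a problem under
   the reading described above (an assumption of this sketch). */
static bool overlap_is_problem(int lo1, int hi1, int lo2, int hi2,
                               int min_olap, double min_olap_percent)
{
    int olap = overlap_bases(lo1, hi1, lo2, hi2);
    int len1 = hi1 - lo1 + 1;
    int len2 = hi2 - lo2 + 1;

    if (olap == 0 || olap < min_olap)
        return false;               /* shorter than the minimum length */

    /* Ignored only if below the percentage of BOTH genes. */
    bool small_for_1 = olap * 100.0 < min_olap_percent * len1;
    bool small_for_2 = olap * 100.0 < min_olap_percent * len2;
    return !(small_for_1 && small_for_2);
}
```

       With a 10% bound, a 10-base overlap between a 100-base gene and a 210-base gene is flagged, because it
       reaches 10% of the shorter gene; the same overlap between two genes of about 1000 bases is ignored.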

       The next portion of the output is a list of potential genes:

       Column  1  is  an ID number for reference purposes.  It is assigned sequentially starting with  1  to all
       long potential genes.  If overlapping genes are eliminated, gaps in  the  numbers  will  occur.   The  ID
       prefix is specified in the constant  ID_PREFIX .

       Column 2 is the position of the first base of the first start codon in the orf.  Currently I use atg and
       gtg as start codons.  This is easily changed in the function Is_Start().

       Column 3 is the position of the last base *before* the stop codon.  Stop codons are taa, tag, and tga.
       Note that orfs in the reverse reading frames have their start position higher than their end position.
       Orfs are listed in increasing order of Max {OrfStart, End}, i.e., the highest
       numbered position in the orf, except for orfs that "wrap around" the end of the sequence.
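
       The listing order can be illustrated with a small sort on the higher of the two coordinates.  This is a
       sketch under assumptions: the struct orf layout and function names are hypothetical, and the special
       handling of orfs that wrap around the end of the sequence is left out.

```c
#include <stdlib.h>

/* Hypothetical record for one listed orf.  For reverse-strand orfs,
   start is larger than end. */
struct orf {
    int id;
    int start;
    int end;
};

/* Highest numbered position in the orf, i.e. max(start, end). */
static int orf_high_end(const struct orf *o)
{
    return o->start > o->end ? o->start : o->end;
}

/* qsort comparator: increasing order of the highest position. */
static int compare_by_high_end(const void *a, const void *b)
{
    int ha = orf_high_end((const struct orf *) a);
    int hb = orf_high_end((const struct orf *) b);
    return (ha > hb) - (ha < hb);
}

static void sort_orfs(struct orf *orfs, size_t n)
{
    qsort(orfs, n, sizeof orfs[0], compare_by_high_end);
}
```

       A reverse-strand orf with start 900 and end 400 therefore sorts after a forward orf spanning 100 to
       700, since 900 is greater than 700.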

       When two genes that have been assigned ID numbers overlap by a sufficient amount (as determined by
       Min_Olap and Min_Olap_Percent), they are eliminated and do not appear in the output.

       The final output of the program (sent to the standard error file so it does not show up  when  output  is
       redirected to a file) is the length of the longest orf found.

       Specifying Different Start and Stop Codons:

       To specify different sets of start and stop codons, modify the file gene.h.  Specifically, the
       functions

              Is_Forward_Start     Is_Reverse_Start     Is_Start
              Is_Forward_Stop      Is_Reverse_Stop      Is_Stop

       determine what is used for start and stop codons.

       Is_Start  and  Is_Stop  do simple string comparisons to specify which patterns are used.  To  add  a  new
       pattern,  just  add the comparison for it.  To remove a pattern, comment out or delete the comparison for
       it.

       The other four functions use a bit comparison to determine start and stop  patterns.   They  represent  a
       codon as a 12-bit pattern, with 4 bits for each base, one bit for each possible value of the bases, T, G,
       C or A.  Thus the bit pattern  0010 0101 1100  represents the base pattern  [C] [A or G] [G  or  T].   By
       doing  bit operations (& | ~) and comparisons, more complicated patterns involving ambiguous reads can be
       tested efficiently.  Simple patterns can be tested as in the current code.
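
       The 12-bit encoding can be sketched as follows.  The bit values T = 8, G = 4, C = 2, A = 1 are inferred
       from the examples on this page (0x218 for CAT, 0x184 for ATG); the function names are hypothetical, and
       treating an unrecognized character as fully ambiguous is an assumption of this sketch.

```c
#include <ctype.h>

/* One bit per possible base value, as inferred from the page's examples. */
static unsigned base_mask(char base)
{
    switch (tolower((unsigned char) base)) {
    case 't': return 0x8;
    case 'g': return 0x4;
    case 'c': return 0x2;
    case 'a': return 0x1;
    default:  return 0xF;   /* assumption: unknown base sets all four bits */
    }
}

/* Pack three bases into a 12-bit value, first base in the high nibble. */
static unsigned codon_mask(const char *s)
{
    return (base_mask(s[0]) << 8) | (base_mask(s[1]) << 4) | base_mask(s[2]);
}

/* A codon matches a pattern when every bit set in the codon is also
   set in the pattern, i.e. (codon & pattern) == codon. */
static int codon_matches(unsigned codon, unsigned pattern)
{
    return (codon & pattern) == codon;
}
```

       With this encoding, the pattern 0x25C (0010 0101 1100) accepts cag and cat but rejects ccg, because the
       second base of ccg is not A or G.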

       For example, adding CAT as an additional start codon requires 3 changes:

       1.  The line
                  || (Codon & 0x218) == Codon
           should be inserted into Is_Forward_Start, since 0x218 = 0010 0001 1000 represents CAT.

       2.  The line
                  || (Codon & 0x184) == Codon
           should be inserted into Is_Reverse_Start, since 0x184 = 0001 1000 0100 represents ATG, which is
           the reverse-complement of CAT.  Alternatively, the #define constant ATG_MASK could be used.

       3.  The line
                  || strncmp (S, "cat", 3) == 0
           should be inserted into Is_Start.
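
       The relationship between the forward and reverse masks can be checked mechanically.  The sketch below
       assumes the bit assignment T = 8, G = 4, C = 2, A = 1 implied by the values 0x218 and 0x184; under that
       assignment, complementing a base is the same as reversing the four bits of its nibble.  The function
       names are hypothetical and do not come from gene.h.

```c
/* Complement one 4-bit base mask: A (0001) <-> T (1000) and
   C (0010) <-> G (0100), which is a bit reversal of the nibble. */
static unsigned complement_nibble(unsigned n)
{
    return ((n & 0x8) >> 3) | ((n & 0x4) >> 1)
         | ((n & 0x2) << 1) | ((n & 0x1) << 3);
}

/* Reverse-complement a 12-bit codon mask: complement each base and
   swap the first and third positions. */
static unsigned revcomp_codon_mask(unsigned codon)
{
    unsigned first  = (codon >> 8) & 0xF;
    unsigned second = (codon >> 4) & 0xF;
    unsigned third  = codon & 0xF;

    return (complement_nibble(third) << 8)
         | (complement_nibble(second) << 4)
         |  complement_nibble(first);
}
```

       Applied to 0x218 (CAT), revcomp_codon_mask yields 0x184 (ATG), which agrees with the two masks used in
       steps 1 and 2.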

OPTIONS

       -g n      Set minimum gene length to n.  Default is to compute an  optimal  value  automatically.   Don't
                 change this unless you know what you're doing.

       -l        Regard  the  genome as linear (not circular), i.e., do not allow genes to "wrap around" the end
                 of the genome.  This option works on both  glimmer and long-orfs .  The default behavior is  to
                 regard the genome as circular.

       -o n      Set maximum overlap length to n.  Overlaps shorter than this are permitted.  (Default is 0 bp.)

       -p n      Set  maximum overlap percentage to n%.  Overlaps shorter than this percentage of *both* strings
                 are ignored.  (Default is 10%.)

SEE ALSO

       tigr-glimmer3(1), tigr-adjust(1), tigr-anomaly(1), tigr-build-icm(1), tigr-check(1),
       tigr-codon-usage(1), tigr-compare-lists(1), tigr-extract(1), tigr-generate(1), tigr-get-len(1),
       tigr-get-putative(1)

       http://www.tigr.org/software/glimmer/

       Please see the readme in /usr/share/doc/tigr-glimmer for a description on how to use Glimmer3.

AUTHOR

       This manual page was quickly copied from the glimmer web site by Steffen Moeller <moeller@debian.org>
       for the Debian system.

                                                                                                    LONG-ORFS(1)