Provided by: transtermhp_2.09-5_amd64

NAME

       transterm - Finds rho-independent transcription terminators in bacterial genomes.

SYNOPSIS

       transterm -p expterm.dat seq.fasta annotation.ptt > output.tt

DESCRIPTION

       Any number of fasta and annotation files can be listed but fasta files should come before
       annotation files. The type of the file is determined by the extension:

           .ptt               a GenBank ptt annotation file
           .coords or .crd    a simple annotation file

       Each line of a .coords or .crd file has the format:

           gene_name  start  end  chrom_id

       The chrom_id specifies which sequence the annotation should apply to. For a .ptt file, the
       chrom_id is taken to be the filename with the path and extension removed. A filename with
       any other extension is assumed to be a fasta file.
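
       The file-type rules above can be sketched in Python. This is only an illustration of the
       rules as described here, not TransTermHP's own code; the function names are hypothetical,
       and the sketch assumes that .coords fields are separated by whitespace and that extensions
       are matched case-sensitively.

```python
import os


def classify_input(path):
    """Guess the file type from the extension, following the rules above."""
    ext = os.path.splitext(path)[1]
    if ext == ".ptt":
        return "ptt"
    if ext in (".coords", ".crd"):
        return "coords"
    # Any other extension is treated as a fasta file.
    return "fasta"


def ptt_chrom_id(path):
    """chrom_id of a .ptt file: the filename with path and extension removed."""
    return os.path.splitext(os.path.basename(path))[0]


def parse_coords_line(line):
    """Split one 'gene_name start end chrom_id' line (whitespace assumed)."""
    gene_name, start, end, chrom_id = line.split()
    return gene_name, int(start), int(end), chrom_id
```

       For example, ptt_chrom_id("/data/NC_000913.ptt") gives "NC_000913", so annotations in
       that file would apply to the sequence whose '>' line contains that ID.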

       When processing an annotation for a chromosome with id = ID, the first word of each '>'
       line of the input sequences is searched for ID.  Because there is no good standard for
       how the '>' line is formatted, several heuristics are tried to find ID in the '>' line. In
       the order tried, they are:

           >ID
           >junk|cmr:ID|junk or junk|ID|junk
           >junk|gi|ID|junk or >junk|gi|ID.junk|junk
           >junk:ID
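
       One possible reading of these heuristics, as a Python sketch. The function name and the
       exact matching rules are assumptions made for illustration; the real program may treat
       edge cases differently (for instance, this sketch accepts ID as any '|'-separated field
       for the second heuristic).

```python
def match_heuristic(header, seq_id):
    """Return 1-4 for the first heuristic that finds seq_id, or 0 if none does."""
    words = header.lstrip(">").split()
    if not words:
        return 0
    first = words[0]            # only the first word of the '>' line is searched
    fields = first.split("|")

    # 1. >ID
    if first == seq_id:
        return 1
    # 2. >junk|cmr:ID|junk or junk|ID|junk
    if any(f in (seq_id, "cmr:" + seq_id) for f in fields):
        return 2
    # 3. >junk|gi|ID|junk or >junk|gi|ID.junk|junk
    for tag, value in zip(fields, fields[1:]):
        if tag == "gi" and (value == seq_id or value.startswith(seq_id + ".")):
            return 3
    # 4. >junk:ID
    if first.endswith(":" + seq_id):
        return 4
    return 0
```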

       The option '-p expterm.dat' uses the newest confidence scheme, where expterm.dat is the
       path to the file of that name supplied with TransTermHP. If '-p expterm.dat' is omitted,
       the version 1.0 confidence scheme is used. See section 'TRANSTERM COMMAND LINE OPTIONS'
       for more detail.

   FORMAT OF THE TRANSTERM OUTPUT
       The organism's genes are listed sorted by their end coordinate and terminators are output
       between them. A terminator entry looks like this:

           TERM 19  15310 - 15327  -      F     99      -12.7 -4.0 |bidir
           (name)   (start - end)  (sense)(loc) (conf) (hp) (tail) (notes)

       where 'conf' is the overall confidence score, 'hp' is the hairpin score, and 'tail' is the
       tail score. 'Conf' (which ranges from 0 to 100) is what you probably want to use to assess
       the quality of a terminator. Higher is better.  The confidence, hp score, and tail scores
       are described in the paper cited above.  'Loc' gives the type of region the terminator is in:

           'G' = in the interior of a gene (at least 50bp from an end),
           'F' = between two +strand genes,
           'R' = between two -strand genes,
           'T' = between the ends of a +strand gene and a -strand gene,
           'H' = between the starts of a +strand gene and a -strand gene,
           'N' = none of the above (for the start and end of the DNA)

       Because of how overlapping genes are handled, these designations are not exclusive. 'G',
       'F', or 'R' can also be given in lowercase, indicating that the terminator is on the
       opposite strand from the region.  Unless the --all-context option is given, only candidate
       terminators that appear to be in an appropriate genome context (e.g. T, F, R) are output.

       Following the TERM line is the sequence of the hairpin and the 5' and 3' tails, always
       written 5' to 3'.
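
       A TERM line in the layout shown above can be split into its fields with a short Python
       sketch. The Terminator type and parse_term_line function are hypothetical names; the
       sketch assumes whitespace-separated columns, a single token in the loc column, and that
       the notes follow a '|' character, as in the example entry.

```python
from typing import NamedTuple


class Terminator(NamedTuple):
    name: str
    start: int
    end: int
    sense: str
    loc: str
    conf: int
    hp: float
    tail: float
    notes: str


def parse_term_line(line):
    """Parse one 'TERM ...' line laid out like the example above."""
    main, _, notes = line.partition("|")   # notes, if any, follow the '|'
    t = main.split()
    if len(t) != 10 or t[0] != "TERM" or t[3] != "-":
        raise ValueError("not a TERM line in the expected layout: %r" % line)
    return Terminator(
        name="%s %s" % (t[0], t[1]),
        start=int(t[2]),
        end=int(t[4]),
        sense=t[5],
        loc=t[6],
        conf=int(t[7]),
        hp=float(t[8]),
        tail=float(t[9]),
        notes=notes.strip(),
    )
```

       Applied to the example entry, this yields conf 99, hp -12.7, tail -4.0 and notes "bidir".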

   TRANSTERM COMMAND LINE OPTIONS
       You can set how large a hairpin must be to be considered:

           --min-stem=n    Stem must be n nucleotides long
           --min-loop=n    Loop portion of the hairpin must be at least n long

       You can also set the maximum size of the hairpin that will be found:

           --max-len=n     Total extent of hairpin <= n NT long
           --max-loop=n    The loop portion can be no longer than n

       The maximum length is the total length for the hairpin portion (2 stems, 1 loop) and does
       not include the U-tail. It's measured in nucleotides in the input sequence, so because of
       gaps, the actual structure may be longer than max-len.  Max-len must be less than the
       compiled-in constant REALLY_MAX_UP (which by default is 1000). To increase the size of
       structures found, recompile after increasing this constant.

       TransTermHP assigns a score to the hairpin and tail portions of potential terminators.
       Lower scores are considered better. Many of the constants used in scoring hairpins can be
       set from the command line:

           --gc=f       Score of a G-C pair
           --au=f       Score of an A-U pair
           --gu=f       Score of a G-U pair
           --mm=f       Score of any other pair
           --gap=f      Score of a gap in the hairpin

       The cost of loops of various lengths can be set using:

           --loop-penalty=f1,f2,f3,f4,f5,...fn

       where f1 is the cost of a loop of length --min-loop, f2 is the cost of a loop of length
       --min-loop+1, and so on. If there are too few terms to cover up to max-loop, the last term
       is repeated. Thus --loop-penalty=0,2 would assign cost 0 to any loop of length min-loop,
       and 2 to any longer loop (up to max-loop, after which longer loops are given infinite
       scores). Extra terms are ignored.
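
       The lookup rule for --loop-penalty can be sketched in Python as follows. This only
       illustrates the rule as described above; loop_cost is a hypothetical helper, and giving
       loops shorter than min-loop an infinite cost is an assumption of the sketch.

```python
import math


def loop_cost(length, penalties, min_loop, max_loop):
    """Cost of a loop of the given length under a --loop-penalty list."""
    if length < min_loop or length > max_loop:
        # Loops longer than max-loop get infinite scores; treating loops
        # shorter than min-loop the same way is an assumption.
        return math.inf
    index = length - min_loop          # f1 is the cost at length min-loop
    if index >= len(penalties):
        return penalties[-1]           # too few terms: the last one is repeated
    return penalties[index]            # terms beyond max-loop are never reached
```

       With --loop-penalty=0,2, a min-loop of 3 and a max-loop of 13, loop_cost(3, [0, 2], 3, 13)
       is 0, lengths 4 through 13 cost 2, and length 14 costs infinity.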

       Note that if you are using the --pval-conf confidence scheme (see below), you must
       regenerate the expterm.dat file if you change any of the above constants.

       To weed out any potential terminator with tail or hairpin scores that are too large, you
       can use the following options:

           --max-hp-score=f    Maximum allowable hairpin score
           --max-tail-score=f  Maximum allowable tail score

       Terminator hairpins must be adjacent to a "U-rich" region. You can adjust the constants
       that define what constitutes a U-rich region. Using the options:

           --uwin-size=s
           --uwin-require=r

       requires that there are at least r 'U' nucleotides in the s-nucleotide-long window
       adjacent to the hairpin. Again, if you change these constants, you should regenerate
       expterms.dat.
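
       The U-rich test can be pictured with a small Python sketch. The function name is
       hypothetical, and the sketch assumes that the tail string starts at the nucleotide
       directly adjacent to the hairpin and that a 'T' in DNA input counts as a 'U'.

```python
def is_u_rich(tail, uwin_size, uwin_require):
    """True if the first uwin_size nucleotides of the tail hold enough U's."""
    window = tail[:uwin_size].upper()
    # Count both spellings, since fasta input normally uses T rather than U.
    return window.count("U") + window.count("T") >= uwin_require
```

       For instance, is_u_rich("TTTTATTG", 6, 3) holds because the 6-nucleotide window TTTTAT
       contains five T's.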

       Before the main output, TransTermHP will output the values of the above options in a
       format suitable to be used on the command line.

       In addition to the tail and hairpin scores, each possible terminator is assigned a
       confidence --- a value between 0 and 100 that indicates how likely it is that the sequence
       is a terminator. The scoring scheme needs a background file (supplied with TransTermHP)
       that is specified using:

           --pval-conf expterms.dat

       This will use the distribution in the file expterms.dat as the background. (You can
       abbreviate this as "-p expterms.dat".) Though the supplied expterms.dat file is derived
       from random sequences, any background distribution can be used by supplying your own
       expterms.dat file.  See below for the format of expterms.dat.  The values in expterms.dat
       depend on the scoring constants, definition of u-rich regions, and the maximum allowed
       tail and hp scores.  Thus, if you change any of these constants using the options above,
       you should regenerate expterms.dat.

       The main output of TransTermHP is a list of terminators interleaved between a listing of
       the gene annotations that were provided as input. This output can be customized in a few
       ways:

           -S              Don't output the terminator sequences
           --min-conf=n    Only output terminators with confidence >= n (can
                           abbreviate this as -c n; default is 76.)

       Additional analysis output can be obtained with the following options:

           --bag-output file.bag  Output the Best terminator After Gene
           --t2t-perf file.t2t    Output a summary of which tail-to-tail regions
                                  have good terminators

   RECALIBRATING USING DIFFERENT PARAMETERS
       As mentioned above, if you change any of the basic scoring function and search parameters
       and are using the version 2.0 confidence scheme (recommended), then you have to recompute
       the values in the expterm.dat file. If you have Python installed, this is easy (though
       perhaps time-consuming). You can issue the command:

           % calibrate.sh newexpterms.dat [OPTIONS TO TRANSTERM]

       where "[OPTIONS TO TRANSTERM]" are TransTermHP options (discussed above) that set the
       parameters to what you want them to be. After calibrate.sh finishes, newexpterms.dat will
       be in the current directory and can serve as an argument to -p when using the same
       parameters you passed to calibrate.sh.

       Note that for the newexpterms.dat to be valid, you must supply the same basic parameters
       to TransTermHP on subsequent runs. TransTerm (or newexpterms.dat) will not remember these
       parameters for you. The best way to handle this is to make a shell script wrapper around
       transterm that always passes in your new parameters.
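
       Such a wrapper could also be written in Python rather than as a shell script. The sketch
       below is one possible shape; the option values in FIXED_OPTIONS are hypothetical
       placeholders and must be replaced by the parameters you passed to calibrate.sh.

```python
import subprocess

# Hypothetical values: use exactly the options given to calibrate.sh.
FIXED_OPTIONS = ["--min-stem=5", "--max-loop=10"]


def build_command(expterms, fasta_files, annotation_files):
    """Assemble a transterm command line that always carries FIXED_OPTIONS."""
    # Fasta files are listed before annotation files, as required above.
    return ["transterm", *FIXED_OPTIONS, "-p", expterms,
            *fasta_files, *annotation_files]


def run_transterm(expterms, fasta_files, annotation_files, output_path):
    """Run transterm and write its standard output to output_path."""
    with open(output_path, "w") as out:
        subprocess.run(build_command(expterms, fasta_files, annotation_files),
                       stdout=out, check=True)
```

       Keeping the fixed options in one place means every run uses the parameters that
       newexpterms.dat was calibrated for.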

       Output formatting parameters do not require regeneration of expterms.dat --- see
       discussion above for which parameters expterm.dat depends on.

       calibrate.sh can be found in the /usr/share/doc/transtermhp/examples directory.

   FORMAT OF THE EXPTERMS.DAT FILE
       The 'pval-conf' confidence scheme, selected with the option "--pval-conf expterms.dat" (or
       '-p expterms.dat') computes the confidence of a terminator with HP energy E and tail
       energy T as follows.  First, the ranges of HP energies and tail energies are evenly
       divided into bins, and the appropriate bins e and t are found for E and T. Then the
       confidence is computed as described in [2].
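
       The binning step can be sketched in Python as follows. This is an illustration under
       assumptions: bins are numbered from 0, and values outside the range are clamped to the
       first or last bin, which may differ from what TransTermHP actually does.

```python
def bin_index(value, low, high, num_bins):
    """Index of the equally sized bin of [low, high] that contains value."""
    index = int((value - low) / (high - low) * num_bins)
    # Clamp out-of-range values (and value == high) into a valid bin.
    return max(0, min(num_bins - 1, index))
```

       With a hypothetical HP range of -20 to 0 split into 10 bins, an HP energy of -12.7 falls
       into bin 3.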

       The first line of expterms.dat contains 6 numbers:

          seqlen  num_bins

       The (low_hp, high_hp) and (low_tail, high_tail) ranges give the bounds on the hairpin and
       tail scores. The integer num_bins gives the number of equally-sized bins into which those
       ranges are divided. Seqlen gives the length of the random sequence that was used to
       generate the data in the rest of the file.

       Following this line are any number of (at, R, M) triples, where 'at' is the AT content, R
       is a 4-tuple (low_hp, high_hp, low_tail, high_tail) giving the range of the HP and tail
       scores observed in random sequences of this AT content, and M is the distribution matrix.
       These (at, R, M) triples are formatted as follows:

          at  low_hp  high_hp  low_tail  high_tail
          n11 n12 n13 n14 ... n1,num_bins
          n21  ...
          ...
          n_num_bins,1 ...

       The mu_r(e,t) term is computed by selecting the matrix with the at value closest to the
       computed %AT of the region r. If the total length of region r sequence is L_r, then

         mu_r(e,t) = n_t_e * L_r/seqlen

       where n_t_e is the entry in the t-th row and e-th column of the selected matrix, and
       seqlen is the first number in the first line of the file.
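
       The computation of mu_r(e,t) can be sketched in Python as follows. The function and its
       data layout are assumptions made for illustration: matrices is a list of (at, matrix)
       pairs already read from the file, rows and columns are indexed from 0, and region_at is
       expressed in the same units as the at values in the file.

```python
def expected_count(matrices, region_at, region_len, seqlen, e, t):
    """mu_r(e,t) = n_t_e * L_r / seqlen, using the matrix with the closest AT."""
    # Pick the (at, matrix) pair whose at value is nearest the region's AT content.
    _, matrix = min(matrices, key=lambda pair: abs(pair[0] - region_at))
    # n_t_e is the entry in row t and column e of the selected matrix.
    return matrix[t][e] * region_len / seqlen
```

       For example, with matrices for AT contents 0.3 and 0.6, a region with AT content 0.55 and
       length 500 uses the 0.6 matrix; if that matrix holds 20 at row 0, column 1 and seqlen is
       1000, the result is 20 * 500 / 1000 = 10.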

SEE ALSO

       2ndscore(1)

                                            2011-02-19                               TRANSTERM(1)