Ubuntu Manpage: tRNAscan-SE - improved detection of transfer RNA genes in genomic sequence

NAME

       tRNAscan-SE - improved detection of transfer RNA genes in genomic sequence

SYNOPSIS

       tRNAscan-SE [options] seqfile(s)

DESCRIPTION

tRNAscan-SE searches for transfer RNAs in genomic sequence seqfile(s) using three separate methods to
achieve a combination of speed, sensitivity, and selectivity not available with any one of the
programs individually.

tRNAscan-SE was written in the Perl (version 5.0) scripting language. Input consists of DNA or RNA
sequences in FASTA format. tRNA predictions are output in standard tabular, ACeDB-compatible, or an
extended format including tRNA secondary structure information. tRNAscan-SE does no tRNA detection
itself, but instead combines the strengths of three independent tRNA prediction programs by negotiating
the flow of information among them, performing a limited amount of post-processing, and outputting the
result.

tRNAscan-SE combines the specificity of the Cove probabilistic RNA prediction package (Eddy & Durbin,
1994) with the speed and sensitivity of tRNAscan 1.3 (Fichant & Burks, 1991) plus an implementation of an
algorithm described by Pavesi and colleagues (1994) which searches for eukaryotic pol III tRNA promoters
(our implementation referred to as EufindtRNA). tRNAscan and EufindtRNA are used as first-pass
prefilters to identify "candidate" tRNA regions of the sequence. These subsequences are then passed to
Cove for further analysis, and output if Cove confirms the initial tRNA prediction. In this way,
tRNAscan-SE attains the best of both worlds: (1) a false positive rate as low as that of Cove
analysis, (2) the combined sensitivities of tRNAscan and EufindtRNA (detection of 99% of true tRNAs), and
(3) search speed 1,000 to 3,000 times faster than Cove analysis and 30 to 90 times faster than the
original tRNAscan 1.3 (tRNAscan-SE uses both a code-optimized version of tRNAscan 1.3 which gives a
650-fold increase in speed, and a fast C implementation of the Pavesi et al. algorithm).
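
The prefilter-then-confirm flow described above can be sketched in Python. This is only an illustration of the idea, not the program's code: confirm_candidates, the (begin, end) candidate tuples, the cove_score callable, and the toy scores are all hypothetical, and the sketch deduplicates only identical regions, whereas a real implementation would presumably also merge overlapping ones.

```python
from typing import Callable, Iterable, List, Tuple

Region = Tuple[int, int]  # (begin, end) of a candidate subsequence


def confirm_candidates(
    trnascan_hits: Iterable[Region],
    eufind_hits: Iterable[Region],
    cove_score: Callable[[Region], float],
    cutoff: float = 20.0,
) -> List[Tuple[Region, float]]:
    """Pool first-pass candidates and keep those the slow scorer confirms."""
    # Union of both prefilters: a candidate needs only one of them to find it.
    candidates = sorted(set(trnascan_hits) | set(eufind_hits))
    confirmed = []
    for region in candidates:
        score = cove_score(region)  # expensive step, run only on candidates
        if score >= cutoff:
            confirmed.append((region, score))
    return confirmed


# Toy scorer standing in for Cove; the scores are invented for illustration.
toy_scores = {(100, 172): 68.4, (500, 575): 12.1, (900, 973): 33.0}
result = confirm_candidates(
    [(100, 172), (500, 575)],   # hypothetical tRNAscan candidates
    [(100, 172), (900, 973)],   # hypothetical EufindtRNA candidates
    toy_scores.__getitem__,
)
print(result)
```

The speed gain comes from calling the expensive scorer only on the short candidate regions rather than on the whole sequence.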

tRNAscan-SE was designed to make rapid, sensitive searches of genomic sequence feasible using the
selectivity of the Cove analysis package. Search sensitivity was optimized with eukaryote cytoplasmic &
eubacterial sequences, but it may be applied more broadly with a slight reduction in sensitivity.

In the default tabular output format, each new tRNA in a sequence is consecutively numbered in the 'tRNA
#' column. 'tRNA Bounds' specify the starting (5') and ending (3') nucleotide bounds for the tRNA.
tRNAs found on the reverse (lower) strand are indicated by having the Begin (5') bound greater than the
End (3') bound.
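
A script reading the tabular output can recover the strand from the bounds alone. The following Python sketch shows one way to do so; the function name strand_and_span and the sample coordinates are hypothetical.

```python
def strand_and_span(begin: int, end: int):
    """Return (strand, low, high) for a tRNA given its Begin and End bounds.

    A Begin bound greater than the End bound marks a reverse-strand hit.
    """
    strand = "-" if begin > end else "+"
    return strand, min(begin, end), max(begin, end)


print(strand_and_span(13470, 13398))  # reverse-strand example
print(strand_and_span(2210, 2281))    # forward-strand example
```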

The 'tRNA Type' is the predicted amino acid charged to the tRNA molecule based on the predicted anticodon
(written 5'->3') displayed in the next column. tRNAs that fit criteria for potential pseudogenes (poor
primary or secondary structure) will be marked with "Pseudo" in the 'tRNA Type' column (pseudogene
checking is further discussed in the Methods section of the program manual). If there is a predicted
intron in the tRNA, the next two columns indicate the nucleotide bounds. If there is no predicted
intron, both of these columns contain zero.

The final column is the Cove score for the tRNA in bits of information. Specifically, it is a log-odds
score: the log of the ratio of the probability of the sequence given the tRNA covariance model used
(developed from hand-alignment of 1415 tRNAs), and the probability of the sequence given a simple random
sequence model. tRNAscan-SE counts any sequence that attains a score of 20.0 bits or larger as a tRNA
(based on empirical studies conducted by Eddy & Durbin in ref #2).
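
As a worked illustration of a log-odds score in bits, the Python sketch below takes the base-2 logarithm of the ratio of the two probabilities and applies the 20.0-bit threshold. The function names and the probabilities are invented for the example; a real covariance model scorer works in log space throughout and never forms such tiny probabilities directly.

```python
import math


def log_odds_bits(p_given_trna_model: float, p_given_random_model: float) -> float:
    """Log-odds score in bits: log2 of the ratio of the two probabilities."""
    return math.log2(p_given_trna_model / p_given_random_model)


def is_reported(score_bits: float, cutoff: float = 20.0) -> bool:
    """A sequence counts as a tRNA when its score is at least the cutoff."""
    return score_bits >= cutoff


# Invented probabilities: the sequence is 2**20 times more likely under the
# tRNA model than under the random model, which is exactly 20 bits.
score = log_odds_bits(2.0 ** -30, 2.0 ** -50)
print(score, is_reported(score))
```

Each additional bit doubles the odds in favor of the tRNA model, so a 20-bit score corresponds to odds of roughly a million to one.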

OPTIONS

-h Prints entire list of program options, each with a brief, one-line description.

-P This option selects the prokaryotic covariance model for tRNA analysis, and loosens the search
parameters for EufindtRNA to improve detection of prokaryotic tRNAs. Use of this mode with
prokaryotic sequences will also improve bounds prediction of the 3' end (the terminal CCA
triplet).

-A This option selects an archaeal-specific covariance model for tRNA analysis, as well as slightly
loosening the EufindtRNA search cutoffs.

-O This parameter bypasses the fast first-pass scanners that are poor at detecting organellar tRNAs
and runs Cove analysis only. Since true organellar tRNAs have been found to have Cove scores
between 15 and 20 bits, the search cutoff is lowered from 20 to 15 bits. Also, pseudogene
checking is disabled since it is only applicable to eukaryotic cytoplasmic tRNA pseudogenes.
Since Cove-only mode is used, searches will be very slow (see -C option below) relative to the
default mode.

-G This option selects the general tRNA covariance model that was trained on tRNAs from all three
phylogenetic domains (archaea, bacteria, & eukarya). This mode can be used when analyzing a mixed
collection of sequences from more than one phylogenetic domain, with only slight loss of
sensitivity and selectivity. The original publication describing this program and tRNAscan-SE
version 1.0 used this general tRNA model exclusively. If you wish to compare scores to those
found in the paper or scans using v1.0, use this option. Use of this option is compatible with
all other search mode options described in this section.

-C Directs tRNAscan-SE to analyze sequences using Cove analysis only. This option allows a slightly
more sensitive search than the default tRNAscan + EufindtRNA -> Cove mode, but is much slower (by
approx. 250 to 3,000 fold). Output format and other program defaults are otherwise identical to
the normal analysis.

-H This option displays the breakdown of the two components of the covariance model bit score. Since
tRNA pseudogenes often have one very low component (good secondary structure but poor primary
sequence similarity to the tRNA model, or vice versa), this information may be useful in deciding
whether a low-scoring tRNA is likely to be a pseudogene. The heuristic pseudogene detection
filter uses this information to flag possible pseudogenes -- use this option to see why a hit is
marked as a possible pseudogene. The user may wish to examine score breakdowns from known tRNAs
in the organism of interest to get a frame of reference.

-D Manually disable checking tRNAs for poor primary or secondary structure scores often indicative of
eukaryotic pseudogenes. This will slightly speed the program & may be necessary for non-
eukaryotic sequences that are flagged as possible pseudogenes but are known to be functional
tRNAs.

-o <file>
Output final results to <file>.

-f <file>
Save final results and Cove tRNA secondary structure predictions to <file>. This output format
makes visual inspection of individual tRNA predictions easier since the tRNA sequence is displayed
along with the predicted tRNA base pairings.

-a Output final results in ACeDB format instead of the default tabular format.

-m <file>
Save statistics summary for run. This option directs tRNAscan-SE to write a brief summary to
<file> which contains the run options selected as well as statistics on the number of tRNAs
detected at each phase of the search, search speed, and other bits of information. See Manual
documentation for explanation of each statistic.

-d Display program progress. Messages indicating which phase of the tRNA search is under way are printed to
standard output. If final results are also being sent to standard output, some of these messages
will be suppressed so as to not interrupt display of the results.

-l <file>
Save log of program progress in <file>. Identical to -d option, but sends messages to <file>
instead of standard output. Note: the -d option overrides the -l option if both are specified on
the same command line.

-q Quiet mode: the credits & run option selections normally printed to standard error at the
beginning of each run are suppressed.

-b Use brief output format. This eliminates column headers that appear by default when writing
results in tabular output format. Useful if results are to be parsed or piped to another program.
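
A minimal Python sketch of parsing one line of brief tabular output follows. It assumes whitespace-separated columns in the order described in the DESCRIPTION section, preceded by a sequence-name column; that leading column, the TrnaHit type, the function name, and the sample lines are assumptions made for illustration, so check them against real output before relying on them.

```python
from typing import NamedTuple, Optional, Tuple


class TrnaHit(NamedTuple):
    seq_name: str
    number: int
    begin: int
    end: int
    trna_type: str
    anticodon: str
    intron: Optional[Tuple[int, int]]  # None when both intron columns are zero
    score: float


def parse_brief_line(line: str) -> TrnaHit:
    """Parse one header-less result line into a TrnaHit."""
    # Only the first nine columns are used; extra columns (for example the
    # scanner column added by -y) are ignored here.
    name, num, begin, end, ttype, anticodon, ib, ie, score = line.split()[:9]
    intron_bounds = (int(ib), int(ie))
    intron = intron_bounds if intron_bounds != (0, 0) else None
    return TrnaHit(name, int(num), int(begin), int(end), ttype, anticodon,
                   intron, float(score))


# Invented sample lines, not real program output.
hit = parse_brief_line("seq1  1  12619  12738  Leu  CAA  12657  12692  55.12")
plain = parse_brief_line("seq1  2  19480  19408  Pseudo  GCT  0  0  21.30")
print(hit)
print(plain.intron, plain.begin > plain.end)
```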

-N This option causes tRNAscan-SE to output a tRNA's corresponding codon in place of its anticodon.

-(Option)#
The '#' symbol may be used as shorthand to specify "default" file names for output files. The
default file names are constructed by using the input sequence file name, followed by an extension
specifying the output file type <seqfile.ext> where '.ext' is:

Extension   Option  Description
---------   ------  -----------
.out        -o      final results
.stats      -m      summary statistics file
.log        -l      run progress file
.ss         -f      secondary structures save file
.fpass.out  -r      formatted, tabular output from first-pass scans
.fpos       -F      FASTA file of tRNAs identified in first-pass scans that were
                    found to be false positives by Cove analysis

Notes:

1) If the input sequence file name has the extensions '.fa' or '.seq', these extensions will be
removed before using the filename as a prefix for default file names. (example -- input file name
Mygene.seq will have the output file name Mygene.out if the '-o#' option is used).

2) If more than one sequence file is specified on the command line, the "default" output file
prefix will be the name of the FIRST sequence file on the command line. Use the -p option to
change this default name to something more appropriate when using more than one sequence file on
the command line.

-p <label>
Use <label> prefix as the default output file prefix when using '#' for file name specification.
<label> is used in place of the input sequence file name.
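
The naming rules for '#' default file names can be summarized in a short Python sketch. The function default_output_name and the EXTENSIONS mapping are hypothetical helpers written from the description above; how the program treats directory components in the input path is not covered here.

```python
# Option letter -> extension, as listed in the table for the '#' shorthand.
EXTENSIONS = {
    "o": ".out",
    "m": ".stats",
    "l": ".log",
    "f": ".ss",
    "r": ".fpass.out",
    "F": ".fpos",
}


def default_output_name(seqfiles, option, prefix=None):
    """Build the default file name used when an option is given '#'."""
    if prefix is None:
        # Without -p, the FIRST sequence file on the command line is used.
        prefix = seqfiles[0]
        # A trailing '.fa' or '.seq' is removed before adding the extension.
        for ext in (".fa", ".seq"):
            if prefix.endswith(ext):
                prefix = prefix[: -len(ext)]
                break
    return prefix + EXTENSIONS[option]


print(default_output_name(["Mygene.seq"], "o"))            # Mygene.out
print(default_output_name(["chr1.fa", "chr2.fa"], "m"))    # chr1.stats
print(default_output_name(["chr1.fa"], "f", prefix="run1"))  # run1.ss
```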

-y This option displays which of the first-pass scanners detected the tRNA being output. "Ts", "Eu",
or "Bo" will appear in the last column of Tabular output, indicating that either tRNAscan 1.4,
EufindtRNA, or both scanners detected the tRNA, respectively.

-X <score>
Set Cove cutoff score for reporting tRNAs (default=20). This option allows the user to specify a
different Cove score threshold for reporting tRNAs. It is not recommended that novice users
change this cutoff, as a lower cutoff score will increase the number of pseudogenes and other
false positives found by tRNAscan-SE (especially when used with the "Cove only" scan mode).
Conversely, a cutoff higher than 20.0 bits will likely cause true tRNAs to be missed by tRNAscan-SE
(numerous "real" tRNAs have been found just above the 20.0 cutoff). Knowledgeable users may wish
to experiment with this parameter to find very unusual tRNAs or pseudogenes beyond the normal
range of detection with the preceding caveats in mind.

-L <length>
Set max length of tRNA intron+variable region (default=116bp). The default maximum tRNA length
for tRNAscan-SE is 192 bp, but this limit can be increased with this option to allow searches with
no practical limit on tRNA length. In the first phase of tRNAscan-SE, EufindtRNA searches for A
and B boxes of <length> maximum distance apart, and passes only the 5' and 3' tRNA ends to
covariance model analysis for confirmation (removing the bulk of long intervening sequences).
tRNAs containing group I and II introns have been detected by setting this parameter to over 800
bp. Caution: group I or II introns in tRNAs tend to occur in positions other than the canonical
position of protein-spliced introns, so tRNAscan-SE mispredicts the intron bounds and anticodon
sequence for these cases. tRNA bound predictions, however, have been found to be reliable in
these same tRNAs.

-I <score>
This score cutoff affects the sensitivity of the first-pass scanner EufindtRNA. This parameter
should not need to be adjusted from its default values (variable depending on search mode), but is
included for users who are familiar with the Pavesi et al. (1994) paper and wish to set it
manually. See Lowe & Eddy (1997) for details on parameter values used by tRNAscan-SE depending on
the search mode.

-B <number>
By default, tRNAscan-SE adds 7 nucleotides to both ends of tRNA predictions when first-pass tRNA
predictions are passed to covariance model (CM) analysis. Use this option to pad with <number>
nucleotides instead. CM analysis generally trims these bounds back down, but on occasion the
padding allows correct prediction of a tRNA that the first-pass scan had truncated.
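
The padding step can be pictured with the Python sketch below. It assumes 1-based coordinates and clamps the padded bounds to the sequence ends; both points, and the name pad_bounds, are assumptions for illustration rather than documented behavior.

```python
def pad_bounds(begin: int, end: int, seq_length: int, padding: int = 7):
    """Extend a first-pass prediction by `padding` nt on both ends.

    Reverse-strand hits (begin > end) keep their orientation.
    """
    low, high = min(begin, end), max(begin, end)
    low = max(1, low - padding)             # do not run off the 5' edge
    high = min(seq_length, high + padding)  # do not run off the 3' edge
    return (low, high) if begin <= end else (high, low)


print(pad_bounds(10, 80, 1000))
print(pad_bounds(80, 10, 1000))
print(pad_bounds(3, 70, 72))
```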

-g <file>
Use exceptions to "universal" genetic code specified in <file>. By default, tRNAscan-SE uses a
standard universal codon -> amino acid translation table that is specified at the end of the
tRNAscan-SE.src source file. This option allows the user to specify exceptions to the default
translation table. The user may use any one of several alternate translation code files included
in this package (see files 'gcode.*'), or create a new alternate translation file. See Manual
documentation for specification of file format, or refer to included example files.

Note: this option does not have any effect when using the -T or -E options -- you must be running
in default or Cove only analysis mode.

-c <file>
For users who have developed their own tRNA covariance models using the Cove program "coveb" (see
Cove documentation), this parameter allows substitution for the default tRNA covariance models.
May be useful for extending Cove-only mode detection of particularly strange tRNA species such as
mitochondrial tRNAs.

-Q By default, if an output result file to be written to already exists, the user is prompted whether
the file should be over-written or appended to. Using this option forces overwriting of pre-existing
files without an interactive prompt. This option may be handy for batch-processing and
running tRNAscan-SE in the background.

-n <EXPR>
Search only sequences with names matching <EXPR> string. <EXPR> may contain * or ? wildcard
characters, but the user should remember to enclose these expressions in single quotes to avoid
shell expansion. Only those sequences with names (first non-white space word after ">" symbol on
FASTA name/description line) matching <EXPR> are analyzed for tRNAs.
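
To preview which sequences a given <EXPR> would select, the matching can be approximated in Python with the standard fnmatch module. This is a sketch under assumptions: the helper names are hypothetical, and fnmatch also accepts [...] character classes, which the program's own matcher may not.

```python
from fnmatch import fnmatchcase


def fasta_names(lines):
    """Yield the first non-whitespace word after '>' on each FASTA header."""
    for line in lines:
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                yield words[0]


def select_names(lines, expr):
    """Return sequence names matching a * / ? wildcard expression."""
    return [name for name in fasta_names(lines) if fnmatchcase(name, expr)]


# Invented FASTA content for illustration.
fasta = [
    ">chrI yeast chromosome I",
    "ACGTACGT",
    ">chrII yeast chromosome II",
    "GGCCGGCC",
    "> mito_1 mitochondrial contig",
    "TTAATTAA",
]
print(select_names(fasta, "chr*"))
print(select_names(fasta, "chr?"))
```

On the command line the same expression would be written in single quotes, for example -n 'chr*', so that the shell passes it through unexpanded.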

-s <EXPR>
Start search at first sequence with name matching <EXPR> string and continue to end of input
sequence file(s). This may be useful for re-starting crashed/aborted runs at the point where the
previous run stopped. (If same names for output file(s) are used, program will ask if files
should be over-written or appended to -- choose append and run will successfully be restarted
where it left off).
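
The restart behavior amounts to skipping sequences until the first matching name and then processing everything after it. A Python sketch of that selection, with a hypothetical helper name and the same caveat that fnmatch only approximates the program's matching:

```python
from fnmatch import fnmatchcase
from itertools import dropwhile


def names_from(names, expr):
    """Return names starting at the first one matching expr, through the end."""
    return list(dropwhile(lambda name: not fnmatchcase(name, expr), names))


# Invented sequence names for illustration.
order = ["contig1", "contig2", "contig3", "contig4"]
print(names_from(order, "contig3"))
```

Unlike -n, later sequences are kept whether or not they match; only the starting point is determined by <EXPR>.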

-T Directs tRNAscan-SE to use only tRNAscan to analyze sequences. This mode will default to using
"strict" parameters with tRNAscan analysis (similar to tRNAscan version 1.3 operation). This mode
of operation is faster (3-5 times faster than default mode analysis), but will result in
approximately 0.2 to 0.6 false positive tRNAs per Mbp, decreased sensitivity, and less reliable
prediction of anticodons, tRNA isotype, and introns.

-t <mode>
Explicitly set tRNAscan params, where <mode> = R or S (R=relaxed, S=strict tRNAscan v1.3 params).
This option allows selection of strict or relaxed search parameters for tRNAscan analysis. By
default, "strict" parameters are used. Relaxed parameters may give very slightly increased search
sensitivity, but increase search time by 20-40 fold.

-E Run EufindtRNA alone to search for tRNAs. Since Cove is not being used as a secondary filter to
remove false positives, this run mode defaults to "Normal" parameters which more closely
approximates the sensitivity and selectivity of the original algorithm described by Pavesi and
colleagues (see the next option, -e for a description of the various run modes).

-e <mode>
Explicitly set EufindtRNA params, where <mode>= R, N, or S (relaxed, normal, or strict). The
"relaxed" mode is used for EufindtRNA when using tRNAscan-SE in default mode. With relaxed
parameters, tRNAs that lack pol III poly-T terminators are not penalized, increasing search
sensitivity, but decreasing selectivity. When Cove analysis is being used as a secondary filter
for false positives (as in tRNAscan-SE's default mode), overall selectivity is not decreased.

Using "normal" parameters with EufindtRNA does incorporate a log odds score for the distance
between the B box and the first poly-T terminator, but does not disqualify tRNAs that do not have
a terminator signal within 60 nucleotides. This mode is used by default when Cove analysis is not
being used as a secondary false positive filter.

Using "strict" parameters with EufindtRNA also incorporates a log odds score for the distance
between the B box and the first poly-T terminator, but _rejects_ tRNAs that do not have such a
signal within 60 nucleotides of the end of the B box. This mode most closely approximates the
originally published search algorithm (3); sensitivity is reduced relative to using "relaxed" and
"normal" modes, but selectivity is increased which is important if no secondary filter, such as
Cove analysis, is being used to remove false positives. This mode will miss most prokaryotic
tRNAs since the poly-T terminator signal is a feature specific to eukaryotic tRNA genes (always
use "relaxed" mode for scanning prokaryotic sequences for tRNAs).
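
The difference between the three modes comes down to how the poly-T terminator is treated. The Python sketch below encodes only that decision logic as described above; the function name, the return convention, and the use of None for "no terminator found" are assumptions, and the actual log odds scoring of the distance is not modeled.

```python
from typing import Optional, Tuple

MAX_TERMINATOR_DISTANCE = 60  # nucleotides from the end of the B box


def terminator_policy(mode: str, terminator_distance: Optional[int]) -> Tuple[bool, bool]:
    """Return (keep_candidate, distance_is_scored) for mode R, N, or S.

    terminator_distance is the distance to the first poly-T terminator,
    or None when no terminator was found.
    """
    has_terminator = (
        terminator_distance is not None
        and terminator_distance <= MAX_TERMINATOR_DISTANCE
    )
    if mode == "R":   # relaxed: a missing terminator is not penalized
        return True, False
    if mode == "N":   # normal: distance contributes to the score, no rejection
        return True, True
    if mode == "S":   # strict: reject when no terminator lies within range
        return has_terminator, True
    raise ValueError("mode must be 'R', 'N', or 'S'")


for mode in "RNS":
    print(mode, terminator_policy(mode, None), terminator_policy(mode, 25))
```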

-r <file>
Save tabular, formatted output results from tRNAscan and/or EufindtRNA first pass scans in <file>.
The format is similar to the final tabular output format, except no Cove score is available at
this point in the search (if EufindtRNA has detected the tRNA, the negative log likelihood score
is given). Also, the sequence ID number and source sequence length appear in the columns where
intron bounds are shown in final output. This option may be useful for examining false positive
tRNAs predicted by first-pass scans that have been filtered out by Cove analysis.

-u <file>
This option allows the user to re-generate results from regions identified to have tRNAs by a
previous tRNAscan-SE run. Either a regular tabular result file, or output saved with the -r
option may be used as the specified <file>. This option is particularly useful for generating
either secondary structure output (-f option) or ACeDB output (-a option) without having to re-
scan entire sequences. Alternatively, if the -r option is used to generate the previous results
file, tRNAscan-SE will pick up at the stage of Cove-confirmation of tRNAs and output final tRNA
predictions as with a normal run.

Note: the -n and -s options will not work in conjunction with this option.

-F <file>
Save first-pass candidate tRNAs in <file> that were then found to be false positives by Cove
analysis. This option saves candidate tRNAs found by either tRNAscan and/or EufindtRNA that were
then rejected by Cove analysis as being false positives. tRNAs are saved in the FASTA sequence
format.

-M <file>
Save all sequences that do NOT have at least one tRNA prediction in them to <file>. This option
may be used when scanning a collection of known tRNA sequences to identify possible
false negatives (incorrectly missed by tRNAscan-SE) or sequences incorrectly annotated as tRNAs
(correctly passed over by tRNAscan-SE). Examination of primary & secondary structure covariance
model scores (-H option), and visual inspection of secondary structures (use -f option) may be
helpful in resolving identification conflicts.

BUGS

       No major bugs known.

NOTES

       This software and documentation is Copyright (C) 1996, Todd M.J. Lowe & Sean R. Eddy. It is freely
       distributable under terms of the GNU General Public License. See COPYING, in the source code
       distribution, for more details, or contact me.

       Todd Lowe
       Dept. of Genetics, Washington Univ. School of Medicine
       660 S. Euclid Box 8232
       St Louis, MO 63110 USA
       Phone: 1-314-362-7667
       FAX  : 1-314-362-2985
       Email: lowe@genetics.wustl.edu

REFERENCES

       1. Fichant, G.A. and Burks, C. (1991) "Identifying potential tRNA genes in genomic DNA sequences", J.
       Mol. Biol., 220, 659-671.

       2. Eddy, S.R. and Durbin, R. (1994) "RNA sequence analysis using covariance models", Nucl. Acids Res.,
       22, 2079-2088.

       3. Pavesi, A., Conterio, F., Bolchi, A., Dieci, G., Ottonello, S. (1994) "Identification of new
       eukaryotic tRNA genes in genomic DNA databases by a multistep weight matrix analysis of transcriptional
       control regions", Nucl. Acids Res., 22, 1247-1256.

       4. Lowe, T.M. and Eddy, S.R. (1997) "tRNAscan-SE: A program for improved detection of transfer RNA genes in
       genomic sequence", Nucl. Acids Res., 25, 955-964.