Provided by: trnascan-se_1.3.1-1_amd64
NAME
tRNAscan-SE - improved detection of transfer RNA genes in genomic sequence
SYNOPSIS
tRNAscan-SE [options] seqfile(s)
DESCRIPTION
tRNAscan-SE searches for transfer RNAs in genomic sequence seqfile(s) using three separate methods to achieve a combination of speed, sensitivity, and selectivity not available from any of the programs individually. tRNAscan-SE was written in the Perl (version 5.0) scripting language. Input consists of DNA or RNA sequences in FASTA format. tRNA predictions are output in standard tabular format, ACeDB-compatible format, or an extended format that includes tRNA secondary structure information.
tRNAscan-SE does no tRNA detection itself, but instead combines the strengths of three independent tRNA prediction programs by negotiating the flow of information among them, performing a limited amount of post-processing, and outputting the result.
tRNAscan-SE combines the specificity of the Cove probabilistic RNA prediction package (Eddy & Durbin, 1994) with the speed and sensitivity of tRNAscan 1.3 (Fichant & Burks, 1991) plus an implementation of an algorithm described by Pavesi and colleagues (1994) that searches for eukaryotic pol III tRNA promoters (our implementation is referred to as EufindtRNA). tRNAscan and EufindtRNA are used as first-pass prefilters to identify "candidate" tRNA regions of the sequence. These subsequences are then passed to Cove for further analysis, and output if Cove confirms the initial tRNA prediction. In this way, tRNAscan-SE attains the best of both worlds: (1) a false positive rate as low as that of Cove analysis alone, (2) the combined sensitivities of tRNAscan and EufindtRNA (detection of 99% of true tRNAs), and (3) a search speed 1,000 to 3,000 times faster than Cove analysis and 30 to 90 times faster than the original tRNAscan 1.3 (tRNAscan-SE uses both a code-optimized version of tRNAscan 1.3, which gives a 650-fold increase in speed, and a fast C implementation of the Pavesi et al. algorithm).
tRNAscan-SE was designed to make rapid, sensitive searches of genomic sequence feasible using the selectivity of the Cove analysis package.
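The prefilter-then-confirm flow described above can be pictured with a small Python sketch. This is only an illustration of the control flow, not the program's actual code: the function two_pass_search and the three fake_* scanners are hypothetical stand-ins, the coordinates and scores are made up, and the sketch pools only identical candidate regions, whereas the real program's handling of overlapping predictions is not described here.

```python
from typing import Callable, Iterable, List, Tuple

Region = Tuple[int, int]  # (begin, end) coordinates of a candidate tRNA


def two_pass_search(
    sequence: str,
    prefilters: Iterable[Callable[[str], List[Region]]],
    score_region: Callable[[str, Region], float],
    cutoff: float = 20.0,
) -> List[Tuple[Region, float]]:
    """Return (region, score) pairs for candidates confirmed by the second pass."""
    # First pass: pool candidates from every fast scanner, dropping exact duplicates.
    candidates = set()
    for scan in prefilters:
        candidates.update(scan(sequence))

    # Second pass: run the slow, selective scorer on the candidates only.
    confirmed = []
    for region in sorted(candidates):
        score = score_region(sequence, region)
        if score >= cutoff:
            confirmed.append((region, score))
    return confirmed


# Toy stand-ins for illustration only; they are NOT the real scanners.
def fake_trnascan(seq: str) -> List[Region]:
    return [(10, 82), (200, 271)]


def fake_eufindtrna(seq: str) -> List[Region]:
    return [(10, 82), (500, 573)]


def fake_cove(seq: str, region: Region) -> float:
    # Made-up scores in bits; (200, 271) falls below the 20.0 cutoff.
    return {(10, 82): 71.3, (200, 271): 12.0, (500, 573): 33.9}[region]


hits = two_pass_search("N" * 600, [fake_trnascan, fake_eufindtrna], fake_cove)
print(hits)
```

The point of the design is that the expensive scorer sees only the few regions the fast scanners nominate, so selectivity comes from the second pass while sensitivity is bounded by the union of the first-pass scanners.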
Search sensitivity was optimized with eukaryotic cytoplasmic & eubacterial sequences, but the program may be applied more broadly with a slight reduction in sensitivity.
In the default tabular output format, each new tRNA in a sequence is consecutively numbered in the 'tRNA #' column. 'tRNA Bounds' specify the starting (5') and ending (3') nucleotide bounds for the tRNA. tRNAs found on the reverse (lower) strand are indicated by having the Begin (5') bound greater than the End (3') bound.
The 'tRNA Type' is the predicted amino acid charged to the tRNA molecule based on the predicted anticodon (written 5'->3') displayed in the next column. tRNAs that fit criteria for potential pseudogenes (poor primary or secondary structure) are marked with "Pseudo" in the 'tRNA Type' column (pseudogene checking is further discussed in the Methods section of the program manual). If there is a predicted intron in the tRNA, the next two columns indicate the nucleotide bounds. If there is no predicted intron, both of these columns contain zero.
The final column is the Cove score for the tRNA in bits of information. Specifically, it is a log-odds score: the log of the ratio of the probability of the sequence given the tRNA covariance model used (developed from hand-alignment of 1415 tRNAs) to the probability of the sequence given a simple random sequence model. tRNAscan-SE counts any sequence that attains a score of 20.0 bits or larger as a tRNA (based on empirical studies conducted by Eddy & Durbin in ref #2).
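For readers who post-process results, the column rules above translate into a short parser. The following Python sketch makes several assumptions that should be checked against real output: that fields are whitespace-separated, that a sequence-name column precedes the 'tRNA #' column, and that header lines have already been removed (for example with the -b option). The class TrnaHit, the function parse_hit, and the two sample lines are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TrnaHit:
    seq_name: str
    number: int
    begin: int
    end: int
    trna_type: str
    anticodon: str
    intron: Optional[Tuple[int, int]]  # None when both intron columns are zero
    score: float  # Cove score in bits

    @property
    def strand(self) -> str:
        # Reverse-strand hits are reported with Begin greater than End.
        return "-" if self.begin > self.end else "+"

    @property
    def is_pseudo(self) -> bool:
        return self.trna_type == "Pseudo"


def parse_hit(line: str) -> TrnaHit:
    """Parse one data line of tabular output (assumed column order, see above)."""
    fields = line.split()
    if len(fields) < 9:
        raise ValueError(f"expected at least 9 columns, got {len(fields)}")
    intron_begin, intron_end = int(fields[6]), int(fields[7])
    has_intron = not (intron_begin == 0 and intron_end == 0)
    return TrnaHit(
        seq_name=fields[0],
        number=int(fields[1]),
        begin=int(fields[2]),
        end=int(fields[3]),
        trna_type=fields[4],
        anticodon=fields[5],
        intron=(intron_begin, intron_end) if has_intron else None,
        score=float(fields[8]),
    )


# Made-up sample lines, not real program output.
hit1 = parse_hit("chrI  1  1250  1178  Leu  CAA  0     0     65.40")
hit2 = parse_hit("chrI  2  3000  3085  Tyr  GTA  3038  3051  58.12")
print(hit1.strand, hit1.intron, hit1.score)
print(hit2.strand, hit2.intron, hit2.score)
```

Extra trailing columns, such as those added by the -y or -H options, are simply ignored by this sketch.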
OPTIONS
-h
    Prints the entire list of program options, each with a brief, one-line description.
-P
    This option selects the prokaryotic covariance model for tRNA analysis, and loosens the search parameters for EufindtRNA to improve detection of prokaryotic tRNAs. Use of this mode with prokaryotic sequences will also improve bounds prediction of the 3' end (the terminal CCA triplet).
-A
    This option selects an archaeal-specific covariance model for tRNA analysis, as well as slightly loosening the EufindtRNA search cutoffs.
-O
    This option bypasses the fast first-pass scanners, which are poor at detecting organellar tRNAs, and runs Cove analysis only. Since true organellar tRNAs have been found to have Cove scores between 15 and 20 bits, the search cutoff is lowered from 20 to 15 bits. Also, pseudogene checking is disabled since it is only applicable to eukaryotic cytoplasmic tRNA pseudogenes. Since Cove-only mode is used, searches will be very slow (see -C option below) relative to the default mode.
-G
    This option selects the general tRNA covariance model that was trained on tRNAs from all three phylogenetic domains (archaea, bacteria, & eukarya). This mode can be used when analyzing a mixed collection of sequences from more than one phylogenetic domain, with only slight loss of sensitivity and selectivity. The original publication describing this program and tRNAscan-SE version 1.0 used this general tRNA model exclusively. If you wish to compare scores to those found in the paper or scans using v1.0, use this option. Use of this option is compatible with all other search mode options described in this section.
-C
    Directs tRNAscan-SE to analyze sequences using Cove analysis only. This option allows a slightly more sensitive search than the default tRNAscan + EufindtRNA -> Cove mode, but is much slower (by approx. 250 to 3,000 fold). Output format and other program defaults are otherwise identical to the normal analysis.
-H
    This option displays the breakdown of the two components of the covariance model bit score. Since tRNA pseudogenes often have one very low component (good secondary structure but poor primary sequence similarity to the tRNA model, or vice versa), this information may be useful in deciding whether a low-scoring tRNA is likely to be a pseudogene. The heuristic pseudogene detection filter uses this information to flag possible pseudogenes -- use this option to see why a hit is marked as a possible pseudogene. The user may wish to examine score breakdowns from known tRNAs in the organism of interest to get a frame of reference.
-D
    Manually disable checking tRNAs for poor primary or secondary structure scores often indicative of eukaryotic pseudogenes. This will slightly speed the program & may be necessary for non-eukaryotic sequences that are flagged as possible pseudogenes but are known to be functional tRNAs.
-o <file>
    Output final results to <file>.
-f <file>
    Save final results and Cove tRNA secondary structure predictions to <file>. This output format makes visual inspection of individual tRNA predictions easier since the tRNA sequence is displayed along with the predicted tRNA base pairings.
-a
    Output final results in ACeDB format instead of the default tabular format.
-m <file>
    Save statistics summary for run. This option directs tRNAscan-SE to write a brief summary to <file> which contains the run options selected as well as statistics on the number of tRNAs detected at each phase of the search, search speed, and other bits of information. See Manual documentation for explanation of each statistic.
-d
    Display program progress. Messages indicating which phase of the tRNA search is in progress are printed to standard output. If final results are also being sent to standard output, some of these messages will be suppressed so as to not interrupt display of the results.
-l <file>
    Save log of program progress in <file>.
    Identical to -d option, but sends messages to <file> instead of standard output. Note: the -d option overrides the -l option if both are specified on the same command line.
-q
    Quiet mode: the credits & run option selections normally printed to standard error at the beginning of each run are suppressed.
-b
    Use brief output format. This eliminates column headers that appear by default when writing results in tabular output format. Useful if results are to be parsed or piped to another program.
-N
    This option causes tRNAscan-SE to output a tRNA's corresponding codon in place of its anticodon.
-(Option)#
    The '#' symbol may be used as shorthand to specify "default" file names for output files. The default file names are constructed by using the input sequence file name, followed by an extension specifying the output file type <seqfile.ext> where '.ext' is:
    Extension   Option  Description
    ---------   ------  -----------
    .out        -o      final results
    .stats      -m      summary statistics file
    .log        -l      run progress file
    .ss         -f      secondary structures save file
    .fpass.out  -r      formatted, tabular output from first-pass scans
    .fpos       -F      FASTA file of tRNAs identified in first-pass scans that were found to be false positives by Cove analysis
    Notes:
    1) If the input sequence file name has the extension '.fa' or '.seq', the extension will be removed before using the filename as a prefix for default file names (example -- input file name Mygene.seq will have the output file name Mygene.out if the '-o#' option is used).
    2) If more than one sequence file is specified on the command line, the "default" output file prefix will be the name of the FIRST sequence file on the command line. Use the -p option to change this default name to something more appropriate when using more than one sequence file on the command line.
-p <label>
    Use <label> prefix as the default output file prefix when using '#' for file name specification. <label> is used in place of the input sequence file name.
-y
    This option displays which of the first-pass scanners detected the tRNA being output. "Ts", "Eu", or "Bo" will appear in the last column of tabular output, indicating that tRNAscan 1.4, EufindtRNA, or both scanners detected the tRNA, respectively.
-X <score>
    Set Cove cutoff score for reporting tRNAs (default=20). This option allows the user to specify a different Cove score threshold for reporting tRNAs. It is not recommended that novice users change this cutoff, as a lower cutoff score will increase the number of pseudogenes and other false positives found by tRNAscan-SE (especially when used with the "Cove only" scan mode). Conversely, a cutoff higher than 20.0 bits will likely cause true tRNAs to be missed by tRNAscan-SE (numerous "real" tRNAs have been found just above the 20.0 cutoff). Knowledgeable users may wish to experiment with this parameter to find very unusual tRNAs or pseudogenes beyond the normal range of detection, with the preceding caveats in mind.
-L <length>
    Set max length of tRNA intron+variable region (default=116bp). The default maximum tRNA length for tRNAscan-SE is 192 bp, but this limit can be increased with this option to allow searches with no practical limit on tRNA length. In the first phase of tRNAscan-SE, EufindtRNA searches for A and B boxes of <length> maximum distance apart, and passes only the 5' and 3' tRNA ends to covariance model analysis for confirmation (removing the bulk of long intervening sequences). tRNAs containing group I and II introns have been detected by setting this parameter to over 800 bp. Caution: group I or II introns in tRNAs tend to occur in positions other than the canonical position of protein-spliced introns, so tRNAscan-SE mispredicts the intron bounds and anticodon sequence for these cases. tRNA bound predictions, however, have been found to be reliable in these same tRNAs.
-I <score>
    This score cutoff affects the sensitivity of the first-pass scanner EufindtRNA.
    This parameter should not need to be adjusted from its default values (variable depending on search mode), but is included for users who are familiar with the Pavesi et al. (1994) paper and wish to set it manually. See Lowe & Eddy (1997) for details on parameter values used by tRNAscan-SE depending on the search mode.
-B <number>
    By default, tRNAscan-SE adds 7 nucleotides to both ends of tRNA predictions when first-pass tRNA predictions are passed to covariance model (CM) analysis; use this option to set the padding to <number> nucleotides. CM analysis generally trims these bounds back down, but on occasion the padding allows prediction of an otherwise truncated first-pass tRNA prediction.
-g <file>
    Use exceptions to "universal" genetic code specified in <file>. By default, tRNAscan-SE uses a standard universal codon -> amino acid translation table that is specified at the end of the tRNAscan-SE.src source file. This option allows the user to specify exceptions to the default translation table. The user may use any one of several alternate translation code files included in this package (see files 'gcode.*'), or create a new alternate translation file. See Manual documentation for specification of file format, or refer to the included example files. Note: this option does not have any effect when using the -T or -E options -- you must be running in default or Cove only analysis mode.
-c <file>
    For users who have developed their own tRNA covariance models using the Cove program "coveb" (see Cove documentation), this parameter allows substitution for the default tRNA covariance models. May be useful for extending Cove-only mode detection of particularly strange tRNA species such as mitochondrial tRNAs.
-Q
    By default, if an output result file to be written to already exists, the user is prompted whether the file should be over-written or appended to. Using this option forces overwriting of pre-existing files without an interactive prompt. This option may be handy for batch-processing and running tRNAscan-SE in the background.
-n <EXPR>
    Search only sequences with names matching <EXPR> string. <EXPR> may contain * or ? wildcard characters, but the user should remember to enclose these expressions in single quotes to avoid shell expansion. Only those sequences with names (first non-white space word after ">" symbol on FASTA name/description line) matching <EXPR> are analyzed for tRNAs.
-s <EXPR>
    Start search at first sequence with name matching <EXPR> string and continue to end of input sequence file(s). This may be useful for re-starting crashed/aborted runs at the point where the previous run stopped. (If the same names for output file(s) are used, the program will ask if files should be over-written or appended to -- choose append and the run will successfully be restarted where it left off.)
-T
    Directs tRNAscan-SE to use only tRNAscan to analyze sequences. This mode will default to using "strict" parameters with tRNAscan analysis (similar to tRNAscan version 1.3 operation). This mode of operation is faster (3-5 times faster than default mode analysis), but will result in approximately 0.2 to 0.6 false positive tRNAs per Mbp, decreased sensitivity, and less reliable prediction of anticodons, tRNA isotype, and introns.
-t <mode>
    Explicitly set tRNAscan params, where <mode> = R or S (R=relaxed, S=strict tRNAscan v1.3 params). This option allows selection of strict or relaxed search parameters for tRNAscan analysis. By default, "strict" parameters are used. Relaxed parameters may give very slightly increased search sensitivity, but increase search time by 20-40 fold.
-E
    Run EufindtRNA alone to search for tRNAs. Since Cove is not being used as a secondary filter to remove false positives, this run mode defaults to "Normal" parameters, which more closely approximate the sensitivity and selectivity of the original algorithm described by Pavesi and colleagues (see the next option, -e, for a description of the various run modes).
-e <mode>
    Explicitly set EufindtRNA params, where <mode> = R, N, or S (relaxed, normal, or strict).
    The "relaxed" mode is used for EufindtRNA when using tRNAscan-SE in default mode. With relaxed parameters, tRNAs that lack pol III poly-T terminators are not penalized, increasing search sensitivity, but decreasing selectivity. When Cove analysis is being used as a secondary filter for false positives (as in tRNAscan-SE's default mode), overall selectivity is not decreased.
    Using "normal" parameters with EufindtRNA does incorporate a log odds score for the distance between the B box and the first poly-T terminator, but does not disqualify tRNAs that do not have a terminator signal within 60 nucleotides. This mode is used by default when Cove analysis is not being used as a secondary false positive filter.
    Using "strict" parameters with EufindtRNA also incorporates a log odds score for the distance between the B box and the first poly-T terminator, but _rejects_ tRNAs that do not have such a signal within 60 nucleotides of the end of the B box. This mode most closely approximates the originally published search algorithm (ref #3); sensitivity is reduced relative to using "relaxed" and "normal" modes, but selectivity is increased, which is important if no secondary filter, such as Cove analysis, is being used to remove false positives. This mode will miss most prokaryotic tRNAs since the poly-T terminator signal is a feature specific to eukaryotic tRNA genes (always use "relaxed" mode for scanning prokaryotic sequences for tRNAs).
-r <file>
    Save tabular, formatted output results from tRNAscan and/or EufindtRNA first pass scans in <file>. The format is similar to the final tabular output format, except no Cove score is available at this point in the search (if EufindtRNA has detected the tRNA, the negative log likelihood score is given). Also, the sequence ID number and source sequence length appear in the columns where intron bounds are shown in final output.
    This option may be useful for examining false positive tRNAs predicted by first-pass scans that have been filtered out by Cove analysis.
-u <file>
    This option allows the user to re-generate results from regions identified to have tRNAs by a previous tRNAscan-SE run. Either a regular tabular result file, or output saved with the -r option may be used as the specified <file>. This option is particularly useful for generating either secondary structure output (-f option) or ACeDB output (-a option) without having to re-scan entire sequences. Alternatively, if the -r option is used to generate the previous results file, tRNAscan-SE will pick up at the stage of Cove-confirmation of tRNAs and output final tRNA predictions as with a normal run. Note: the -n and -s options will not work in conjunction with this option.
-F <file>
    Save first-pass candidate tRNAs in <file> that were then found to be false positives by Cove analysis. This option saves candidate tRNAs found by tRNAscan and/or EufindtRNA that were then rejected by Cove analysis as being false positives. tRNAs are saved in the FASTA sequence format.
-M <file>
    This option may be used when scanning a collection of known tRNA sequences to identify possible false negatives (incorrectly missed by tRNAscan-SE) or sequences incorrectly annotated as tRNAs (correctly passed over by tRNAscan-SE). Examination of primary & secondary structure covariance model scores (-H option), and visual inspection of secondary structures (use -f option) may be helpful in resolving identification conflicts.
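The '#' shorthand naming rules given under -(Option)# and -p can be summarized in a few lines of Python. This is a sketch of the rules as documented above, not the program's own code: the function name default_output_name and its arguments are hypothetical, and details the text does not state (for example, how directory components in the input path are treated) are deliberately left alone.

```python
from typing import Optional, Sequence

# Extension table as listed under -(Option)#; keys are option letters.
EXTENSIONS = {
    "o": ".out",
    "m": ".stats",
    "l": ".log",
    "f": ".ss",
    "r": ".fpass.out",
    "F": ".fpos",
}


def default_output_name(
    option: str, seqfiles: Sequence[str], label: Optional[str] = None
) -> str:
    """Build the default file name used when '#' is given for an output option."""
    if label is not None:
        # -p <label> replaces the input sequence file name as the prefix.
        prefix = label
    else:
        # Only the FIRST sequence file on the command line is used.
        prefix = seqfiles[0]
        for ext in (".fa", ".seq"):
            if prefix.endswith(ext):
                prefix = prefix[: -len(ext)]
                break
    return prefix + EXTENSIONS[option]


print(default_output_name("o", ["Mygene.seq"]))               # Mygene.out
print(default_output_name("f", ["chr1.fa", "chr2.fa"]))       # chr1.ss
print(default_output_name("m", ["chr1.fa"], label="genome"))  # genome.stats
```

When several sequence files are scanned in one run, passing a label makes the output names describe the whole run rather than only the first input file.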
SEE ALSO
User Manual and tutorial: Manual.ps (postscript), MANUAL (text)
BUGS
No major bugs known.
NOTES
This software and documentation is Copyright (C) 1996, Todd M.J. Lowe & Sean R. Eddy. It is freely distributable under terms of the GNU General Public License. See COPYING, in the source code distribution, for more details, or contact me.
    Todd Lowe
    Dept. of Genetics, Washington Univ. School of Medicine
    660 S. Euclid Box 8232
    St Louis, MO 63110 USA
    Phone: 1-314-362-7667
    FAX  : 1-314-362-2985
    Email: lowe@genetics.wustl.edu
REFERENCES
1. Fichant, G.A. and Burks, C. (1991) "Identifying potential tRNA genes in genomic DNA sequences", J. Mol. Biol., 220, 659-671.
2. Eddy, S.R. and Durbin, R. (1994) "RNA sequence analysis using covariance models", Nucl. Acids Res., 22, 2079-2088.
3. Pavesi, A., Conterio, F., Bolchi, A., Dieci, G., Ottonello, S. (1994) "Identification of new eukaryotic tRNA genes in genomic DNA databases by a multistep weight matrix analysis of transcriptional control regions", Nucl. Acids Res., 22, 1247-1256.
4. Lowe, T.M. and Eddy, S.R. (1997) "tRNAscan-SE: A program for improved detection of transfer RNA genes in genomic sequence", Nucl. Acids Res., 25, 955-964.