Vol. 24 no. 2 2008, pages 172–175
BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btm573
Sequence analysis
ARWEN: a program to detect tRNA genes in metazoan
mitochondrial nucleotide sequences
Dean Laslett1 and Björn Canbäck2,*
1Murdoch University, Perth, Western Australia, Australia and 2Björn Canbäck Bioinformatik, Vindögatan 66, 257 33 Rydebäck, Sweden
Downloaded from https://academic.oup.com/bioinformatics/article-abstract/24/2/172/228155 by guest on 16 October 2018
Received on June 21, 2007; revised and accepted on November 13, 2007
Advance Access publication November 22, 2007
Associate Editor: Thomas Lengauer
*To whom correspondence should be addressed.
© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

ABSTRACT
Motivation: Mitochondrial genomes encode their own transfer RNAs (tRNAs). These are often degenerate in sequence and structure compared to tRNAs in their bacterial ancestors. This is one of the reasons why current tRNA gene predictor programs perform poorly when identifying mitochondrial tRNA genes. As a consequence, there is a need for a new program with the specific aim of predicting these tRNAs.
Results: In this study, we present the software ARWEN that identifies tRNA genes in metazoan mitochondrial nucleotide sequences. ARWEN detects close to 100% of previously annotated genes.
Availability: An online version, software for download and test results are available at www.acgt.se/online.html
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION
The mitochondrial transfer RNA (tRNA) genes of many metazoan (including mammalian) species exhibit less conformity to the canonical cloverleaf secondary tRNA structure, and less homology to recognized tRNA consensus motifs, than cytosolic or prokaryotic tRNA genes, to the extent that these mitochondrial genes are referred to as ‘bizarre’ (Helm et al., 2000). Smaller than usual dihydrouridine (D) stems and loops; smaller than usual TΨC (T) stems and loops; changes in the number of connector bases between the acceptor (A) stem and D-stem, and D-stem and anticodon (C) stem; and elongated C-stems, have been reported, but perhaps the most astounding feature in some of these tRNA genes is the complete absence of either the whole D-arm, or the whole T-arm and variable (V) loop. These are replaced by short sequences, called the D-replacement loop and TV-replacement loop, respectively. Furthermore, mitochondria often use a genetic code that differs from the universal genetic code.

For these reasons, conventional tRNA detection programs perform poorly. For example, the ARAGORN tRNA detection program (Laslett and Canback, 2004) detects only 3 of the 22 tRNA genes in the Homo sapiens mitochondrion, and tRNAscan-SE (Lowe and Eddy, 1997), using standard settings, detects only 1 gene.

The purpose of this study is to develop a heuristic algorithm to search in silico for metazoan tRNA genes. The software should have a detection rate close to 100% even if this results in a number of falsely predicted tRNAs, since these can be easily removed when analyzing the genome sequence. The program should be user-friendly with a limited number of (user) parameter settings, produce results that are easy to interpret, and a website should be available for the user to perform online analysis. The ARWEN program successfully fulfills all of these requirements.

2 METHODS
2.1 Search algorithm
ARWEN employs a heuristic algorithm that searches for hairpin structures with a 5–6 base-pair (bp) stem and a 6–8 base loop that could be a candidate C-arm. For every candidate C-arm, the upstream sequence is searched for possible D-arm structures (2–5 bp stem and loop of 3–17 bases) and the downstream sequence for possible T-arm structures (2–7 bp stem and loop of 3–31 bases). Both upstream and downstream sequences are assessed for base pairing interactions that could indicate the presence of an A-stem (5–8 bp). ARWEN then attempts to combine these structures into a complete tRNA gene containing at least three out of four of these structures. Unlike ARAGORN, ARWEN does not allow for the presence of introns in the C-loop.

Three different algorithms are used, one for each type of tRNA (D-replacement loop, TV-replacement loop and standard cloverleaf), to assign a score to a candidate sequence. Because a standard cloverleaf tRNA can also form a D- or TV-replacement loop tRNA by opening one hairpin, the initial score for the cloverleaf type is set higher to favor the formation of a full cloverleaf if possible.

The presence of TA in the spacer sequence between A-stem and D-arm, CT<nnn>AA in the C-arm (where <nnn> denotes the anticodon sequence), and GTTC homology in the T-arm (when present) are given extra points, respectively. However, the importance of stem base-pairing interactions and tertiary structure interactions is increased compared to the ARAGORN program, and the importance of consensus sequence homology is reduced.

In all stems, GC bonds are considered to be the most thermodynamically stable, and are not penalized. AT bonds, GT bonds and non-bonding base pairs are penalized in order of increasing magnitude. The score for each stem is further modified by stem termination and stem opening base combinations (the last base pairs at either end of the stem and one base beyond), tandem repeats and nearest neighbor interactions within the stem, and lengths of stem and loop for the T-arm that differ from the canonical values (5 bp stem and 7 base loop). Sequences with a high (>50%) or low (<10%) overall GC content are also penalized.

The three different algorithms then use 40, 32 and 65 different partial combinations respectively of stem base pairing, tertiary interactions and consensus sequence motifs to further modify the score.

If the final score from each algorithm is greater than a cutoff threshold value, then the candidate sequence is accepted as a possible tRNA gene. Since D-replacement loop tRNA genes were given the lowest initial score, they have the lowest cutoff value, followed by TV-replacement loop genes, and then full cloverleaf genes.

The precise scoring magnitudes and cutoff thresholds within each algorithm were calibrated manually to achieve maximum sensitivity using a training set of 125 annotated metazoan mitochondrial genomes with a total sequence length of 1.98 Megabases. The chosen calibration generated 39 false negatives and 370 false positives from 2696 annotated tRNA genes, giving a sensitivity of 98.5% and a selectivity of 186.6 per Megabase, or almost 3 false positives per genome on average. The typical selectivity for the ARAGORN program was reported as 0.0035 per Megabase (Laslett and Canback, 2004). Hence, ARWEN is not suitable for use on longer cytoplasmic genomes.

2.2 Testing the algorithm
The ARWEN algorithm described here is tested against a number of datasets:
(1) A set of mammalian mitochondrial tRNA gene sequences from the Mamit-tRNA database (at http://mamit-trna.u-strasbg.fr/). The set contains 678 tRNA gene sequences from 31 mammalian mitochondrial genomes arranged into 22 groups according to amino acid specificity. One incomplete sequence (tRNA-Ser(GCT) from Didelphis virginiana) was removed, and two sequences with errors in the C-loop (tRNA-Phe(GAA) from Canis familiaris and tRNA-Lys(TTT) from Oryctolagus cuniculus) were corrected, using sequences from the original complete mitochondrial genomes accessed from GenBank, to leave a final set of 677 tRNA sequences. This set includes 30 tRNA-Ser with D-replacement loops.
(2) A set of metazoan mitochondrial tRNA gene sequences from the OGRe database (Jameson et al., 2003) at http://www.bioinf.man.ac.uk/ogre. This set was extensively corrected using sequences from the original complete mitochondrial genomes. The set contained 10 346 tRNA gene sequences.
(3) A set of 23 mitochondrial genomes with RefSeq annotations (Pruitt et al., 2005) that were randomly chosen by using stratified selection to allow representatives from distant lineages with few members. RefSeq is a comprehensive set of sequences that have been manually curated by NCBI staff. The 23 genomes contain 469 tRNAs according to the RefSeq annotations. Detection rates and rate of false positives are compared with those produced by tRNAscan-SE (version 1.23). For the mammalian genomes, ARWEN was invoked by the -mtmam switch (search for mammalian tRNAs) and the -gcmam switch (use the mammalian mitochondrial genetic code). For all other genomes, ARWEN was invoked with the -gcmet switch (use a composite metazoan mitochondrial genetic code). tRNAscan-SE was invoked by the -O option (search for organellar tRNAs) and the -g option (use a specified alternate genetic code). None of the 23 genomes was included in the training set.

3 RESULTS
When running ARWEN on sequences from the Mamit-tRNA database set, a detection sensitivity of 100% was achieved (tRNAscan-SE: 93.6%). A 99.1% detection sensitivity was achieved for the corrected OGRe database set (tRNAscan-SE: 82.8%).

When comparing ARWEN with tRNAscan-SE against the set of 23 mitochondrial genomes containing 469 tRNAs according to RefSeq annotations, ARWEN achieved a 99.4% detection sensitivity. The corresponding value for tRNAscan-SE was 72.1%. ARWEN completed scanning the 23 genomes almost 30 times faster than tRNAscan-SE (total run-time 3.29 min on an Intel Centrino Duo 1.6 GHz CPU, 512 MB memory running Cygwin 1.5.24-2 under Windows XP-SP2; compared to 96.09 min total run-time for tRNAscan-SE). The number of reported false positives when using ARWEN was 115. The corresponding value for tRNAscan-SE was 8. When the threshold values in ARWEN were raised until the total number of false positives was also 8 (to give an overall selectivity equivalent to tRNAscan-SE), then the overall sensitivity was 81.7%, almost 10 percentage points above tRNAscan-SE.

Test results are summarized in Table 1. For more detail, please see Supplementary Material found at www.acgt.se/online.html.

4 IMPLEMENTATION
ARWEN is written in C. The source code can be downloaded from the website. The website also contains a user interface to the program allowing the user to upload a sequence and run the program on a server. Aggregate sequence lengths up to 2 MB are allowed. ARWEN accepts as input a file with one or more nucleotide sequences in FASTA format. By default, ARWEN assumes that each sequence has a circular topology (search wraps around ends), that both strands should be searched, and that the progress of the search is not reported. These settings can be individually changed to linear topology (no wrapping), search of the sense strand only, and report of search progress.

For each candidate mitochondrial tRNA, secondary structure, anticodon position and amino acid iso-acceptor species are predicted (Fig. 1). If there has been a deviation in the genetic code for a particular anticodon triplet within kingdom Metazoa, then more than one possible iso-acceptor species is reported. An abbreviated output format is also available. In this case, for each sequence in the input file, only the sequence name and tab-delimited information about each gene detected in the sequence are given.

5 DISCUSSION
The results for the Mamit-tRNA gene database and the OGRe database indicate that ARWEN identifies a mitochondrial tRNA in almost every case. ARWEN achieves a detection rate close to 100% for these sequences.

However, a high detection rate may lead to a high level of falsely predicted tRNAs (false positives). By testing ARWEN and tRNAscan-SE on whole-genome sequences, a direct comparison of detection rates and number of false
Table 1. ARWEN and tRNAscan-SE test results for 23 randomly chosen metazoan mitochondrial genomes with RefSeq annotations

                               Reported number of tRNAs        tRNAs not detected            False positives
Organism                       ARWEN  tRNAscan-SE  RefSeq(a)   ARWEN  tRNAscan-SE  Both(b)   ARWEN  tRNAscan-SE  Both(c)
Placozoa
  Trichoplax adhaerens            45           24         22       0            0        0      23            2        1
Porifera
  Axinella corrugata              38           25         25       0            0        0      13            0        0
Cnidaria
  Ricordea florida                 5            2          2       0            0        0       3            0        0
  Briareum asbestinum              7            1          1       0            0        0       6            0        0
Acoelomata
  Echinococcus multilocularis     24            2         22       0           20        0       2            0        0
  Schistosoma mansoni             23            8         23       1           15        0       1            0        0
Pseudocoelomata
  Ancylostoma duodenale           26            8         22       0           15        0       4            1        0
  Ascaris suum                    24            2         22       0           21        0       2            1        0
Mollusca
  Crassostrea virginica           27           18         23       0            5        0       4            0        0
Arthropoda
  Artemia franciscana             25           10         22       0           12        0       3            0        0
  Leptotrombidium pallidum        28            0         21       0           21        0       7            0        0
  Heterodoxus macropus            36           18         22       0            5        0      14            1        0
Echinodermata
  Acanthaster planci              24           20         22       0            2        0       2            0        0
Chordata
  Arenaria interpres              24           21         22       0            1        0       2            0        0
  Chauliodus sloani               34           24         22       0            1        0      12            3        3
  Coreoleuciscus splendidus       25           21         22       0            1        0       3            0        0
  Cottus reinii                   22           20         22       0            2        0       0            0        0
  Dallia pectoralis               22           21         22       0            1        0       0            0        0
  Elephas maximus                 23           21         22       0            1        0       1            0        0
  Onychodactylus fischeri         24           21         22       0            1        0       2            0        0
  Porichthys myriaster            24           19         22       1            3        0       3            0        0
  Saccopharynx lavenbergi         28           20         22       0            2        0       6            0        0
  Semnopithecus entellus          23           20         22       1            2        0       2            0        0
Total                            581          346        469       3          131        0     115            8        4

(a) Number of tRNAs according to RefSeq annotations.
(b) Number of tRNAs not detected when using both ARWEN and tRNAscan-SE.
(c) Number of false positives in common when using both ARWEN and tRNAscan-SE.
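
The detection sensitivities quoted in the Results section for the 23-genome test set can be recomputed from the Total row of Table 1. The following is a minimal Python sketch of that arithmetic and is not part of ARWEN; the function names `sensitivity` and `summarize` and the dictionary layout are illustrative assumptions, and only the counts (469 annotated tRNAs, 3 and 131 missed, 115 and 8 false positives) come from the table.

```python
# Counts copied from the "Total" row of Table 1.
ANNOTATED = 469  # tRNAs according to RefSeq annotations

MISSED = {"ARWEN": 3, "tRNAscan-SE": 131}
FALSE_POSITIVES = {"ARWEN": 115, "tRNAscan-SE": 8}


def sensitivity(annotated: int, missed: int) -> float:
    """Fraction of annotated tRNAs that were detected."""
    if annotated <= 0:
        raise ValueError("annotated must be positive")
    return (annotated - missed) / annotated


def summarize() -> dict:
    """Per-program sensitivity (percent, one decimal) and false-positive count."""
    return {
        tool: {
            "sensitivity_pct": round(100 * sensitivity(ANNOTATED, MISSED[tool]), 1),
            "false_positives": FALSE_POSITIVES[tool],
        }
        for tool in MISSED
    }


if __name__ == "__main__":
    for tool, stats in summarize().items():
        print(f"{tool}: {stats['sensitivity_pct']}% sensitivity, "
              f"{stats['false_positives']} false positives")
```

With these totals the sketch reproduces the 99.4% (ARWEN) and 72.1% (tRNAscan-SE) values given in the text.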
positives is possible. The results presented in Table 1 show that, with the options used, ARWEN is superior to tRNAscan-SE in detecting true tRNAs. ARWEN only misses 3 tRNAs while tRNAscan-SE misses 131 tRNAs or 28% of the total number of 469. The missed tRNAs are not evenly distributed among the genomes. In 6 of the genomes, tRNAscan-SE detects <50% of the annotated tRNAs. In the most extreme case, that of the genome of Leptotrombidium pallidum, tRNAscan-SE detects none of the 21 annotated tRNA genes. When ARWEN and tRNAscan-SE are used together, no tRNAs are missed in any of the genomes tested.

On the other hand, ARWEN predicts 115 false positives while tRNAscan-SE only reports 8. Again the distribution of the false positives predicted by ARWEN is not even. Sixty-four percent of the falsely predicted tRNAs are derived from 6 genomes. Clearly, some genomes have sequence properties that fool the selection filters in the algorithms. Also, with the algorithm used in ARWEN, selectivity will be expected to degrade for genomes with an extraordinarily high G+C content, leading to detection of more false positives. Taken together, it becomes evident that the two software programs are written with different aims. While ARWEN identifies nearly all tRNAs, tRNAscan-SE almost never mis-predicts a tRNA. Combined use of the two will lead to a great improvement in annotation of tRNAs in metazoan mitochondrial sequences. Typically, these genomes have been annotated by first screening for tRNAs by using tRNAscan-SE and then manual identification of missing tRNAs by comparing the genome under investigation with that of a related organism. From an annotator’s point of view, it is better to have too much than
Fig. 1. Example output of the computer program ARWEN from searches on H.sapiens and Ascaris suum mitochondrial genomes. (a) Standard
cloverleaf structure tRNA-Cys from H.sapiens mitochondrion. (b) D-replacement tRNA-Ser from H.sapiens mitochondrion. (c) TV-replacement
tRNA-Val from A.suum mitochondrion.
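
The structures reported in Fig. 1 are assembled around a candidate C-arm hairpin, which Section 2.1 describes as a 5–6 bp stem closing a 6–8 base loop. A minimal Python sketch of that first scanning step is given here for illustration only; it is not taken from the ARWEN source, which is written in C. The names `Hairpin`, `stem_pairs` and `find_hairpins` are hypothetical, and the sketch assumes that a stem is accepted only when every position forms a GC, AT or GT pair, whereas ARWEN penalizes weaker and non-bonding pairs through its scoring rather than rejecting them outright.

```python
from typing import Iterator, NamedTuple, Tuple

# Base pairs treated as bonding in this sketch (DNA alphabet, as in the paper).
PAIRS = {("G", "C"), ("C", "G"), ("A", "T"), ("T", "A"), ("G", "T"), ("T", "G")}


class Hairpin(NamedTuple):
    start: int      # 0-based index of the first stem base
    stem_len: int   # number of base pairs in the stem
    loop_len: int   # number of unpaired bases in the loop


def stem_pairs(seq: str, start: int, stem_len: int, loop_len: int) -> bool:
    """True if every base of the 5' stem strand pairs with the 3' strand."""
    end = start + 2 * stem_len + loop_len  # one past the last stem base
    return all(
        (seq[start + i], seq[end - 1 - i]) in PAIRS for i in range(stem_len)
    )


def find_hairpins(seq: str,
                  stem_range: Tuple[int, int] = (5, 6),
                  loop_range: Tuple[int, int] = (6, 8)) -> Iterator[Hairpin]:
    """Yield every window that folds into a fully paired stem-loop."""
    seq = seq.upper()
    for start in range(len(seq)):
        for stem_len in range(stem_range[0], stem_range[1] + 1):
            for loop_len in range(loop_range[0], loop_range[1] + 1):
                # Skip windows that would run past the end (linear topology).
                if start + 2 * stem_len + loop_len > len(seq):
                    continue
                if stem_pairs(seq, start, stem_len, loop_len):
                    yield Hairpin(start, stem_len, loop_len)
```

For example, `list(find_hairpins("GCATCTTTTTTTGATGC"))` yields a single hairpin with a 5 bp stem and a 7 base loop starting at index 0. The sketch treats the sequence as linear and searches one strand only; the circular wrapping and two-strand search that ARWEN applies by default are left out to keep it short.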
too little. Many falsely predicted tRNAs can be easily removed since they are positioned in protein-encoding sequences.

It should also be noted that RefSeq annotations in at least a few cases seem to be heavily dependent on predictions from tRNAscan-SE. One such case is the genome of Axinella corrugata, where RefSeq annotations are identical to the output produced by tRNAscan-SE. If so, the test results will be biased in favor of tRNAscan-SE.

We have here developed a computer program for detection of metazoan mitochondrial tRNA genes. We see several advantages of releasing a new algorithm. (i) The sensitivity is 100% for most mammalian mitochondrial genomes sequenced so far, and greater than tRNAscan-SE for most metazoan mitochondrial sequences. (ii) ARWEN reports the tRNA secondary structure in an intuitive way, as a cloverleaf diagram. tRNAscan-SE also reports secondary structure; however, the linear representation of the secondary structure is not as easy to interpret. (iii) ARWEN may be run with no options and still produce the desired results. For an inexperienced user, tRNAscan-SE may be more difficult to use, especially when using appropriate translation tables.

ACKNOWLEDGEMENTS
The authors wish to thank the Murdoch University Guild of Students and School of Mathematics for their generous provision of access to Unix computers.

Conflict of Interest: none declared.

REFERENCES
Helm,M. et al. (2000) Search for characteristic structural features of mammalian mitochondrial tRNAs. RNA, 6, 1356–1379.
Jameson,D. et al. (2003) OGRe: a relational database for comparative analysis of mitochondrial genomes. Nucleic Acids Res., 31, 202–206.
Laslett,D. and Canback,B. (2004) ARAGORN, a program for the detection of transfer RNA and transfer-messenger RNA genes in nucleotide sequences. Nucleic Acids Res., 32, 11–16.
Lowe,T.M. and Eddy,S.R. (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res., 25, 955–964.
Pruitt,K.D. et al. (2005) NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res., 33, D501–D504.