Markus edited this page Dec 1, 2025 · 45 revisions

TD2 - a tool to find coding regions within transcripts


TD2 follows a workflow and command syntax similar to those of TransDecoder. For documentation on the original TransDecoder, see https://github.com/TransDecoder/TransDecoder/wiki

TD2 identifies candidate coding regions within transcript sequences, such as those generated by de novo RNA-Seq transcript assembly using Trinity, or constructed based on RNA-Seq alignments to the genome using StringTie. TD2 is a de novo ORF finder. If a reference genome annotation is available, we recommend using ORFanage (https://github.com/alevar/ORFanage).

TD2 can be applied to an entire transcriptome for a single organism involving thousands of transcript sequences as input.

Unlike the original TransDecoder, TD2 can also be run on small sets or even individual transcripts. Because it uses a pre-trained protein model, TD2 produces stable predictions regardless of transcript origin, even in transcripts from mixtures of organisms.

TD2 identifies likely coding sequences based on the following criteria:

  • a minimum length open reading frame (ORF) is found in a transcript sequence

AND (

OR

  • the length of an ORF is longer than would be expected based on an extreme value analysis of ORF lengths, given a specified false discovery rate

)
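The length criterion can be illustrated with a small sketch. This is not TD2's actual procedure: it assumes, purely for illustration, that the lengths of ORFs arising by chance (`null_lengths`, a hypothetical sample, for example from shuffled transcripts) follow a Gumbel distribution fitted by the method of moments, and that `alpha` stands in for the specified false discovery rate as a simple tail probability.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def gumbel_length_threshold(null_lengths, alpha=0.05):
    """Return the ORF length exceeded with probability alpha under a fitted Gumbel.

    Illustration only: null_lengths is a hypothetical sample of chance ORF
    lengths, and the fit uses the method of moments.
    """
    mean = statistics.mean(null_lengths)
    stdev = statistics.stdev(null_lengths)
    beta = stdev * math.sqrt(6) / math.pi   # scale parameter
    mu = mean - EULER_GAMMA * beta          # location parameter
    # Gumbel CDF: F(x) = exp(-exp(-(x - mu) / beta)); solve F(x) = 1 - alpha.
    return mu - beta * math.log(-math.log(1.0 - alpha))
```

An ORF longer than the returned threshold would be "longer than expected" under this toy null model; a smaller `alpha` gives a stricter (longer) threshold.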

Obtaining TD2

TD2 is available via pip. We intend to release TD2 on more platforms depending on user need.

pip install TD2

It may help to install TD2 in a virtual environment 🙂

python3 -m venv /path/to/new/virtual/environment
source /path/to/new/virtual/environment/bin/activate
pip install TD2

or...

conda create -n TD2_env python=3.11
conda activate TD2_env
pip install TD2

Running TD2

Predicting coding regions from a transcript fasta file

The 'TD2' utility is run on a fasta file containing the target transcript sequences. The simplest usage is as follows:

Step 1: extract the long open reading frames

TD2.LongOrfs -t target_transcripts.fasta [ options ]

By default, TD2.LongOrfs will identify ORFs that are at least 90 amino acids long. You can modify this via the '-m' parameter, but note that the rate of false positive ORF predictions increases drastically with shorter minimum length requirements.

If the transcripts are oriented according to the sense strand, then include the -S flag to examine only the top strand.

TD2.LongOrfs --help will give more information on all available options.
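To make the minimum-length and strand options concrete, here is a minimal sketch of an ORF scan. It is not how TD2.LongOrfs is implemented: it assumes only complete ORFs (ATG through a stop codon) and the standard stop codons, and the names `find_orfs`, `min_aa`, and `top_strand_only` are hypothetical stand-ins for the behavior of '-m' and '-S'.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_aa=90, top_strand_only=False):
    """Yield (strand, start, end) for complete ORFs with at least min_aa codons.

    Coordinates are 0-based, end-exclusive, and relative to the scanned strand.
    """
    seq = seq.upper()
    strands = [("+", seq)]
    if not top_strand_only:
        strands.append(("-", reverse_complement(seq)))
    for strand, s in strands:
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    n_aa = (i - start) // 3  # codons before the stop
                    if n_aa >= min_aa:
                        yield strand, start, i + 3
                    start = None
```

Lowering `min_aa` in this sketch quickly increases the number of reported ORFs, which mirrors why short minimum lengths inflate false positives.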

Step 2: (optional)

Optionally, identify ORFs with homology to known proteins via MMseqs2, blastp, or HMMER3 searches.

See the "Including homology searches as ORF retention criteria" section below.

Step 3: predict the likely coding regions

TD2.Predict -t target_transcripts.fasta [ options ]

The final set of candidate coding regions can be found as files '*.TD2.*', where extensions include .pep, .cds, .gff3, and .bed

Starting from a genome-based transcript structure GTF file (e.g. cufflinks or stringtie)

The process here is identical to the above with the exception that we must first generate a fasta file corresponding to the transcript sequences, and in the end, we recompute a genome annotation file in GFF3 format that describes the predicted coding regions in the context of the genome.

Construct the transcript fasta file using the genome and the transcripts.gtf file like so:

util/gtf_genome_to_cdna_fasta.pl transcripts.gtf test.genome.fasta > transcripts.fasta 

Next, convert the transcript structure GTF file to an alignment-GFF3 formatted file (this is done only because our processes operate on gff3 rather than the starting gtf file - nothing of great consequence). Convert gtf to alignment-gff3 like so, using cufflinks GTF output as an example:

util/gtf_to_alignment_gff3.pl transcripts.gtf > transcripts.gff3

Now, run the process described above to generate your best candidate ORF predictions:

TD2.LongOrfs -t transcripts.fasta [ options ]
(you should probably use the "-S" option with TD2.LongOrfs, assuming your transcripts have a known splicing direction)

(optionally, identify peptides with homology to known proteins)

TD2.Predict -t transcripts.fasta [ options ]

And finally, generate a genome-based coding region annotation file:

util/cdna_alignment_orf_to_genome_orf.pl \
     transcripts.fasta.TD2.gff3 \
     transcripts.gff3 \
     transcripts.fasta > transcripts.fasta.TD2.genome.gff3
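If you want to script the genome-based workflow, the steps above can be laid out as data before anything is executed. The sketch below only builds and prints the command lines shown in this section; the function names and the `prefix` parameter are hypothetical, and the '-S' flag is included on the assumption that your transcripts are stranded.

```python
import shlex


def genome_guided_commands(gtf, genome, prefix="transcripts"):
    """Return (argv, stdout_path) pairs for the GTF-based workflow.

    stdout_path is None when the command's output is not redirected.
    """
    fasta = f"{prefix}.fasta"
    gff3 = f"{prefix}.gff3"
    return [
        (["util/gtf_genome_to_cdna_fasta.pl", gtf, genome], fasta),
        (["util/gtf_to_alignment_gff3.pl", gtf], gff3),
        (["TD2.LongOrfs", "-t", fasta, "-S"], None),
        (["TD2.Predict", "-t", fasta], None),
        (["util/cdna_alignment_orf_to_genome_orf.pl",
          f"{fasta}.TD2.gff3", gff3, fasta], f"{fasta}.TD2.genome.gff3"),
    ]


def as_shell(steps):
    """Render the steps as shell command lines for review (dry run)."""
    lines = []
    for argv, out in steps:
        cmd = shlex.join(argv)
        lines.append(f"{cmd} > {shlex.quote(out)}" if out else cmd)
    return "\n".join(lines)
```

Calling `print(as_shell(genome_guided_commands("transcripts.gtf", "test.genome.fasta")))` shows the commands for review; to actually run a step you could pass its `argv` to `subprocess.run(..., check=True)` with `stdout` opened on the redirect target.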

Output files explained

A working directory (e.g. transcripts.TD2_dir/) is created to run and store intermediate parts of the pipeline, and contains:

longest_orfs.pep   : all ORFs meeting the minimum length criteria, regardless of coding potential.
longest_orfs.gff3  : positions of all ORFs as found in the target transcripts
longest_orfs.cds   : the nucleotide coding sequence for all detected ORFs    
psauron_score.csv  : the PSAURON scores for each ORF

Then, the final outputs are reported in your current working directory:

transcripts.fasta.TD2.pep : peptide sequences for the final candidate ORFs; all shorter candidates within longer ORFs were removed.
transcripts.fasta.TD2.cds  : nucleotide sequences for coding regions of the final candidate ORFs
transcripts.fasta.TD2.gff3 : positions within the target transcripts of the final selected ORFs
transcripts.fasta.TD2.bed  : bed-formatted file describing ORF positions, best for viewing using GenomeView or IGV.
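A quick way to sanity-check the results is to summarize the peptide file. The sketch below is a plain FASTA reader, not part of TD2; it assumes that the record ID is the first whitespace-delimited word of each header and that a trailing '*' (if present) marks a stop codon rather than a residue.

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def peptide_lengths(handle):
    """Map each record ID to its peptide length, ignoring a trailing '*'."""
    return {h.split()[0]: len(s.rstrip("*")) for h, s in read_fasta(handle)}
```

For example, `with open("transcripts.fasta.TD2.pep") as fh: lengths = peptide_lengths(fh)` gives a dictionary you can use to count ORFs or inspect their length distribution.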

Including homology searches as ORF retention criteria

To further maximize sensitivity for capturing ORFs that may have functional significance, regardless of the coding likelihood score mentioned above, you can scan all ORFs for homology to known proteins and retain all such ORFs. This can be done in multiple ways: a BLAST search against a database of known proteins, a Pfam search to identify common protein domains, or an MMseqs2 search against either sequence or profile databases.

After running TD2.LongOrfs, you'll find a multi-fasta protein file called '${transcripts_file}.TD2_dir/longest_orfs.pep'. Search these candidate peptides for homology using approaches below.

We suggest using MMseqs2 (https://github.com/soedinglab/MMseqs2) for increased speed and sensitivity. TD2 also accepts input from blastp and Pfam searches.

For example, to download Swiss-Prot and search the candidate peptides against it:

mmseqs databases UniProtKB/Swiss-Prot swissprot tmp
mmseqs easy-search transcripts.TD2_dir/longest_orfs.pep swissprot alnRes.m8 tmp -s 7.0

The sensitivity of this search can be adjusted with -s. We suggest a setting of -s 7.0 for a very sensitive search.
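Before passing the results to TD2.Predict, it can be useful to see which ORFs actually received hits. The sketch below assumes the default MMseqs2 tabular (.m8) layout of twelve tab-separated columns (query, target, identity, alignment length, mismatches, gap openings, query start/end, target start/end, e-value, bit score). The `max_evalue` cutoff of 1e-5 is an arbitrary illustration, not a TD2 default, and the sketch says nothing about how --retain-mmseqs-hits treats the file internally.

```python
M8_FIELDS = ["query", "target", "identity", "alnlen", "mismatch", "gapopen",
             "qstart", "qend", "tstart", "tend", "evalue", "bits"]


def queries_with_hits(handle, max_evalue=1e-5):
    """Return the set of query IDs with at least one hit at or below max_evalue."""
    kept = set()
    for line in handle:
        row = line.rstrip("\n").split("\t")
        if len(row) < len(M8_FIELDS):
            continue  # skip blank or truncated lines
        hit = dict(zip(M8_FIELDS, row))
        if float(hit["evalue"]) <= max_evalue:
            kept.add(hit["query"])
    return kept
```

For example, `with open("alnRes.m8") as fh: print(len(queries_with_hits(fh)))` reports how many ORFs have a hit under the chosen cutoff.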

MMseqs2 can be used to search against a wide variety of databases:

# mmseqs databases
Usage: mmseqs databases <name> <o:sequenceDB> <tmpDir> [options]

  Name                  Type        Taxonomy  Url
- UniRef100             Aminoacid   yes       https://www.uniprot.org/help/uniref
- UniRef90              Aminoacid   yes       https://www.uniprot.org/help/uniref
- UniRef50              Aminoacid   yes       https://www.uniprot.org/help/uniref
- UniProtKB             Aminoacid   yes       https://www.uniprot.org/help/uniprotkb
- UniProtKB/TrEMBL      Aminoacid   yes       https://www.uniprot.org/help/uniprotkb
- UniProtKB/Swiss-Prot  Aminoacid   yes       https://uniprot.org
- NR                    Aminoacid   yes       https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA
- NT                    Nucleotide  -         https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA
- GTDB                  Aminoacid   yes       https://gtdb.ecogenomic.org
- PDB                   Aminoacid   -         https://www.rcsb.org
- PDB70                 Profile     -         https://github.com/soedinglab/hh-suite
- Pfam-A.full           Profile     -         https://pfam.xfam.org
- Pfam-A.seed           Profile     -         https://pfam.xfam.org
- Pfam-B                Profile     -         https://xfam.wordpress.com/2020/06/30/a-new-pfam-b-is-released
- CDD                   Profile     -         https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
- eggNOG                Profile     -         http://eggnog5.embl.de
- VOGDB                 Profile     -         https://vogdb.org
- dbCAN2                Profile     -         http://bcb.unl.edu/dbCAN2
- SILVA                 Nucleotide  yes       https://www.arb-silva.de
- Resfinder             Nucleotide  -         https://cge.cbs.dtu.dk/services/ResFinder
- Kalamari              Nucleotide  yes       https://github.com/lskatz/Kalamari

Detailed notes for MMseqs2 usage can be found here: https://github.com/soedinglab/mmseqs2/wiki

Integrating homology search results into coding region selection

The outputs generated above can be leveraged by TD2 to ensure that those peptides with blast hits or domain hits are retained in the set of reported likely coding regions. Run TD2.Predict like so:

TD2.Predict -t target_transcripts.fasta --retain-mmseqs-hits alnRes.m8

If you have multiple MMseqs2 search results, the alnRes.m8 files can simply be concatenated into a single file and run like so:

cat file1.m8 file2.m8 > combined_alnRes.m8
TD2.Predict -t target_transcripts.fasta --retain-mmseqs-hits combined_alnRes.m8
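Plain concatenation works as long as every input file ends with a newline; otherwise the last record of one file is fused with the first record of the next. As a hedged alternative, the sketch below merges result files while guarding against that case and dropping blank lines. The function name `merge_m8` is hypothetical and not part of TD2.

```python
def merge_m8(paths, out_path):
    """Concatenate tabular search results into out_path.

    Blank lines are skipped and a missing final newline is repaired.
    Returns the number of records written.
    """
    n_written = 0
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as fh:
                for line in fh:
                    if not line.strip():
                        continue
                    out.write(line if line.endswith("\n") else line + "\n")
                    n_written += 1
    return n_written
```

For example, `merge_m8(["file1.m8", "file2.m8"], "combined_alnRes.m8")` produces a file that can be passed to --retain-mmseqs-hits as shown above.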

The final coding region predictions will now include both the regions that have sequence characteristics consistent with coding regions and those that have demonstrated homology to sequence or profile databases.

Hardware Requirements

TD2 does not require a GPU to run, though a GPU will speed up the Predict step if one is available. All analyses for the preprint were conducted on a MacBook M2 laptop.

Referencing TD2

Preprint here: https://www.biorxiv.org/content/10.1101/2025.04.13.648579v1

A. Mao, H. J. Ji, B. Haas, S. Salzberg, M. J. Sommer, TD2: finding protein coding regions in transcripts, bioRxiv (2025), p. 2025.04.13.648579.