Benchmarking library for the TMVec-2 Suite, comparing it to structure alignment methods such as Foldseek and TMAlign.
This repo benchmarks four protein structure similarity methods against TM-Align scores:
- Foldseek: Fast structure comparison using 3Di sequences
- TM-Vec: Neural network model for TM-score prediction from ProtT5-XL embeddings
- TM-Vec 2: Optimized architecture using Lobster-24M foundation model
- TM-Vec 2s: BiLSTM student model distilled from TM-Vec 2
git clone https://github.com/paarth-b/tmvec-bench.git
cd tmvec-bench
Using uv (recommended):
Install uv if not already installed:
wget -qO- https://astral.sh/uv/install.sh | sh
Install dependencies using uv:
uv sync
source .venv/bin/activate
Or using pip:
pip install -r requirements.txt
The provided binary binaries/TMalign requires the x86-64 architecture. For other architectures (e.g., Apple Silicon), download TM-align from the Zhang Group website.
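As a quick way to tell whether the bundled binary applies to your machine, the following is a minimal sketch that checks the host architecture. The helper name `bundled_tmalign_usable` and the set of accepted machine names are assumptions made for illustration; they are not part of this repository.

```python
import platform

# Machine names commonly reported for x86-64 hosts (an assumption of this sketch).
X86_64_NAMES = {"x86_64", "amd64"}


def bundled_tmalign_usable(machine: str = "") -> bool:
    """Return True if the given (or detected) architecture looks like x86-64."""
    name = (machine or platform.machine()).lower()
    return name in X86_64_NAMES


if __name__ == "__main__":
    if bundled_tmalign_usable():
        print("binaries/TMalign should match this architecture")
    else:
        print("Non-x86-64 host: download TM-align from the Zhang Group website")
```

On Apple Silicon, `platform.machine()` typically reports `arm64`, so the check falls through to the download hint.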
Download Foldseek from its GitHub releases page and place the executable at binaries/foldseek:
# Linux AVX2 build (check using: cat /proc/cpuinfo | grep avx2)
wget https://mmseqs.com/foldseek/foldseek-linux-avx2.tar.gz
tar xvzf foldseek-linux-avx2.tar.gz
mv foldseek/bin/foldseek binaries/foldseek
chmod +x binaries/foldseek
# Linux ARM64 build
wget https://mmseqs.com/foldseek/foldseek-linux-arm64.tar.gz
tar xvzf foldseek-linux-arm64.tar.gz
mv foldseek/bin/foldseek binaries/foldseek
chmod +x binaries/foldseek
Verify installation:
binaries/foldseek version
Download the TM-Vec CATH checkpoint:
Using huggingface cli (recommended):
huggingface-cli download scikit-bio/tmvec-cath tm_vec_cath_model.ckpt --local-dir binaries/
Or download manually from HuggingFace Hub and place tm_vec_cath_model.ckpt in binaries/.
# TM-Vec 2 (Lobster-based teacher model)
huggingface-cli download scikit-bio/tmvec-2 --local-dir models/tmvec-2
# TM-Vec 2s (student model) - already provided in binaries/
# File: binaries/tmvec2_student.pt
The configuration file binaries/tm_vec_cath_model_params.json is already included in the repository.
Unzip data/fasta/cath-domain-seqs.zip to get data/fasta/cath-domain-seqs.fa.
unzip data/fasta/cath-domain-seqs.zip -d data/fasta
The benchmarks use the first 1,000 domains from CATH S100 (non-redundant at 100% sequence identity).
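For readers who want to see what "the first 1,000 domains" means in practice, here is a minimal sketch that takes the first N records from a FASTA file. It is an illustration only: the function names are hypothetical, and the repository's own subset may have been produced differently.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def first_n_records(lines: Iterable[str], n: int) -> List[Tuple[str, str]]:
    """Return the first n records in file order."""
    records: List[Tuple[str, str]] = []
    for record in read_fasta(lines):
        if len(records) == n:
            break
        records.append(record)
    return records


# Tiny in-memory example with made-up sequences.
example = [">a", "MK", "LV", ">b", "GG", ">c", "TT"]
print(first_n_records(example, 2))
```

With a real file, pass an open handle instead of the list, for example `first_n_records(open("data/fasta/cath-domain-seqs.fa"), 1000)`.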
The FASTA file is already provided at data/cath-top1k.fa. For convenience, we also provide a zip file of the PDB structures for these 1,000 CATH S100 domains:
unzip data/cath-pdb.zip -d data/
Alternatively, to download the structures for the 1,000 domains from the CATH database:
mkdir -p data/pdb/cath-s100
python src/util/download_structures.py \
--fasta data/cath-top1k.fa \
--output-dir data/pdb/cath-s100 \
--dataset cath
This will download ~1000 PDB structures from RCSB PDB.
The benchmarks use 1,000 domains from SCOPe 2.01 clustered at 40% sequence identity.
The FASTA file is already provided at data/fasta/scope40-1000.fa. For convenience, we also provide a zip file of the PDB structures for these 1,000 SCOPe 2.01 domains, hosted on Google Drive:
wget "https://drive.usercontent.google.com/download?id=1HjtC7Dv-MZABO9wr5PYr5DPLZ6S642P6&export=download&confirm=t" -O data/scope40-pdb.zip
unzip data/scope40-pdb.zip -d data/
Alternatively, to download the structures for the 1,000 domains from the SCOPe database:
mkdir -p data/scope40pdb
python src/util/download_structures.py \
--fasta data/fasta/scope40-1000.fa \
--output-dir data/scope40pdb \
--dataset scope40
This downloads from ASTRAL/RCSB PDB.
Using bash scripts in scripts/ (recommended on clusters):
# This will run the benchmarks on the CATH S100 and SCOPe40 datasets, as well as the time benchmarks and generate the plots.
bash scripts/tmvec2_student.sh
bash scripts/tmvec2.sh
bash scripts/tmvec1.sh
bash scripts/foldseek.sh
bash scripts/tmalign.sh
Alternatively, all benchmark code is in src/benchmarks and src/time_benchmarks and can be run locally:
uv run python -m src.benchmarks.{model_file}
uv run python -m src.time_benchmarks.{time_benchmark_file}
Example:
uv run python -m src.benchmarks.tmvec1
uv run python -m src.time_benchmarks.tmvec1_time_benchmark
NOTE: TMAlign is CPU-based and may take a long time (>10 hours) to generate 500,000 pair scores. For convenience, TMAlign results are already provided in the results/ folder.
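The figure of 500,000 pairs is consistent with scoring every unordered pair among 1,000 domains. That pairing scheme is an assumption here, since the exact pair list is not spelled out above; the arithmetic can be checked as follows.

```python
from math import comb

n_domains = 1000

# Unordered pairs (i, j) with i < j, excluding self-comparisons.
unordered_pairs = comb(n_domains, 2)
print(unordered_pairs)  # 499500

# If the full run takes 10 hours, the average cost per alignment in seconds:
seconds_per_pair = 10 * 3600 / unordered_pairs
print(round(seconds_per_pair, 3))
```

Under that assumption, a 10-hour run works out to roughly 0.07 seconds per pairwise alignment.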
All benchmarks generate CSV files in results/ with the following format:
| seq1_id | seq2_id | tm_score | evalue (Foldseek only) |
|---|---|---|---|
| 107lA00 | 108lA00 | 0.8523 | 1.2e-10 |
| 107lA00 | 109lA00 | 0.7234 | 3.4e-08 |
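To illustrate how a results file in this format can be compared against TM-align ground truth, here is a minimal sketch using only the standard library. It assumes the CSV header uses the column names `seq1_id`, `seq2_id`, and `tm_score` exactly as shown in the table, and that both files list each pair in the same order; the function names and the sample values are made up for the example.

```python
import csv
import io
from math import sqrt


def load_scores(handle):
    """Map (seq1_id, seq2_id) -> tm_score from a CSV with those column names."""
    return {
        (row["seq1_id"], row["seq2_id"]): float(row["tm_score"])
        for row in csv.DictReader(handle)
    }


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_with_truth(pred, truth):
    """Correlate predicted and reference scores over the pairs present in both."""
    shared = sorted(set(pred) & set(truth))
    return pearson([pred[k] for k in shared], [truth[k] for k in shared])


# Made-up values, standing in for a model's CSV and the TM-align CSV.
pred_csv = "seq1_id,seq2_id,tm_score\na,b,0.20\na,c,0.50\nb,c,0.90\n"
truth_csv = "seq1_id,seq2_id,tm_score\na,b,0.25\na,c,0.45\nb,c,0.95\n"
pred = load_scores(io.StringIO(pred_csv))
truth = load_scores(io.StringIO(truth_csv))
print(correlation_with_truth(pred, truth))
```

For real files, replace the `io.StringIO` objects with `open(...)` on the relevant CSVs in results/.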
To generate plots from the results, follow the README instructions in each of the following directories. Generated plots are available in the plots subfolders.
# CATH visualizations
cd src/plotting/cath
# SCOPe visualizations
cd src/plotting/scope
# Runtime benchmarks
cd src/plotting/time
Plots are saved to figures/ and include:
- ROC curves (homology detection at different classification levels)
- PR curves (precision-recall)
- Density scatter plots (predicted vs. true TM-scores)
- Runtime comparisons (encoding and query times)
To validate the results in the ISMB 2026 paper:
- Table 1 (Prediction Accuracy): Run all benchmarks on both CATH and SCOPe40, then compare the generated CSVs against TM-align ground truth using the plotting notebooks.
- Figure 4 (TM-score Prediction): Generate density scatter plots showing correlation between predicted and true TM-scores.
- Figure 5 (Homology Detection): Use the ground truth classification files to compute ROC/PR curves at different hierarchy levels (Class → Superfamily/Family).
- Supplementary Tables (Runtime): Time benchmarks are in src/time_benchmarks/. Results should match the encoding/query time tables.
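The homology-detection evaluation described for Figure 5 can be sketched as follows: a pair counts as a positive at a given hierarchy level when both domains share the first `level` fields of their dotted classification code, and the scores are then summarized by the area under the ROC curve. This is an illustration under stated assumptions, not the repository's plotting code; the function names, the classification codes, and the scores below are hypothetical.

```python
def same_at_level(code_a: str, code_b: str, level: int) -> bool:
    """True if two dotted classification codes share their first `level` fields."""
    return code_a.split(".")[:level] == code_b.split(".")[:level]


def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    Quadratic in the number of pairs, so suitable only for small examples.
    """
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    # Each positive/negative comparison scores 1 for a win and 0.5 for a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical classification codes and predicted TM-scores.
codes = {"d1": "1.10.530.40", "d2": "1.10.530.10", "d3": "3.40.50.300"}
pair_scores = {("d1", "d2"): 0.80, ("d1", "d3"): 0.30, ("d2", "d3"): 0.35}

level = 3  # compare the first three fields of each code
pairs = sorted(pair_scores)
labels = [same_at_level(codes[a], codes[b], level) for a, b in pairs]
scores = [pair_scores[p] for p in pairs]
print(roc_auc(scores, labels))
```

For the full set of pairs, a library routine such as scikit-learn's `roc_auc_score` is a more practical choice than this quadratic loop.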