TM-Vec 2 Benchmarking

Benchmarking library for the TM-Vec 2 suite, comparing it against structure alignment methods such as Foldseek and TM-align.

Description

This repo benchmarks four protein structure similarity methods against TM-align scores:

Foldseek: Fast structure comparison using 3Di sequences
TM-Vec: Neural network model for TM-score prediction from ProtT5-XL embeddings
TM-Vec 2: Optimized architecture using Lobster-24M foundation model
TM-Vec 2s: BiLSTM student model distilled from TM-Vec 2

Installation

1. Clone Repository

git clone https://github.com/paarth-b/tmvec-bench.git
cd tmvec-bench

2. Install Python Dependencies

Using uv (recommended):

Install uv if not already installed:

wget -qO- https://astral.sh/uv/install.sh | sh

Install dependencies using uv:

uv sync
source .venv/bin/activate

Or using pip:

pip install -r requirements.txt

3. Download Required Binaries

TMalign Binary

The provided binary binaries/TMalign requires the x86-64 architecture. For other architectures (e.g., Apple Silicon), download a suitable build from the Zhang Group website.

Foldseek Binary

Download from Foldseek GitHub releases. Place the Foldseek executable in binaries/foldseek:

# Linux AVX2 build (check using: cat /proc/cpuinfo | grep avx2)
wget https://mmseqs.com/foldseek/foldseek-linux-avx2.tar.gz
tar xvzf foldseek-linux-avx2.tar.gz
mv foldseek/bin/foldseek binaries/foldseek
chmod +x binaries/foldseek

# Linux ARM64 build
wget https://mmseqs.com/foldseek/foldseek-linux-arm64.tar.gz
tar xvzf foldseek-linux-arm64.tar.gz
mv foldseek/bin/foldseek binaries/foldseek
chmod +x binaries/foldseek

Verify installation:

binaries/foldseek version

4. Download Model Checkpoints

TM-Vec (original model)

Download the TM-Vec CATH checkpoint:

Using the Hugging Face CLI (recommended):

huggingface-cli download scikit-bio/tmvec-cath tm_vec_cath_model.ckpt --local-dir binaries/

Or download it manually from the Hugging Face Hub and place tm_vec_cath_model.ckpt in binaries/.

TM-Vec 2 Models

# TM-Vec 2 (Lobster-based teacher model)
huggingface-cli download scikit-bio/tmvec-2 --local-dir models/tmvec-2

# TM-Vec 2s (student model) - already provided in binaries/
# File: binaries/tmvec2_student.pt

The configuration file binaries/tm_vec_cath_model_params.json is already included in the repository.
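
Before running anything, it can help to confirm that the files from the steps above ended up where the benchmarks expect them. The following is a minimal sketch, not part of the repository: the function name `missing_files` is hypothetical, and the path list is copied from the setup instructions above. The models/tmvec-2 directory is left out because its file names are not listed here.

```python
from pathlib import Path

# Paths copied from the installation steps above; adjust if your layout differs.
REQUIRED_FILES = [
    "binaries/TMalign",
    "binaries/foldseek",
    "binaries/tm_vec_cath_model.ckpt",
    "binaries/tm_vec_cath_model_params.json",
    "binaries/tmvec2_student.pt",
]


def missing_files(repo_root=".", required=REQUIRED_FILES):
    """Return the required relative paths that are not regular files under repo_root."""
    root = Path(repo_root)
    return [rel for rel in required if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected binaries and checkpoints are present.")
```

Run it from the repository root; an empty result means every listed file was found.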

Dataset Setup

CATH S100 Fasta File

Unzip data/fasta/cath-domain-seqs.zip to get data/fasta/cath-domain-seqs.fa.

unzip data/fasta/cath-domain-seqs.zip -d data/fasta

CATH S100 Dataset

The benchmarks use the first 1,000 domains from CATH S100 (non-redundant at 100% sequence identity).
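
As an illustration of what "the first 1,000 domains" means in practice, here is a small sketch that takes the first n records from a FASTA file. It is an assumption about how such a subset could be produced, not the script used to build the provided file; the function name `first_n_fasta_records` is hypothetical.

```python
def first_n_fasta_records(lines, n):
    """Yield (header, sequence) for the first n records in FASTA-formatted lines."""
    if n <= 0:
        return
    header, chunks, count = None, [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
                count += 1
                if count >= n:
                    return
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    # Emit the final record when the input ends before n records were produced.
    if header is not None and count < n:
        yield header, "".join(chunks)
```

For example, passing an open handle on data/fasta/cath-domain-seqs.fa with n=1000 would yield the first 1,000 records in file order.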

The FASTA file is already provided at data/cath-top1k.fa. For convenience, we also provide a zip file with the PDB structures of the first 1,000 domains of CATH S100:

unzip data/cath-pdb.zip -d data/

Alternatively, to download the structures for the 1,000 domains from the CATH database:

mkdir -p data/pdb/cath-s100

python src/util/download_structures.py \
    --fasta data/cath-top1k.fa \
    --output-dir data/pdb/cath-s100 \
    --dataset cath

This will download ~1000 PDB structures from RCSB PDB.

SCOPe40 Dataset

The benchmarks use 1,000 domains from SCOPe 2.01 clustered at 40% sequence identity.

The FASTA file is already provided at data/fasta/scope40-1000.fa. For convenience, we also provide a zip file, hosted on Google Drive, with the PDB structures of the first 1,000 domains of SCOPe 2.01:

wget "https://drive.usercontent.google.com/download?id=1HjtC7Dv-MZABO9wr5PYr5DPLZ6S642P6&export=download&confirm=t" -O data/scope40-pdb.zip
unzip data/scope40-pdb.zip -d data/

Alternatively, to download the structures for the 1,000 domains from the SCOPe database:

mkdir -p data/scope40pdb

python src/util/download_structures.py \
    --fasta data/fasta/scope40-1000.fa \
    --output-dir data/scope40pdb \
    --dataset scope40

This downloads from ASTRAL/RCSB PDB.

Running Benchmarks

Using bash scripts in scripts/ (recommended on clusters):

# Run the benchmarks on the CATH S100 and SCOPe40 datasets, run the time benchmarks, and generate the plots.
bash scripts/tmvec2_student.sh
bash scripts/tmvec2.sh
bash scripts/tmvec1.sh
bash scripts/foldseek.sh
bash scripts/tmalign.sh

Alternatively, the benchmark modules in src/benchmarks and src/time_benchmarks can be run locally:

uv run python -m src.benchmarks.{model_file}
uv run python -m src.time_benchmarks.{time_benchmark_file}

Example:

uv run python -m src.benchmarks.tmvec1
uv run python -m src.time_benchmarks.tmvec1_time_benchmark

NOTE: TM-align runs on the CPU and may take a long time (>10 hours) to generate roughly 500,000 pair scores. For convenience, precomputed TM-align results are already included in the results/ folder.
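
The figure of roughly 500,000 pairs is consistent with an all-vs-all comparison of 1,000 domains, assuming each unordered pair is scored once and self-pairs are skipped (an assumption, not something the scripts are confirmed to do). A short sketch of that arithmetic, with a hypothetical helper name:

```python
from itertools import combinations


def n_unordered_pairs(n):
    """Number of distinct unordered pairs among n items (no self-pairs)."""
    return n * (n - 1) // 2


# Enumerating pairs for a few example IDs gives the same count as the formula.
example_ids = ["107lA00", "108lA00", "109lA00"]
example_pairs = list(combinations(example_ids, 2))

print(n_unordered_pairs(1000))  # pairs for a 1,000-domain dataset
print(len(example_pairs) == n_unordered_pairs(len(example_ids)))
```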

Output Files

Similarity Results

All benchmarks generate CSV files in results/ with the following format:

seq1_id	seq2_id	tm_score	evalue (Foldseek only)
107lA00	108lA00	0.8523	1.2e-10
107lA00	109lA00	0.7234	3.4e-08
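
A results file in this format can be loaded with the standard library alone. The sketch below assumes a header row with columns named seq1_id, seq2_id, and tm_score, as in the sample above; the delimiter is a parameter because the files are described as CSV while the sample shows tab-separated columns. The function name `read_scores` is hypothetical.

```python
import csv


def read_scores(path, delimiter=","):
    """Map (seq1_id, seq2_id) -> tm_score from a results file with a header row."""
    scores = {}
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        for row in reader:
            # Extra columns such as the Foldseek e-value are ignored here.
            scores[(row["seq1_id"], row["seq2_id"])] = float(row["tm_score"])
    return scores
```

Pass delimiter="\t" if the files turn out to be tab-separated.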

Visualization

To generate plots from the results, follow the README instructions in each of the following directories. Generated plots are available in the plots subfolders.

# CATH visualizations
cd src/plotting/cath

# SCOPe visualizations
cd src/plotting/scope

# Runtime benchmarks
cd src/plotting/time

Plots are saved to figures/ and include:

ROC curves (homology detection at different classification levels)
PR curves (precision-recall)
Density scatter plots (predicted vs. true TM-scores)
Runtime comparisons (encoding and query times)
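
For readers who want a single number behind a ROC curve, the area under it can be computed directly from scores and binary labels. This is a generic sketch using the rank-sum (Mann-Whitney U) identity, not the plotting code in this repository; it assumes each pair already has a label saying whether the two domains share the classification level of interest.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity."""
    pairs = sorted(zip(scores, labels), key=lambda p: p[0])
    n_pos = sum(1 for _, y in pairs if y)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the run of tied scores starting at index i.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        # Tied scores share the mean of the 1-based ranks i+1 .. j.
        avg_rank = (i + 1 + j) / 2.0
        rank_sum_pos += avg_rank * sum(1 for k in range(i, j) if pairs[k][1])
        i = j

    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_stat / (n_pos * n_neg)
```

A value of 1.0 means every positive pair scores above every negative pair, and 0.5 corresponds to a random ranking.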

Validation of Published Results

To validate the results in the ISMB 2026 paper:

Table 1 (Prediction Accuracy): Run all benchmarks on both CATH and SCOPe40, then compare the generated CSVs against TM-align ground truth using the plotting notebooks.
Figure 4 (TM-score Prediction): Generate density scatter plots showing correlation between predicted and true TM-scores.
Figure 5 (Homology Detection): Use the ground truth classification files to compute ROC/PR curves at different hierarchy levels (Class → Superfamily/Family).
Supplementary Tables (Runtime): Time benchmarks are in src/time_benchmarks/. Results should match the encoding/query time tables.
