This project is the codebase for Protap, a comprehensive benchmark that systematically compares backbone architectures, pretraining strategies, and domain-specific models across diverse and realistic downstream protein applications.
| Model | Input Modalities | Pretrain Data | #Params | Objective | Source |
|---|---|---|---|---|---|
| 🔴 (*) EGNN | AA Seq & 3D Coord | Swiss-Prot 540k | 10M | MLM, MVCL, PFP | ICML 2021 |
| 🔴 (*) SE(3) Transformer | AA Seq & 3D Coord | Swiss-Prot 540k | 4M | MLM, MVCL, PFP | NeurIPS 2020 |
| 🔴 (*) GVP | AA Seq & 3D Coord | Swiss-Prot 540k | 0.2M | MLM, MVCL, PFP | ICLR 2021 |
| 🔴 (*) ProteinBERT | AA Seq | Swiss-Prot 540k | 72M | MLM, MVCL, PFP | Bioinformatics 2022 |
| 🔴 (*) D-Transformer | AA Seq & 3D Coord | Swiss-Prot 540k | 3.5M | MLM, MVCL, PFP | ArXiv 2025, ICLR 2023 |
| 🔵 (#) ESM2 | AA Seq | UR50 70M | 650M | MLM | Science 2023 |
- 🔴 (*) Trained from scratch. For these models, we provide complete training code as well as downstream evaluation scripts.
- 🔵 (#) Uses publicly available pretrained weights.
- AA Seq: amino acid sequence
- 3D Coord: 3D coordinates of protein structures
(I) Masked Language Modeling (MLM) is a self-supervised objective designed to recover masked residues in protein sequences.
(II) Multi-View Contrastive Learning (MVCL) leverages protein structural information by aligning representations of biologically correlated substructures (a standard formulation is sketched below).
(III) Protein Family Prediction (PFP) introduces functional and structural supervision by training models to predict family labels from protein sequences and 3D structures.
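
For concreteness, the MVCL alignment term is typically instantiated as an InfoNCE-style contrastive loss. The following is a sketch assuming that standard form (not necessarily the exact loss used in this repo), where $z_i$ and $z_i'$ are the representations of two biologically correlated substructures of protein $i$, $\mathrm{sim}(\cdot,\cdot)$ is a similarity function such as cosine similarity, $N$ is the batch size, and $\tau$ is the temperature exposed as `--temperature` in the pretraining scripts:

$$
\mathcal{L}_{\mathrm{MVCL}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\bigl(\mathrm{sim}(z_i, z_i') / \tau\bigr)}{\sum_{j=1}^{N} \exp\bigl(\mathrm{sim}(z_i, z_j') / \tau\bigr)}
$$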
| Model | Input Modalities | Pretrain Data | #Params | Objective | Source | GitHub |
|---|---|---|---|---|---|---|
| 🟤 ($) ClipZyme | AA Seq & 3D Coord & SMILES | — | 14.8M | PFS | ICML 2024 | |
| 🟤 ($) UniZyme | AA Seq & 3D Coord | Swiss-Prot 11k | 15.5M | PFS | ArXiv 2025 | |
| 🟤 ($) DeepProtacs | AA Seq & 3D Coord & SMILES | — | 0.1M | PROTACs | Nat. Comm 2022 | |
| 🟤 ($) ET-Protacs | AA Seq & 3D Coord & SMILES | — | 5.4M | PROTACs | Brief Bioinf 2025 | |
| 🟤 ($) KDBNet | AA Seq & 3D Coord & SMILES | — | 3.4M | PLI | Nat. Mach. Intell 2023 | |
| 🟤 ($) MONN | AA Seq & 3D Coord | — | 1.7M | PLI | Cell Systems 2024 | |
| 🟤 ($) DeepFRI | AA Seq & 3D Coord | Pfam 10M | 1.8M | AFP | Nat. Comm 2021 | |
| 🟤 ($) DPFunc | AA Seq & 3D Coord & Domain | — | 110M | AFP | Nat. Comm 2025 | |
- 🟤 ($) Domain-specific models tailored for specific biological tasks. For these models, we provide GitHub links.
- PFS: Enzyme-Catalyzed Protein Cleavage Site Prediction
- PROTACs: Targeted Protein Degradation
- PLI: Protein–Ligand Interactions
- AFP: Protein Function Annotation Prediction
```bash
conda create -n protap python=3.12
conda activate protap
pip install -r requirements.txt
```
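
A quick, optional sanity check that the environment is usable (this assumes `requirements.txt` installs a CUDA-enabled PyTorch build, which the `torchrun`/`--bf16` examples below rely on):

```bash
# Hypothetical check, not part of the repo: confirm PyTorch imports and sees a GPU
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```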
To pretrain from scratch on the Swiss-Prot 540k dataset, simply execute the corresponding bash script for each model. The pretraining strategy and other parameters are customizable. An example of the bash script arguments is shown below:
```bash
torchrun --nproc_per_node=8 egnn_pretrain.py \
    --model_name_or_path "protap/egnn" \
    --data_path "./data/protein_family_2" \
    --bf16 True \
    --output_dir "./checkpoints/egnn/" \
    --run_name 'egnn-pretrain-family-0419' \
    --residue_prediction False \
    --subseq_length 50 \
    --max_nodes 50 \
    --temperature 0.01 \
    --task 'family_prediction' \
    --num_train_epochs 70 \
    --per_device_train_batch_size 48 \
    --per_device_eval_batch_size 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 5000 \
    --save_total_limit 1 \
    --learning_rate 1e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.05 \
    --lr_scheduler_type "constant_with_warmup" \
    --logging_steps 1 \
    --fsdp no_shard
```

| Parameter | Description | Example Value |
|---|---|---|
| **Distributed & Hardware** | | |
| `--nproc_per_node` | Number of GPUs per node for `torchrun` | `8` |
| `--fsdp` | FSDP sharding strategy: `no_shard`, `full_shard`, or `auto_wrap` | `no_shard` |
| **Model & Dataset** | | |
| `--model_name_or_path` | Hugging Face model name or local pretrained checkpoint | `"protap/egnn"` |
| `--data_path` | Path to the pretraining dataset | `"./data/protein_family_2"` |
| `--residue_prediction` | Enable residue-level prediction (e.g., MLM or site prediction) | `False` |
| `--task` | Pretraining task: `family_prediction`, `residue_prediction`, or `MVCL` | `'family_prediction'` |
| **Sequence & Graph Settings** | | |
| `--subseq_length` | Maximum subsequence length (number of residues per sample) | `50` |
| `--max_nodes` | Maximum number of graph nodes (residues) | `50` |
| `--temperature` | Temperature for contrastive/probabilistic losses | `0.01` |
| **Training Loop** | | |
| `--num_train_epochs` | Total number of training epochs | `70` |
| `--per_device_train_batch_size` | Training batch size per GPU | `48` |
| `--per_device_eval_batch_size` | Evaluation batch size per GPU | `4` |
| `--learning_rate` | Initial learning rate | `1e-3` |
| `--weight_decay` | Weight decay (L2 regularization) | `0.` |
| `--warmup_ratio` | Fraction of steps used for LR warmup | `0.05` |
| `--lr_scheduler_type` | LR schedule: `constant`, `linear`, `cosine`, `constant_with_warmup` | `"constant_with_warmup"` |
| **Evaluation & Logging** | | |
| `--evaluation_strategy` | When to run evaluation: `no`, `steps`, `epoch` | `"no"` |
| `--logging_steps` | Logging interval in steps | `1` |
| **Checkpointing** | | |
| `--output_dir` | Directory to save checkpoints and weights | `"./checkpoints/egnn/"` |
| `--run_name` | Run name for wandb logging | `'egnn-pretrain-family-0419'` |
| `--save_strategy` | When to save checkpoints: `steps` or `epoch` | `"steps"` |
| `--save_steps` | Save a checkpoint every N steps | `5000` |
| `--save_total_limit` | Maximum number of checkpoints to keep (older ones are deleted) | `1` |
| **Precision & Performance** | | |
| `--bf16` | Enable bfloat16 mixed-precision training | `True` |
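
Switching the pretraining strategy only requires changing `--task`. As an illustration, a multi-view contrastive run would look like the sketch below; it mirrors the example above with only `--task` and `--run_name` changed, and is an untested variant rather than a recipe shipped with the repo:

```bash
# Illustrative MVCL pretraining variant (flag values are assumptions, not tuned settings)
torchrun --nproc_per_node=8 egnn_pretrain.py \
    --model_name_or_path "protap/egnn" \
    --data_path "./data/protein_family_2" \
    --bf16 True \
    --output_dir "./checkpoints/egnn/" \
    --run_name 'egnn-pretrain-mvcl' \
    --residue_prediction False \
    --subseq_length 50 \
    --max_nodes 50 \
    --temperature 0.01 \
    --task 'MVCL' \
    --num_train_epochs 70 \
    --per_device_train_batch_size 48 \
    --per_device_eval_batch_size 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 5000 \
    --save_total_limit 1 \
    --learning_rate 1e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.05 \
    --lr_scheduler_type "constant_with_warmup" \
    --logging_steps 1 \
    --fsdp no_shard
```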
To evaluate pretrained models on various downstream tasks, please download the dataset from Hugging Face and run the corresponding bash script for each task. The dataset path, pretrained weights, and other parameters are customizable. An example of the bash script arguments is shown below:
```bash
torchrun --nproc_per_node=8 --master_port=23333 egnn_protac.py \
    --model_name_or_path './checkpoints/egnn_contrastive.pt' \
    --data_path "./data/protac_2" \
    --bf16 True \
    --output_dir "./checkpoints/egnn/" \
    --run_name 'egnn-protac-cl-0428' \
    --residue_prediction False \
    --subseq_length 50 \
    --max_nodes 50 \
    --temperature 0.01 \
    --num_train_epochs 50 \
    --seed 1024 \
    --load_pretrain True \
    --per_device_train_batch_size 24 \
    --per_device_eval_batch_size 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 5000 \
    --save_total_limit 1 \
    --learning_rate 5e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "constant_with_warmup" \
    --logging_steps 1
```

The table below lists only the parameters that differ from (or do not appear in) the pretraining example; all others behave as described above.

| Parameter | Description | Example Value |
|---|---|---|
| **Distributed & Hardware** | | |
| `--master_port` | TCP port for multi-GPU communication | `23333` |
| **Model & Dataset** | | |
| `--load_pretrain` | Whether to load pretrained weights for fine-tuning | `True` |
| **Training Loop** | | |
| `--learning_rate` | Typically smaller than in pretraining | `5e-4` |
| `--num_train_epochs` | Often fewer epochs than in pretraining | `50` |
| `--per_device_train_batch_size` | Often smaller than in pretraining | `24` |
| **Reproducibility** | | |
| `--seed` | Random seed for reproducibility | `1024` |
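
Before fine-tuning, it can help to confirm what a pretrained checkpoint such as `./checkpoints/egnn_contrastive.pt` actually contains. The snippet below is a hypothetical inspection, assuming the file is a standard `torch.save()` artifact (a plain state dict, or a dict wrapping one); it is not part of the repo's scripts:

```bash
# Hypothetical checkpoint inspection; assumes a standard PyTorch .pt file
python - <<'EOF'
import torch

ckpt = torch.load("./checkpoints/egnn_contrastive.pt", map_location="cpu")
# A plain state dict maps parameter names to tensors; a wrapper dict may
# instead hold keys like "model" or "state_dict" alongside optimizer state.
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])
else:
    print(type(ckpt))
EOF
```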

