This project is the codebase for Protap, a comprehensive benchmark that systematically compares backbone architectures, pretraining strategies, and domain-specific models across diverse and realistic downstream protein applications.
| Model | Input Modalities | Pretrain Data | #Params | Objective | Source |
|---|---|---|---|---|---|
| 🔴 (*) EGNN | AA Seq & 3D Coord | Swiss-Prot 540k | 10M | MLM, MVCL, PFP | ICML 2021 |
| 🔴 (*) SE(3) Transformer | AA Seq & 3D Coord | Swiss-Prot 540k | 4M | MLM, MVCL, PFP | NeurIPS 2020 |
| 🔴 (*) GVP | AA Seq & 3D Coord | Swiss-Prot 540k | 0.2M | MLM, MVCL, PFP | ICLR 2021 |
| 🔴 (*) ProteinBERT | AA Seq | Swiss-Prot 540k | 72M | MLM, MVCL, PFP | Bioinformatics 2022 |
| 🔴 (*) D-Transformer | AA Seq & 3D Coord | Swiss-Prot 540k | 3.5M | MLM, MVCL, PFP | ArXiv 2025, ICLR 2023 |
| 🔵 (#) ESM2 | AA Seq | UR50 70M | 650M | MLM | Science 2023 |
- 🔴 (*) Trained from scratch. For these models, we provide complete training code as well as downstream evaluation scripts.
- 🔵 (#) Uses publicly available pretrained weights.
- AA Seq: amino acid sequence
- 3D Coord: 3D coordinates of protein structures
(I) Masked Language Modeling (MLM) is a self-supervised objective designed to recover masked residues in protein sequences.
(II) Multi-View Contrastive Learning (MVCL) leverages protein structural information by aligning representations of biologically correlated substructures (a standard formulation is sketched below).
(III) Protein Family Prediction (PFP) introduces functional and structural supervision by training models to predict family labels from protein sequences and 3D structures.
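
For concreteness, the MVCL alignment term is typically instantiated as an InfoNCE-style contrastive loss. The following is a sketch assuming that standard form (not necessarily the exact loss used in this repo), where $z_i$ and $z_i'$ are the representations of two biologically correlated substructures of protein $i$, $\mathrm{sim}(\cdot,\cdot)$ is a similarity function such as cosine similarity, $N$ is the batch size, and $\tau$ is the temperature exposed as `--temperature` in the pretraining scripts:

$$
\mathcal{L}_{\mathrm{MVCL}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\bigl(\mathrm{sim}(z_i, z_i') / \tau\bigr)}{\sum_{j=1}^{N} \exp\bigl(\mathrm{sim}(z_i, z_j') / \tau\bigr)}
$$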
| Model | Input Modalities | Pretrain Data | #Params | Objective | Source | GitHub |
|---|---|---|---|---|---|---|
| 🟤 ($) ClipZyme | AA Seq & 3D Coord & SMILES | — | 14.8M | PFS | ICML 2024 | |
| 🟤 ($) UniZyme | AA Seq & 3D Coord | Swiss-Prot 11k | 15.5M | PFS | ArXiv 2025 | |
| 🟤 ($) DeepProtacs | AA Seq & 3D Coord & SMILES | — | 0.1M | PROTACs | Nat. Comm 2022 | |
| 🟤 ($) ET-Protacs | AA Seq & 3D Coord & SMILES | — | 5.4M | PROTACs | Brief Bioinf 2025 | |
| 🟤 ($) KDBNet | AA Seq & 3D Coord & SMILES | — | 3.4M | PLI | Nat. Mach. Intell 2023 | |
| 🟤 ($) MONN | AA Seq & 3D Coord | — | 1.7M | PLI | Cell Systems 2024 | |
| 🟤 ($) DeepFRI | AA Seq & 3D Coord | Pfam 10M | 1.8M | AFP | Nat. Comm 2021 | |
| 🟤 ($) DPFunc | AA Seq & 3D Coord & Domain | — | 110M | AFP | Nat. Comm 2025 | |
- 🟤 ($) Domain-specific models tailored for specific biological tasks. For these models, we provide GitHub links.
- PFS: Enzyme-Catalyzed Protein Cleavage Site Prediction
- PROTACs: Targeted Protein Degradation
- PLI: Protein–Ligand Interactions
- AFP: Protein Function Annotation Prediction
```bash
conda create -n protap python=3.12
conda activate protap
pip install -r requirements.txt
```
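
A quick, optional sanity check that the environment is usable (this assumes `requirements.txt` installs a CUDA-enabled PyTorch build, which the `torchrun`/`--bf16` examples below rely on):

```bash
# Hypothetical check, not part of the repo: confirm PyTorch imports and sees a GPU
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```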
To pretrain from scratch on the Swiss-Prot 540k dataset, simply execute the corresponding bash script for each model. The pretraining strategy and other parameters are customizable. An example of the bash script arguments is shown below:
```bash
torchrun --nproc_per_node=8 egnn_pretrain.py \
    --model_name_or_path "protap/egnn" \
    --data_path "./data/protein_family_2" \
    --bf16 True \
    --output_dir "./checkpoints/egnn/" \
    --run_name 'egnn-pretrain-family-0419' \
    --residue_prediction False \
    --subseq_length 50 \
    --max_nodes 50 \
    --temperature 0.01 \
    --task 'family_prediction' \
    --num_train_epochs 70 \
    --per_device_train_batch_size 48 \
    --per_device_eval_batch_size 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 5000 \
    --save_total_limit 1 \
    --learning_rate 1e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.05 \
    --lr_scheduler_type "constant_with_warmup" \
    --logging_steps 1 \
    --fsdp no_shard
```

| Parameter | Description | Example Value |
|---|---|---|
| **Distributed & Hardware** | | |
| `--nproc_per_node` | Number of GPUs per node for `torchrun` | `8` |
| `--fsdp` | FSDP sharding strategy: `no_shard`, `full_shard`, or `auto_wrap` | `no_shard` |
| **Model & Dataset** | | |
| `--model_name_or_path` | Hugging Face model name or local pretrained checkpoint | `"protap/egnn"` |
| `--data_path` | Path to the pretraining dataset | `"./data/protein_family_2"` |
| `--residue_prediction` | Enable residue-level prediction (e.g., MLM or site prediction) | `False` |
| `--task` | Pretraining task: `family_prediction`, `residue_prediction`, or `MVCL` | `'family_prediction'` |
| **Sequence & Graph Settings** | | |
| `--subseq_length` | Maximum subsequence length (number of residues per sample) | `50` |
| `--max_nodes` | Maximum number of graph nodes (residues) | `50` |
| `--temperature` | Temperature for contrastive/probabilistic losses | `0.01` |
| **Training Loop** | | |
| `--num_train_epochs` | Total number of training epochs | `70` |
| `--per_device_train_batch_size` | Training batch size per GPU | `48` |
| `--per_device_eval_batch_size` | Evaluation batch size per GPU | `4` |
| `--learning_rate` | Initial learning rate | `1e-3` |
| `--weight_decay` | Weight decay (L2 regularization) | `0.` |
| `--warmup_ratio` | Fraction of steps used for LR warmup | `0.05` |
| `--lr_scheduler_type` | LR schedule: `constant`, `linear`, `cosine`, `constant_with_warmup` | `"constant_with_warmup"` |
| **Evaluation & Logging** | | |
| `--evaluation_strategy` | When to run evaluation: `no`, `steps`, `epoch` | `"no"` |
| `--logging_steps` | Logging interval in steps | `1` |
| **Checkpointing** | | |
| `--output_dir` | Directory to save checkpoints and weights | `"./checkpoints/egnn/"` |
| `--run_name` | Run name for wandb logging | `'egnn-pretrain-family-0419'` |
| `--save_strategy` | When to save checkpoints: `steps` or `epoch` | `"steps"` |
| `--save_steps` | Save a checkpoint every N steps | `5000` |
| `--save_total_limit` | Maximum number of checkpoints to keep (older ones are deleted) | `1` |
| **Precision & Performance** | | |
| `--bf16` | Enable bfloat16 mixed-precision training | `True` |
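
Switching the pretraining strategy only requires changing `--task`. As an illustration, a multi-view contrastive run would look like the sketch below; it mirrors the example above with only `--task` and `--run_name` changed, and is an untested variant rather than a recipe shipped with the repo:

```bash
# Illustrative MVCL pretraining variant (flag values are assumptions, not tuned settings)
torchrun --nproc_per_node=8 egnn_pretrain.py \
    --model_name_or_path "protap/egnn" \
    --data_path "./data/protein_family_2" \
    --bf16 True \
    --output_dir "./checkpoints/egnn/" \
    --run_name 'egnn-pretrain-mvcl' \
    --residue_prediction False \
    --subseq_length 50 \
    --max_nodes 50 \
    --temperature 0.01 \
    --task 'MVCL' \
    --num_train_epochs 70 \
    --per_device_train_batch_size 48 \
    --per_device_eval_batch_size 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 5000 \
    --save_total_limit 1 \
    --learning_rate 1e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.05 \
    --lr_scheduler_type "constant_with_warmup" \
    --logging_steps 1 \
    --fsdp no_shard
```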
To evaluate pretrained models on various downstream tasks, please download the dataset from Hugging Face and run the corresponding bash script for each task. The dataset path, pretrained weights, and other parameters are customizable. An example of the bash script arguments is shown below:
```bash
torchrun --nproc_per_node=8 --master_port=23333 egnn_protac.py \
    --model_name_or_path './checkpoints/egnn_contrastive.pt' \
    --data_path "./data/protac_2" \
    --bf16 True \
    --output_dir "./checkpoints/egnn/" \
    --run_name 'egnn-protac-cl-0428' \
    --residue_prediction False \
    --subseq_length 50 \
    --max_nodes 50 \
    --temperature 0.01 \
    --num_train_epochs 50 \
    --seed 1024 \
    --load_pretrain True \
    --per_device_train_batch_size 24 \
    --per_device_eval_batch_size 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 5000 \
    --save_total_limit 1 \
    --learning_rate 5e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "constant_with_warmup" \
    --logging_steps 1
```

The table below lists only the parameters that differ from (or do not appear in) the pretraining example; all others behave as described above.

| Parameter | Description | Example Value |
|---|---|---|
| **Distributed & Hardware** | | |
| `--master_port` | TCP port for multi-GPU communication | `23333` |
| **Model & Dataset** | | |
| `--load_pretrain` | Whether to load pretrained weights for fine-tuning | `True` |
| **Training Loop** | | |
| `--learning_rate` | Typically smaller than in pretraining | `5e-4` |
| `--num_train_epochs` | Often fewer epochs than in pretraining | `50` |
| `--per_device_train_batch_size` | Often smaller than in pretraining | `24` |
| **Reproducibility** | | |
| `--seed` | Random seed for reproducibility | `1024` |
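
Before fine-tuning, it can help to confirm what a pretrained checkpoint such as `./checkpoints/egnn_contrastive.pt` actually contains. The snippet below is a hypothetical inspection, assuming the file is a standard `torch.save()` artifact (a plain state dict, or a dict wrapping one); it is not part of the repo's scripts:

```bash
# Hypothetical checkpoint inspection; assumes a standard PyTorch .pt file
python - <<'EOF'
import torch

ckpt = torch.load("./checkpoints/egnn_contrastive.pt", map_location="cpu")
# A plain state dict maps parameter names to tensors; a wrapper dict may
# instead hold keys like "model" or "state_dict" alongside optimizer state.
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])
else:
    print(type(ckpt))
EOF
```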

