FoldBench is a low-homology benchmark spanning proteins, nucleic acids, ligands, and six major interaction types, enabling assessments that were previously infeasible with task-specific datasets.
- 2025-12-31: The evaluation results for RosettaFold3 (latest) have been updated.
- 2025-12-05: The evaluation results for Boltz-2 and OpenFold3-preview have been updated.
- 2025-12-04: FoldBench has been published in Nature Communications.
The FoldBench benchmark targets are open source. The dataset, located in the `targets` directory, is organized into two primary collections: interaction interfaces and monomers.
- Protein-Protein: 279 interfaces
- Antibody-Antigen: 172 interfaces
- Protein-Ligand: 558 interfaces
- Protein-Peptide: 51 interfaces
- Protein-RNA: 70 interfaces
- Protein-DNA: 330 interfaces
- Protein Monomers: 330 structures
- RNA Monomers: 15 structures
- DNA Monomers: 14 structures
Evaluation Metrics: Interface prediction tasks are evaluated by success rate, while monomer prediction tasks use LDDT (Local Distance Difference Test) scores. All results are based on comprehensive evaluations across our low-homology benchmark dataset.
| Model | Protein-Protein | Antibody-Antigen | Protein-Ligand |
|---|---|---|---|
| AlphaFold 3 | 72.93% | 47.90% | 64.90% |
| Boltz-1 | 68.25% | 33.54% | 55.04% |
| Chai-1 | 68.53% | 23.64% | 51.23% |
| HelixFold 3 | 66.27% | 28.40% | 51.82% |
| Protenix | 68.18% | 34.13% | 50.70% |
| OpenFold 3 (preview) | 69.96% | 28.83% | 44.49% |
| Model | Protein-RNA | Protein-DNA | RNA Monomer | DNA Monomer |
|---|---|---|---|---|
| AlphaFold 3 | 62.32% | 79.18% | 0.61 | 0.53 |
| Boltz-1 | 56.90% | 70.97% | 0.44 | 0.34 |
| Chai-1 | 50.91% | 69.97% | 0.49 | 0.46 |
| HelixFold 3 | 48.28% | 50.00% | 0.55 | 0.29 |
| Protenix | 44.78% | 68.39% | 0.59 | 0.44 |
| OpenFold 3 (preview) | 18.84% | 5.88% | 0.63 | 0.51 |
| Model | Protein-Protein | Antibody-Antigen | Protein-Ligand |
|---|---|---|---|
| AlphaFold 3 | 70.87% | 47.95% | 67.59% |
| Boltz-1 | 64.10% | 31.43% | 51.33% |
| Chai-1 | 66.95% | 18.31% | 49.28% |
| HelixFold 3 | 66.67% | 28.17% | 50.68% |
| Protenix | 64.80% | 38.36% | 53.25% |
| OpenFold 3 (preview) | 68.22% | 34.29% | 40.85% |
| Boltz-2* | 70.54% | 25.00% | 53.90% |
| RosettaFold3* | 72.44% | 37.50% | 57.28% |
| Model | Protein-RNA | Protein-DNA |
|---|---|---|
| AlphaFold 3 | 72.50% | 80.45% |
| Boltz-1 | 70.00% | 69.77% |
| Chai-1 | 55.56% | 69.14% |
| HelixFold 3 | 54.29% | 61.18% |
| Protenix | 56.41% | 67.63% |
| OpenFold 3 (preview) | 25.00% | 5.81% |
| Boltz-2* | 76.92% | 73.84% |
| RosettaFold3*^ | - | 66.07% |
*Models marked with * have a training cutoff later than FoldBench's reference date (2023-01-13). FoldBench targets are constructed to be low-homology only with respect to PDB entries released before 2023-01-13, so models trained on later data may have seen these targets or their close homologs during training (potential data leakage), compromising the low-homology evaluation condition. Results for these models are provided for reference only and should not be compared directly with models that respect the cutoff.
**Nucleic acid monomer results are omitted due to insufficient target availability.
^Results are not shown because too few targets remained after errors during the inference or evaluation stages.
Note:
- Interface prediction is evaluated by success rate.
- Monomer prediction is evaluated by LDDT.
- Success is defined as:
  - For protein-ligand interfaces: LRMSD < 2 Å and LDDT-PLI > 0.8
  - For all other interfaces: DockQ ≥ 0.23 (see the code sketch after these notes)
- We developed an algorithm to identify and prevent overfitting of models on FoldBench, ensuring fair and reliable evaluation.
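For concreteness, the success criteria above can be expressed in a few lines of Python. This is a minimal sketch: the function names and score keys (`lrmsd`, `lddt_pli`, `dockq`) are illustrative assumptions, not FoldBench's actual API; see `evaluate.py` and `task_score_summary.py` for the real implementation.

```python
# Minimal sketch of the success criteria described above.
# Assumption: per-target scores are available as plain dicts; key names are illustrative.

def is_success(interface_type: str, scores: dict) -> bool:
    """Return True if one predicted interface meets the success criteria."""
    if interface_type == "protein_ligand":
        # Protein-ligand: ligand RMSD below 2 Angstroms AND LDDT-PLI above 0.8
        return scores["lrmsd"] < 2.0 and scores["lddt_pli"] > 0.8
    # All other interface types: DockQ at or above the acceptable-quality cutoff
    return scores["dockq"] >= 0.23

def success_rate(interface_type: str, per_target_scores: list[dict]) -> float:
    """Success rate = fraction of evaluated targets whose prediction succeeds."""
    hits = sum(is_success(interface_type, s) for s in per_target_scores)
    return hits / len(per_target_scores)
```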
To get started with FoldBench, clone the repository and set up the Conda environment.
```bash
# 1. Clone the repository
git clone https://github.com/BEAM-Labs/FoldBench.git
cd FoldBench

# 2. Create and activate the Conda environment for evaluation
conda env create -f environment.yml
conda activate foldbench
```

You can use our provided evaluation samples to reproduce the evaluation workflow. The final results will be generated in `examples/summary_table.csv`.
```bash
# Ensure you are in the FoldBench root directory and the conda environment is active

# Step 1: Calculate per-target scores from prediction files
# This uses OpenStructure (ost) and DockQ to score each prediction against its ground truth
python evaluate.py \
    --targets_dir ./examples/targets \
    --evaluation_dir ./examples/outputs/evaluation \
    --algorithm_name Protenix \
    --ground_truth_dir ./examples/ground_truths

# Step 2: Aggregate scores and calculate the final success rates/LDDT
# This summarizes the results for specified models and tasks into a final table
python task_score_summary.py \
    --evaluation_dir ./examples/outputs/evaluation \
    --target_dir ./examples/targets \
    --output_path ./examples/summary_table.csv \
    --algorithm_names Protenix \
    --targets interface_protein_ligand interface_protein_dna monomer_protein \
    --metric_type rank
```

To evaluate more structures in FoldBench, you'll need to follow these steps:
- Edit the target CSV files: Modify the CSV files located in the `examples/targets` directory. These files should contain information about the structures you want to evaluate.
- Download ground truth CIF files: A package containing the specific original CIF files referenced during the benchmark's creation is available for download here: FoldBench Referenced CIFs. Save these files in the `examples/ground_truths` directory, and ensure the filenames correspond to your data in the CSV files.
- Modify `prediction_reference.csv`: After preparing your data, adjust the `./outputs/evaluation/{algorithm_name}/prediction_reference.csv` file to specify the model's ranking scores and the paths to the predicted structures (a hypothetical layout is sketched below). Please refer to the guide: Integrating a New Model into FoldBench.
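To illustrate the last step, the sketch below writes a minimal `prediction_reference.csv`. The column names (`target_id`, `ranking_score`, `prediction_path`), the model name `my_model`, and the target identifier are assumptions made for illustration only; consult Integrating a New Model into FoldBench for the authoritative schema.

```python
# Illustrative sketch only: the column names and target identifier below are
# assumptions, not FoldBench's confirmed schema. See the contributor's guide
# ("Integrating a New Model into FoldBench") for the real format.
import csv
import os

rows = [
    {
        "target_id": "example_target",  # hypothetical target identifier
        "ranking_score": 0.87,          # the model's own confidence/ranking score
        "prediction_path": "./outputs/prediction/my_model/example_target.cif",
    },
]

os.makedirs("./outputs/evaluation/my_model", exist_ok=True)
with open("./outputs/evaluation/my_model/prediction_reference.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["target_id", "ranking_score", "prediction_path"])
    writer.writeheader()
    writer.writerows(rows)
```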
We enthusiastically welcome community submissions!
You can submit your algorithm for us to run the tests.
For detailed instructions on how to package your model for submission, please see the contributor's guide: Integrating a New Model into FoldBench.
The FoldBench repository is organized to separate benchmark data, evaluation code, and evaluation samples.
```
FoldBench/
├── targets/                      # FoldBench targets csv files
│   ├── interface_antibody_antigen.csv
│   └── ...
├── algorithms/
│   ├── algorithm_name/           # Custom model's code and definition files go here
│   └── ...
├── examples/
│   ├── outputs/
│   │   ├── input/                # Preprocessed inputs for each algorithm
│   │   │   └── algorithm_name/
│   │   ├── prediction/           # Model predictions (e.g., .cif files)
│   │   │   └── algorithm_name/
│   │   └── evaluation/           # Final scores and summaries
│   │       └── algorithm_name/
│   ├── targets/                  # Target definitions
│   ├── ground_truths/            # Ground truth cif files
│   └── alphafold3_inputs.json    # AlphaFold 3 input JSON
├── build_apptainer_images.sh     # Script to build all algorithm containers
├── environment.yml               # Conda environment for evaluation scripts
├── run.sh                        # Master script to run inference and evaluation
├── evaluate.py                   # Prediction evaluation
├── task_score_summary.py         # Benchmark score summary
└── ...
```
We gratefully acknowledge the developers of the projects that FoldBench builds on, including OpenStructure (ost) and DockQ.
This project is licensed under the MIT License - see the LICENSE file for details.
The MIT License is a permissive open source license that allows for commercial and non-commercial use, modification, distribution, and private use of the software, provided that the original copyright notice and license terms are included.
If you use FoldBench in your research, please cite our paper:
```bibtex
@article{xu_benchmarking_2025,
  title   = {Benchmarking all-atom biomolecular structure prediction with {FoldBench}},
  issn    = {2041-1723},
  url     = {https://doi.org/10.1038/s41467-025-67127-3},
  doi     = {10.1038/s41467-025-67127-3},
  journal = {Nature Communications},
  author  = {Xu, Sheng and Feng, Qiantai and Qiao, Lifeng and Wu, Hao and Shen, Tao and Cheng, Yu and Zheng, Shuangjia and Sun, Siqi},
  month   = dec,
  year    = {2025},
}
```

