This repository contains the implementation of our paper "Molecular Property Prediction using Pretrained-BERT and Bayesian Active Learning: A Data-Efficient Approach to Drug Design".
*Figure 1: Overview of our Active Learning with BERT framework for molecular property prediction.*

Our framework achieves efficient molecular property prediction by:
- Leveraging pretrained BERT representations in an active learning framework
- Using Bayesian acquisition functions (BALD, EPIG) for active learning
- Demonstrating effectiveness on toxicity and ADME property prediction
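
To make the BALD acquisition function mentioned above concrete, here is a minimal sketch for binary classification. It is not taken from this repository: the function names (`binary_entropy`, `bald_scores`, `select_batch`) are hypothetical, and it assumes you already have predicted positive-class probabilities from several posterior samples of the model (how those samples are drawn is not specified here). BALD scores each candidate as the entropy of the mean prediction minus the mean entropy of the individual predictions, so it is high where posterior samples disagree.

```python
import math


def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli distribution with success probability p."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def bald_scores(probs):
    """Return one BALD score per candidate.

    probs[s][i] is the predicted probability of the positive class for
    candidate i under posterior sample s (assumed input layout).
    """
    n_samples = len(probs)
    n_candidates = len(probs[0])
    scores = []
    for i in range(n_candidates):
        column = [probs[s][i] for s in range(n_samples)]
        mean_p = sum(column) / n_samples
        # Entropy of the averaged prediction (total uncertainty).
        predictive_entropy = binary_entropy(mean_p)
        # Average entropy of each sample's prediction (data uncertainty).
        expected_entropy = sum(binary_entropy(p) for p in column) / n_samples
        scores.append(predictive_entropy - expected_entropy)
    return scores


def select_batch(probs, batch_size):
    """Indices of the batch_size candidates with the highest BALD score."""
    scores = bald_scores(probs)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:batch_size]


# Toy example: two posterior samples, three candidate molecules.
example_probs = [
    [0.05, 0.5, 0.99],
    [0.95, 0.5, 0.99],
]
example_pick = select_batch(example_probs, batch_size=1)
```

In the toy example, the first candidate is the only one on which the two posterior samples disagree, so it receives the highest score. The second candidate is uncertain under both samples (probability 0.5) but scores zero, which illustrates that BALD targets model disagreement rather than label noise.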
- Clone the repository:
  git clone https://github.com/Arslan-Masood/Active-learning-with-BERT.git
  cd Active-learning-with-BERT
- Create and activate a conda environment:
  conda create -y -q -n ActiveBERT python=3.9.10
  conda activate ActiveBERT
- Install the required dependencies:
  pip install -r requirements.txt

We use three benchmark datasets:
- Toxicity Datasets:
  - Tox21: Toxicity prediction dataset with 12 different toxicity endpoints
  - ClinTox: Clinical toxicity dataset focusing on drug safety
- ADME Dataset:
  - 2 classification datasets from TDC-ADME benchmark:
    - PAMPA Permeability, NCATS
    - Pgp (P-glycoprotein) Inhibition, Broccatelli et al.
- Download the complete datasets_for_active_learning folder from Figshare
- This folder contains:
  - Raw molecular data
  - Precomputed BERT features (using MolBERT)
  - Computed Morgan fingerprints
- Place the downloaded data in the datasets directory
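
Before submitting jobs, it can help to confirm that the download landed where expected. The following is a small sketch using only the Python standard library. The helper name `summarize_data_dir` is hypothetical, and the exact layout inside the folder is not assumed: the function simply counts files under each top-level entry of whatever directory you pass in.

```python
from pathlib import Path


def summarize_data_dir(data_dir):
    """Map each top-level entry of data_dir to the number of files it holds.

    Subfolders are counted recursively; a plain file counts as 1.
    Raises FileNotFoundError if data_dir is not an existing directory.
    """
    root = Path(data_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Data directory not found: {root}")
    summary = {}
    for entry in sorted(root.iterdir()):
        if entry.is_dir():
            summary[entry.name] = sum(1 for p in entry.rglob("*") if p.is_file())
        else:
            summary[entry.name] = 1
    return summary


# Example usage (assumes the data was placed in ./datasets):
# for name, n_files in summarize_data_dir("datasets").items():
#     print(f"{name}: {n_files} file(s)")
```

An empty result, or a `FileNotFoundError`, suggests the folder was placed somewhere other than the path you passed.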
With BERT Features:
sbatch scripts/Active_learning_Tox21.sh configs/Tox21/BERT/Tox21_BERT.json

With Morgan Fingerprints (ECFP):
sbatch scripts/Active_learning_Tox21.sh configs/Tox21/MF/Tox21_MF.json

With BERT Features:
sbatch scripts/Active_learning.sh configs/clintox/MolBERT_features/ClinTox_BALD.json

With Morgan Fingerprints (ECFP):
sbatch scripts/Active_learning.sh configs/clintox/Morg_FP_features/ClinTox_BALD.json

With BERT and ECFP Features:
sbatch /scripts/Ative_learning_ADME.sh /scripts/configs/ADME/ADME.json

If you use this code in your research, please cite:
@article{masood2025molecular,
title={Molecular property prediction using pretrained-BERT and Bayesian active learning: a data-efficient approach to drug design},
author={Masood, Muhammad Arslan and Kaski, Samuel and Cui, Tianyu},
journal={Journal of Cheminformatics},
volume={17},
number={58},
year={2025},
doi={10.1186/s13321-025-00986-6},
url={https://doi.org/10.1186/s13321-025-00986-6}
}

This project is licensed under the MIT License - see the LICENSE file for details.
- Muhammad Arslan Masood
- Email: [email protected]
- Institution: Aalto University
