Molecular Property Prediction using Pretrained-BERT and Bayesian Active Learning: A Data-Efficient Approach to Drug Design

Arslan-Masood/Active-learning-with-BERT

Active Learning with BERT for Molecular Property Prediction

This repository contains the implementation of our paper "Molecular Property Prediction using Pretrained-BERT and Bayesian Active Learning: A Data-Efficient Approach to Drug Design".

Figure 1: Overview of our Active Learning with BERT framework for molecular property prediction.

Overview

Our framework achieves efficient molecular property prediction by:

  • Leveraging pretrained BERT representations in an active learning framework
  • Using Bayesian acquisition functions (BALD, EPIG) for active learning
  • Demonstrating effectiveness on toxicity and ADME property prediction
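As a rough illustration of the acquisition functions named above, the sketch below scores pool molecules with BALD and EPIG for a single binary endpoint. It is not the repository's implementation: the function names (`bald_score`, `epig_score`, `select_top`) are hypothetical, and it assumes each molecule already has K predicted probabilities, one per posterior sample (for example, K stochastic forward passes of a classifier on top of the BERT features).

```python
import math


def binary_entropy(p):
    """Entropy in nats of a Bernoulli(p) distribution."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def bald_score(probs):
    """BALD for one candidate: H[mean_k p_k] - mean_k H[p_k].

    probs holds P(y=1 | x, theta_k) for K posterior samples theta_k.
    """
    mean_p = sum(probs) / len(probs)
    expected_entropy = sum(binary_entropy(p) for p in probs) / len(probs)
    return binary_entropy(mean_p) - expected_entropy


def epig_score(cand_probs, target_probs):
    """EPIG for one candidate: mutual information between the candidate
    label y and the label y* at a target input, averaged over targets.

    cand_probs: K values of P(y=1 | x, theta_k).
    target_probs: M lists, each with K values of P(y*=1 | x*_m, theta_k),
    using the same posterior samples in the same order.
    """
    k = len(cand_probs)
    p_y = sum(cand_probs) / k
    total = 0.0
    for t in target_probs:
        p_t = sum(t) / k
        mi = 0.0
        for y in (0, 1):
            for y_star in (0, 1):
                # Joint predictive: average over samples of the product.
                joint = sum(
                    (p if y else 1.0 - p) * (q if y_star else 1.0 - q)
                    for p, q in zip(cand_probs, t)
                ) / k
                # Product of the two marginal predictives.
                marg = (p_y if y else 1.0 - p_y) * (p_t if y_star else 1.0 - p_t)
                if joint > 0.0:
                    mi += joint * math.log(joint / marg)
        total += mi
    return total / len(target_probs)


def select_top(scores, batch_size):
    """Indices of the batch_size highest-scoring pool molecules."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:batch_size]
```

BALD is highest where the posterior samples disagree with each other. EPIG rewards a candidate only when its label is informative about predictions at the target inputs, so the choice of target set matters.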

Installation

  1. Clone the repository:
git clone https://github.com/Arslan-Masood/Active-learning-with-BERT.git
cd Active-learning-with-BERT
  2. Create and activate a conda environment:
conda create -y -q -n ActiveBERT python=3.9.10
conda activate ActiveBERT
  3. Install the required dependencies:
pip install -r requirements.txt

Data

Datasets

We use the following benchmark datasets:

  1. Toxicity Datasets:

    • Tox21: Toxicity prediction dataset with 12 different toxicity endpoints
    • ClinTox: Clinical toxicity dataset focusing on drug safety
  2. ADME Datasets:

    • Two classification datasets from the TDC-ADME benchmark:
      • PAMPA Permeability, NCATS
      • Pgp (P-glycoprotein) Inhibition, Broccatelli et al.

Download Instructions

  1. Download the complete datasets_for_active_learning folder from Figshare
  2. This folder contains:
    • Raw molecular data
    • Precomputed BERT features (using MolBERT)
    • Computed Morgan fingerprints
  3. Place the downloaded data in the datasets directory

Usage

Running Experiments

Tox21 Dataset

With BERT Features:

sbatch scripts/Active_learning_Tox21.sh configs/Tox21/BERT/Tox21_BERT.json

With Morgan Fingerprints (ECFP):

sbatch scripts/Active_learning_Tox21.sh configs/Tox21/MF/Tox21_MF.json

ClinTox Dataset

With BERT Features:

sbatch scripts/Active_learning.sh configs/clintox/MolBERT_features/ClinTox_BALD.json

With Morgan Fingerprints (ECFP):

sbatch scripts/Active_learning.sh configs/clintox/Morg_FP_features/ClinTox_BALD.json

ADME Properties

With BERT and ECFP Features:

sbatch /scripts/Ative_learning_ADME.sh /scripts/configs/ADME/ADME.json
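To review all runs before queueing them, a small dry-run helper can print the `sbatch` commands listed above. This is a convenience sketch, not part of the repository: `RUNS` and `build_sbatch_commands` are hypothetical names, and the script and config paths are copied verbatim from the commands in this section.

```python
import shlex

# (script, config) pairs copied verbatim from the commands in this README.
RUNS = [
    ("scripts/Active_learning_Tox21.sh", "configs/Tox21/BERT/Tox21_BERT.json"),
    ("scripts/Active_learning_Tox21.sh", "configs/Tox21/MF/Tox21_MF.json"),
    ("scripts/Active_learning.sh", "configs/clintox/MolBERT_features/ClinTox_BALD.json"),
    ("scripts/Active_learning.sh", "configs/clintox/Morg_FP_features/ClinTox_BALD.json"),
    ("/scripts/Ative_learning_ADME.sh", "/scripts/configs/ADME/ADME.json"),
]


def build_sbatch_commands(runs):
    """Return one argument list per run, in the form sbatch <script> <config>."""
    return [["sbatch", script, config] for script, config in runs]


if __name__ == "__main__":
    # Dry run: print the commands only. To submit, pass each argument
    # list to subprocess.run on a machine where Slurm is available.
    for cmd in build_sbatch_commands(RUNS):
        print(shlex.join(cmd))
```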

Citation

If you use this code in your research, please cite:

@article{masood2025molecular,
    title={Molecular property prediction using pretrained-BERT and Bayesian active learning: a data-efficient approach to drug design},
    author={Masood, Muhammad Arslan and Kaski, Samuel and Cui, Tianyu},
    journal={Journal of Cheminformatics},
    volume={17},
    number={58},
    year={2025},
    doi={10.1186/s13321-025-00986-6},
    url={https://doi.org/10.1186/s13321-025-00986-6}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact
