TamGen

TamGen: Target-aware Molecule Generation for Drug Design Using a Chemical Language Model

Note: This repository will no longer be updated. For the latest updates and developments, please visit the official repository at https://github.com/microsoft/tamgen.

Introduction

This is the implementation of the paper TamGen: Target-aware Molecule Generation for Drug Design Using a Chemical Language Model.

Our implementation is built on fairseq-v0.8.0.

Installation

```
conda create -n TamGen python=3.9
conda activate TamGen

bash setup_env.sh
```

Dataset

Build training data for CrossDocked dataset

Please refer to the README in the data folder.

Build customized dataset

You can build your customized dataset through the following methods:

1. Build a customized dataset from PDB IDs, using the center coordinates of the binding site of each PDB entry.
```
python scripts/build_data/prepare_pdb_ids_center.py ${PDB_ID_LIST} ${DATASET_NAME} -o ${OUTPUT_PATH} -t ${THRESHOLD}
```
- PDB_ID_LIST format: CSV format with the following columns: pdb_id,center_x,center_y,center_z,[uniprot_id]. [uniprot_id] is optional.
- DATASET_NAME: A name of your choice. The simplest option is to set it to test.
- OUTPUT_PATH: The output path of the processed data.
- THRESHOLD: The radius of the pocket region whose center is center_x,center_y,center_z.
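
To illustrate the PDB_ID_LIST format described above, here is a minimal sketch that produces such a CSV with Python's standard csv module. The function name, the PDB ID, and the coordinates are placeholders invented for this example, and writing a header row is an assumption; customized_example shows the layout the scripts actually expect.

```python
import csv
import io

# Column order follows the PDB_ID_LIST description; uniprot_id is optional.
FIELDS = ["pdb_id", "center_x", "center_y", "center_z", "uniprot_id"]


def format_pdb_id_list(rows):
    """Render binding-site centers as CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, restval="", lineterminator="\n"
    )
    writer.writeheader()   # assumption: the scripts accept a header row
    writer.writerows(rows)  # missing optional fields are written as empty
    return buffer.getvalue()


# Placeholder entry: the ID and coordinates are illustrative only.
example_rows = [
    {"pdb_id": "1abc", "center_x": 10.5, "center_y": -3.2, "center_z": 7.0},
]
csv_text = format_pdb_id_list(example_rows)
print(csv_text)
```

The returned text can be saved to a file and that file's path passed as ${PDB_ID_LIST}.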
2. Build a customized dataset from PDB IDs. The script automatically finds the binding sites according to the ligands in the structure file.
```
python scripts/build_data/prepare_pdb_ids.py ${PDB_ID_LIST} ${DATASET_NAME} -o ${OUTPUT_PATH} -t ${THRESHOLD}
```
- PDB_ID_LIST format: CSV format with columns pdb_id,[ligand_inchi,uniprot_id], where [] means optional.
- THRESHOLD: A residue $r$ is considered part of the pocket region if any atom in $r$ lies within THRESHOLD angstroms of a ligand atom. For a given pdb_id, its associated ligands can be found in database/PdbCCD.
- The remaining parameters are the same as those in method 1.
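
The THRESHOLD rule above can be sketched in a few lines of plain Python. This is not the code used by prepare_pdb_ids.py. The function name and the toy coordinates are invented for illustration, and treating "within" as an inclusive bound is an assumption.

```python
import math


def pocket_residues(protein_atoms, ligand_atoms, threshold):
    """Return sorted IDs of residues with at least one atom within
    `threshold` angstroms of any ligand atom.

    protein_atoms: iterable of (residue_id, (x, y, z)) tuples
    ligand_atoms:  iterable of (x, y, z) tuples
    """
    ligand_atoms = list(ligand_atoms)
    selected = set()
    for residue_id, coord in protein_atoms:
        if residue_id in selected:
            continue  # one close atom is enough to include the residue
        # Assumption: the distance bound is inclusive.
        if any(math.dist(coord, lig) <= threshold for lig in ligand_atoms):
            selected.add(residue_id)
    return sorted(selected)


# Toy coordinates (angstroms), not taken from any real structure.
protein = [
    (1, (0.0, 0.0, 0.0)),
    (1, (1.5, 0.0, 0.0)),
    (2, (20.0, 0.0, 0.0)),
    (3, (8.0, 0.0, 0.0)),
]
ligand = [(4.0, 0.0, 0.0)]
print(pocket_residues(protein, ligand, threshold=5.0))  # prints [1, 3]
```

A larger THRESHOLD selects more residues and therefore a larger pocket region.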
3. Build a customized dataset from PDB IDs, using the center coordinates of the binding site of each PDB entry, and add the provided scaffold to each center.
```
python scripts/build_data/prepare_pdb_ids_center_scaffold.py ${PDB_ID_LIST} ${DATASET_NAME} -o ${OUTPUT_PATH} -t ${THRESHOLD} --scaffold-file ${SCAFFOLD_FILE}
```
- SCAFFOLD_FILE: It contains molecular scaffolds that will be incorporated into the processed database. These scaffolds serve as structural templates for subsequent conditional generation of new molecules.
- The remaining parameters are the same as those in method 1.
For customized PDB structure files, put the structure files in the --pdb-path folder and list their filenames in the pdb_id column of the PDB_ID_LIST CSV file.

We provide an example of how to build and use customized data in customized_example.

Model

The checkpoints can be found at https://doi.org/10.5281/zenodo.13751391. Please download checkpoints.zip and gpt_model.zip and uncompress them. You will get two folders, checkpoints and gpt_model. Place both under the folder TamGen/. The structures of the two folders are shown below:

```
checkpoints/
├── README.MD
├── crossdock_pdb_A10
│   └── checkpoint_best.pt
└── crossdocked_model
    └── checkpoint_best.pt

gpt_model/
├── checkpoint_best.pt
└── dict.txt
```
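
A quick way to confirm that the downloaded folders ended up in the right place is to check for the files listed above. The following sketch is not part of the repository. The helper name is hypothetical, and it assumes it is run from the TamGen/ folder.

```python
from pathlib import Path

# Files taken from the folder layout shown above.
EXPECTED_FILES = [
    "checkpoints/crossdock_pdb_A10/checkpoint_best.pt",
    "checkpoints/crossdocked_model/checkpoint_best.pt",
    "gpt_model/checkpoint_best.pt",
    "gpt_model/dict.txt",
]


def missing_files(root):
    """Return the expected files that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


# Assumption: the current working directory is the TamGen/ folder.
missing = missing_files(".")
if missing:
    print("Missing files:")
    for rel in missing:
        print("  " + rel)
else:
    print("All checkpoint files are in place.")
```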

Run scripts

Training

```
# train a new model
bash scripts/train.sh -D ${DATA_PATH} --savedir ${SAVED_MODEL_PATH}
```

For example, one can run bash scripts/train.sh -D data/crossdocked/bin/ --savedir crossdock_train --fp16 to train models.

Inference

See scripts/generate.sh for how to run inference.

An example can be run with bash scripts/example_inference.sh.

Demo

We provide a demo at interactive_decode.ipynb

The first cell of the demo is:

```
from TamGen_Demo import TamGenDemo, prepare_pdb_data
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

worker = TamGenDemo(
    data="./TamGen_Demo_Data",
    ckpt="checkpoints/crossdock_pdb_A10/checkpoint_best.pt"
)
```

Before running it:

- Specify the GPU id.
- Download the checkpoint and place it at "checkpoints/crossdock_pdb_A10/checkpoint_best.pt" or at a location of your choice.
- Download the pre-trained GPT model and put it into the folder gpt_model.
