MIDAS: Language-Controlled Molecular Design and Analysis

Give it a try:

A live demo can be found here.

Conda environment

A conda environment can be created and activated using:

conda env create -f environment.yaml -n midas
conda activate midas

Running the MIDAS Front End

To launch the MIDAS front end from the project root, first export your OpenAI API key:

export OPENAI_API_KEY=YOUR_OPENAI_API_KEY

A required model checkpoint must be downloaded and configured before running MIDAS.

You can download the checkpoint and config files from this Google Drive folder.

After downloading, place the checkpoint and config files in the ./config directory (or update your configuration to point to their location).

Example paths:

CKPT_PATH='./config/chkp_file'
CONFIG_YML='./config/chemt5.yml'

Then run:

python frontend/front_end.py
  • The SMILES string for your target molecule will be entered directly in the frontend user interface.
  • Make sure your conda environment is activated and all dependencies are installed before running.

The front end will automatically use the API key from the `OPENAI_API_KEY` environment variable.
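
For reference, here is a minimal sketch of the kind of startup check this implies. The variable names mirror the example paths above, but the actual logic inside frontend/front_end.py may differ:

import os

# Hypothetical startup check (a sketch, not the actual front-end code).
CKPT_PATH = './config/chkp_file'
CONFIG_YML = './config/chemt5.yml'

api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; export it before launching.")

for path in (CKPT_PATH, CONFIG_YML):
    if not os.path.exists(path):
        raise FileNotFoundError(f"Missing required file: {path}")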

To recreate the dataset generation and model training, follow these instructions:

Pre-trained models

Pre-trained models can be downloaded from Zenodo.

wget -P checkpoints/ https://zenodo.org/record/8183747/files/crossdocked_cond.ckpt

The checkpoint will be stored in the ./checkpoints folder.
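
A quick way to sanity-check the download is to open the checkpoint with PyTorch. This is only an inspection sketch; the key names shown are the usual PyTorch Lightning ones, not anything guaranteed by this repository:

import torch

# Inspect the downloaded checkpoint (PyTorch Lightning .ckpt files are
# ordinary torch pickles; on newer torch versions pass weights_only=False).
ckpt = torch.load("checkpoints/crossdocked_cond.ckpt", map_location="cpu")
print(sorted(ckpt.keys()))                      # e.g. ['epoch', 'state_dict', ...]
print(len(ckpt.get("state_dict", {})), "parameter tensors")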

Instructions for MIDAS

  1. Follow the instructions below to load the CrossDocked dataset.

After downloading the dataset, run:

python process_crossdock.py /PATH/TO/CROSSDOCK/ --outdir /PATH/TO/OUTDIR/crossdocked_pocket10_proc_noH_ca_only/ --no_H --ca_only

The --no_H flag strips hydrogens and --ca_only keeps only the Cα atoms of the pocket residues.
  2. Create the language description dataset
python generate_text_descriptions.py PATH/TO/crossdocked_pocket10   --out_csv /PATH/TO/DATAFILE/descriptions.csv   --openai_api_key YOUR_API_KEY   --openai_model gpt-3.5-turbo --num_procs 32
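
The combine step below expects the resulting CSV to provide at least the name, text_func, and text_llm columns; a quick peek confirms the layout (the path is the placeholder from the command above):

import csv

# Peek at the generated descriptions file.
with open("/PATH/TO/DATAFILE/descriptions.csv") as f:
    reader = csv.DictReader(f)
    print(reader.fieldnames)  # should include 'name', 'text_func', 'text_llm'
    print(next(reader))       # first data row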
  3. Combine the dataset
import csv
import random
import re


def fix_name(name: str) -> str:
    """Normalise a sample name so the pocket prefix matches the ligand stem."""
    # Drop any '#<suffix>' fragments.
    name = re.sub(r"\#\w+", "", name)
    try:
        # Stem of the ligand file, i.e. '<stem>' in '.../<stem>.sdf'.
        second_block = re.search(r"/([^/]+)\.sdf$", name).group(1)
        # Rewrite the first '/<prefix>_pocket' block to use that stem.
        fixed = re.sub(r"/([^/]+)_pocket", f"/{second_block}_pocket", name, count=1)
    except AttributeError:  # no '.sdf' suffix found; leave the name unchanged
        return name
    return fixed

# Input CSVs
SRC_FILES = [
    "/PATH/TO/DATAFILE/descriptions.csv",
]
DST_CSV = "/PATH/TO/OUTPUT/descriptions_selected.csv"

# Possible source columns for text (extend this list if more were added)
USE_COLUMNS = ["text_func", "text_llm"]

rows = []

# Collect rows from all source CSVs
for src in SRC_FILES:
    with open(src) as f_in:
        reader = csv.DictReader(f_in)
        for row in reader:
            # randomly choose one of the available columns
            if row["text_func"] != "No prominent functional groups identified.":
                col = random.choice(USE_COLUMNS)
                rows.append({"name": fix_name(row["name"]), "text": row[col]})
            else:
                rows.append({"name": fix_name(row["name"]), "text": row["text_llm"]})

# Shuffle them
random.shuffle(rows)

# Write to destination
with open(DST_CSV, "w", newline="") as f_out:
    writer = csv.DictWriter(f_out, fieldnames=["name", "text"])
    writer.writeheader()
    writer.writerows(rows)

print(f"Wrote {len(rows)} rows to {DST_CSV}")
  4. Precompute embeddings
python precompute_text_embeddings.py --text_csv /PATH/TO/OUTPUT/descriptions_selected.csv --model_name GT4SD/multitask-text-and-chemistry-t5-base-standard --output /PATH/TO/EMBEDDINGS/text_embeddings.npz --batch_size 32 --device cuda
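
For intuition, here is a minimal sketch of what this step presumably computes, assuming mean-pooled T5 encoder states saved to an .npz file; the actual precompute_text_embeddings.py may batch, pool, and store things differently:

import csv

import numpy as np
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Sketch only: encode each description with the T5 encoder and mean-pool.
model_name = "GT4SD/multitask-text-and-chemistry-t5-base-standard"
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = T5EncoderModel.from_pretrained(model_name).eval()

names, texts = [], []
with open("/PATH/TO/OUTPUT/descriptions_selected.csv") as f:
    for row in csv.DictReader(f):
        names.append(row["name"])
        texts.append(row["text"])

chunks = []
with torch.no_grad():
    for i in range(0, len(texts), 32):  # batch_size 32, as in the command above
        batch = tokenizer(texts[i:i + 32], padding=True, truncation=True,
                          return_tensors="pt")
        hidden = encoder(**batch).last_hidden_state          # (B, T, D)
        mask = batch["attention_mask"].unsqueeze(-1)         # (B, T, 1)
        chunks.append((hidden * mask).sum(1) / mask.sum(1))  # mean over tokens

np.savez("text_embeddings.npz",
         names=np.array(names),
         embeddings=torch.cat(chunks).numpy())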
  5. Train the model (update the respective paths in ckpt.yml and make sure the checkpoints are downloaded; see Pre-trained models above)

python train.py --config configs/ckpt.yml --resume /PATH/TO/CHECKPOINT/ckpt.ckpt

Structure Based Diffusion Model & Model Code:

The training code and model architecture were adapted from DiffSBDD and extended with a text-conditioning module.

Contributors:

Sebastian Pagel, David Alobo, Michael Jirasek

About

MIDAS stands for Molecular Intelligence & Design Articulated by Semantics.
