A live demo can be found here.
A conda environment can be created and activated using:

conda env create -f environment.yaml -n midas
conda activate midas

To launch the MIDAS front end from the project root, first export your OpenAI API key:

export OPENAI_API_KEY=YOUR_OPENAI_API_KEY

A required model checkpoint must be downloaded and configured before running MIDAS.
You can download the checkpoint and config files from this Google Drive folder.
After downloading, place the checkpoint file and config files in the ./config directory (or update your configuration to point to their location).
Example paths:
CKPT_PATH='./config/chkp_file'
CONFIG_YML='./config/chemt5.yml'
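Before launching, you can optionally confirm the files are in place. This is a minimal sketch, not part of the repo; it assumes PyYAML is installed and simply mirrors the example paths above:

```python
# Sanity-check sketch: verify the checkpoint exists and the config parses.
from pathlib import Path
import yaml

CKPT_PATH = "./config/chkp_file"
CONFIG_YML = "./config/chemt5.yml"

assert Path(CKPT_PATH).is_file(), f"checkpoint not found at {CKPT_PATH}"
cfg = yaml.safe_load(Path(CONFIG_YML).read_text())
print(f"config loaded with top-level keys: {list(cfg)}")
```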
Then run:

python frontend/front_end.py

- The SMILES string for your target molecule is entered directly in the front-end user interface.
- Make sure your conda environment is activated and all dependencies are installed before running.
The front end will automatically use the API key from the `OPENAI_API_KEY` environment variable.
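You can verify the key is visible to Python before launching. A minimal check, assuming the standard pattern of reading the variable via `os.environ`:

```python
import os

# Fails fast if the key was not exported in the current shell.
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise SystemExit("OPENAI_API_KEY is not set; export it before launching MIDAS.")
print(f"OPENAI_API_KEY found ({len(api_key)} chars)")
```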
Pre-trained models can be downloaded from Zenodo.
wget -P checkpoints/ https://zenodo.org/record/8183747/files/crossdocked_cond.ckpt

It will be stored in the ./checkpoints folder.
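As an optional sanity check (a sketch assuming PyTorch is installed and the file is a standard Lightning-style checkpoint), you can load it on CPU and list its top-level keys:

```python
import torch

# weights_only=False is needed on newer PyTorch to unpickle a full
# Lightning checkpoint; only use it for files from a source you trust.
ckpt = torch.load("checkpoints/crossdocked_cond.ckpt", map_location="cpu", weights_only=False)
print(list(ckpt.keys()))  # e.g. 'state_dict', 'hyper_parameters' in a typical Lightning layout
```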
- Follow the instructions below to load the CrossDocked dataset.
After downloading the dataset, run:
python process_crossdock.py /PATH/TO/CROSSDOCK/ --outdir /PATH/TO/OUTDIR/crossdocked_pocket10_proc_noH_ca_only/ --no_H --ca_only
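A quick, hypothetical way to confirm preprocessing wrote output (the file layout is not specified here, so this only counts files under the `--outdir` you passed above):

```python
from pathlib import Path

# Count everything under the processed-output directory.
out_dir = Path("/PATH/TO/OUTDIR/crossdocked_pocket10_proc_noH_ca_only/")
n_files = sum(1 for p in out_dir.rglob("*") if p.is_file())
print(f"{n_files} files under {out_dir}")
```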
- Create language description dataset

python generate_text_descriptions.py PATH/TO/crossdocked_pocket10 --out_csv /PATH/TO/DATAFILE/descriptions.csv --openai_api_key YOUR_API_KEY --openai_model gpt-3.5-turbo --num_procs 32
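Before combining, it can help to confirm the generated CSV exposes the columns the combine script below relies on (`name`, `text_func`, `text_llm`). A minimal check:

```python
import csv

# Read just the header row of the generated descriptions file.
with open("/PATH/TO/DATAFILE/descriptions.csv") as f:
    header = next(csv.reader(f))
print(header)  # the combine script below expects at least: name, text_func, text_llm
```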
- Combine dataset

import csv
import random
import re

def fix_name(name: str) -> str:
    """Strip '#...' fragments and align the pocket entry with the ligand basename."""
    name = re.sub(r"\#\w+", "", name)  # drop any '#token' fragments
    try:
        # Basename of the ligand .sdf file at the end of the path.
        second_block = re.search(r"/([^/]+)\.sdf$", name).group(1)
        # Rename the pocket entry so it matches the ligand basename.
        fixed = re.sub(r"/([^/]+)_pocket", f"/{second_block}_pocket", name, count=1)
    except AttributeError:
        # re.search found no .sdf suffix; leave the name unchanged.
        return name
    return fixed

# Input CSVs
SRC_FILES = [
    "/PATH/TO/DATAFILE/descriptions.csv",
]
DST_CSV = "/PATH/TO/OUTPUT/descriptions_selected.csv"

# Possible source columns for text (extend if more were added)
USE_COLUMNS = ["text_func", "text_llm"]

rows = []

# Collect rows from all source CSVs
for src in SRC_FILES:
    with open(src) as f_in:
        reader = csv.DictReader(f_in)
        for row in reader:
            if row["text_func"] != "No prominent functional groups identified.":
                # Randomly choose one of the available text columns.
                col = random.choice(USE_COLUMNS)
                rows.append({"name": fix_name(row["name"]), "text": row[col]})
            else:
                # No functional-group text; fall back to the LLM description.
                rows.append({"name": fix_name(row["name"]), "text": row["text_llm"]})

# Shuffle them
random.shuffle(rows)

# Write to destination
with open(DST_CSV, "w", newline="") as f_out:
    writer = csv.DictWriter(f_out, fieldnames=["name", "text"])
    writer.writeheader()
    writer.writerows(rows)

print(f"Wrote {len(rows)} rows to {DST_CSV}")
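To illustrate what `fix_name` does, here is a made-up name in the format the regexes above assume (actual CrossDocked names may differ in detail):

```python
# Hypothetical example: the '#rec' token is stripped and the pocket entry
# is renamed to match the ligand basename ('1abc_ligand').
print(fix_name("DIR/1abc_pocket.pdb#rec/DIR/1abc_ligand.sdf"))
# -> DIR/1abc_ligand_pocket.pdb/DIR/1abc_ligand.sdf
```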
- Precompute Embeddings

python precompute_text_embeddings.py --text_csv /PATH/TO/OUTPUT/descriptions_selected.csv --model_name GT4SD/multitask-text-and-chemistry-t5-base-standard --output /PATH/TO/EMBEDDINGS/text_embeddings.npz --batch_size 32 --device cuda
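A quick way to inspect the resulting archive (the per-entry layout is an assumption here, so this only prints keys and shapes):

```python
import numpy as np

# Open the precomputed embeddings and peek at a few entries.
data = np.load("/PATH/TO/EMBEDDINGS/text_embeddings.npz")
print(data.files[:5])  # first few archive keys
first = data[data.files[0]]
print(first.shape, first.dtype)
```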
- Train model (update the respective ckpt.yml paths and make sure the checkpoints are downloaded, see above)

python train.py --config configs/ckpt.yml \
    --resume /PATH/TO/CHECKPOINT/ckpt.ckpt

The code for training and the model architecture was adapted from DiffSBDD and extended with a text-conditioning module.

