Carlos Vonessen*,
Charles Harris*,
Miruna Cretu*,
Pietro Liò
GenBio Workshop @ ICML 2025, *Core contributor
- State-of-the-art performance on PoseBusters (link)
- 10x speed-up at sampling time (see Table 1)
- More parameter efficient (see Figure 1)
- Standard non-equivariant Transformer
- Lean and extensible implementation
Introduction to Repo: This repository is based on the Lightning-Hydra template, where you can find an introduction to Hydra for PyTorch and general usage instructions.
Downloading datasets: The processed datasets are available for GEOM-Drugs and QM9. Move all splits to src/data without renaming them. Running src/train.py for the first time will generate the LMDB dataset; this happens only once and can take about an hour.
Checkpoints: We currently provide checkpoints for two models trained on GEOM-Drugs: TABASCO-mild (3.7M) and TABASCO-hot (15M). More to follow!
```bash
conda env create -f environment.yaml
conda activate tabasco
```

Training: The training configs are available under configs/experiment and override the defaults in the other configs/* folders. To train the TABASCO-hot model from the paper, run:

```bash
python src/train.py experiment=hot_geom trainer=gpu
```

Multi-GPU training is available via torchrun, and trainer parameters are customizable in configs/trainer. Depending on your setup, you may need to pass additional command-line arguments to torchrun. For example, for two GPUs on one node using DDP (assuming a suitable ddp.yaml config), run:

```bash
torchrun --nproc_per_node=2 --nnodes=1 src/train.py experiment=hot_geom trainer=ddp
```

Sampling: We provide two scripts for sampling from a model checkpoint, along with some convenient parameters to modify. Unconditional sampling is called with:
```bash
python src/sample.py \
    --num_mols 1000 --num_steps 100 \
    --checkpoint path/to/model.ckpt \
    --output_path path/to/output/folder
```

Boosting Physical Plausibility: This script samples molecules with boosted physical quality (Section 3.5). Here, guidance sets the step size of each gradient step, step-switch sets the point at which to switch to UFF bound guidance, and to-center sets whether to regress to the interval center.
```bash
python src/sample_uff_bounds.py \
    --guidance 0.01 --step-switch 90 --to-center False \
    --ckpt path/to/model.ckpt --output-dir path/to/output/folder
```

Model Architecture: The model uses a deliberately simplified, non-equivariant Transformer that treats molecular generation as a sequence modeling problem (see the positional encodings). Coordinates and atom types are jointly embedded with time and positional encodings, then processed through standard Transformer blocks. No explicit bond information is included; instead, the model relies on generating physically sensible coordinates so that standard cheminformatics tools can infer bonds reliably. Optional cross-attention layers allow separate processing of the coordinate and atom-type domains before final MLP heads predict the outputs. The full model implementation is easily extensible compared to specialized equivariant architectures.
Interpolants: We combine the required interpolant functionality in one base Interpolant class to make the code more readable and extensible. In practice, we found that this significantly increases iteration speed and improves verifiability. The SDEMetricInterpolant manages coordinate flows with configurable noise scaling and centering, while the DiscreteInterpolant handles categorical atom types in the discrete diffusion framework. Each interpolant defines four key operations: noise sampling, path creation between data points, loss computation, and explicit-Euler stepping during generation. This modular design allows mixing different interpolation strategies for different molecular properties while maintaining a unified training loop.
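A minimal sketch of these four operations (assumed names; a plain linear path stands in for the repository's SDEMetricInterpolant, which additionally applies noise scaling and centering) could look like:

```python
import torch


class LinearCoordInterpolant:
    """Illustrative sketch of the four interpolant operations:
    noise sampling, path creation, loss computation, and Euler stepping.
    Uses a simple linear path x_t = (1 - t) * noise + t * data."""

    def sample_noise(self, shape):
        # Noise sampling: draw a prior sample for the coordinates.
        return torch.randn(shape)

    def interpolate(self, data, noise, t):
        # Path creation: point on the path between noise (t=0) and data (t=1).
        return (1 - t) * noise + t * data

    def loss(self, pred_velocity, data, noise):
        # Loss computation: regress the constant velocity of the linear path.
        target = data - noise
        return ((pred_velocity - target) ** 2).mean()

    def euler_step(self, x_t, pred_velocity, dt):
        # Explicit-Euler stepping during generation.
        return x_t + dt * pred_velocity


interp = LinearCoordInterpolant()
data = torch.randn(4, 3)
noise = interp.sample_noise(data.shape)
x_half = interp.interpolate(data, noise, t=0.5)
loss = interp.loss(data - noise, data, noise)  # a perfect prediction gives zero loss
print(loss.item())  # 0.0
```

Keeping all four operations behind one interface is what lets a single training and sampling loop drive both the continuous coordinate flow and the discrete atom-type diffusion.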
```bibtex
@article{vonessen2025tabasco,
  title={TABASCO: A Fast, Simplified Model for Molecular Generation with Improved Physical Quality},
  author={Carlos Vonessen and Charles Harris and Miruna Cretu and Pietro Liò},
  year={2025},
  url={https://arxiv.org/abs/2507.00899},
}
```