Carlos Vonessen*,
Charles Harris*,
Miruna Cretu*,
Pietro Liò
GenBio Workshop @ ICML 2025, *Core contributor
- State-of-the-art performance on PoseBusters (link)
- 10x speed-up at sampling time (see Table 1)
- More parameter efficient (see Figure 1)
- Standard non-equivariant Transformer
- Lean and extensible implementation
Introduction to Repo: This repository is based on the Lightning-Hydra template, where you can find an introduction to Hydra for PyTorch and general usage instructions.
Downloading datasets: The processed datasets are available for GEOM-Drugs and QM9. Move all splits to src/data without renaming them. Running src/train.py for the first time will generate the LMDB dataset; this happens only once and can take about an hour.
Checkpoints: We currently provide checkpoints for two models trained on GEOM-Drugs: TABASCO-mild (3.7M) and TABASCO-hot (15M). More to follow!
```bash
conda env create -f environment.yaml
conda activate tabasco
```

Training: The training configs are available under configs/experiment and override the defaults in the other configs/* folders. To train the TABASCO-hot model from the paper, run:

```bash
python src/train.py experiment=hot_geom trainer=gpu
```

Multi-GPU training is available via torchrun, and trainer parameters are customizable in configs/trainer. Depending on your setup, you may need to pass additional command-line arguments to torchrun. For example, for two GPUs on one node using DDP (assuming a suitable ddp.yaml config), run:

```bash
torchrun --nproc_per_node=2 --nnodes=1 src/train.py experiment=hot_geom trainer=ddp
```

Sampling: We provide two scripts for sampling from a model checkpoint, along with some convenient parameters to modify. Unconditional sampling is called with:
```bash
python src/sample.py \
    --num_mols 1000 --num_steps 100 \
    --checkpoint path/to/model.ckpt \
    --output_path path/to/output/folder
```

Boosting Physical Plausibility: This script samples molecules with boosted physical quality (Section 3.5). Here, guidance sets the step size of each gradient step, step-switch sets the point at which to switch to UFF bound guidance, and to-center sets whether to regress to the interval center.
```bash
python src/sample_uff_bounds.py \
    --guidance 0.01 --step-switch 90 --to-center False \
    --ckpt path/to/model.ckpt --output-dir path/to/output/folder
```

Model Architecture: The model uses a deliberately simplified, non-equivariant Transformer that treats molecular generation as a sequence modeling problem (see the positional encodings). Coordinates and atom types are jointly embedded with time and positional encodings, then processed through standard Transformer blocks. No explicit bond information is included; instead, the model relies on generating physically sensible coordinates so that standard cheminformatics tools can infer bonds reliably. Optional cross-attention layers allow separate processing of the coordinate and atom-type domains before final MLP heads predict the outputs. The full model implementation is easily extensible compared to specialized equivariant architectures.
Interpolants: We combine the required interpolant functionality in one base Interpolant class to make the code more readable and extensible. In practice, we found that this significantly increases iteration speed and improves verifiability. The SDEMetricInterpolant manages coordinate flows with configurable noise scaling and centering, while the DiscreteInterpolant handles categorical atom types in the discrete diffusion framework. Each interpolant defines four key operations: noise sampling, path creation between data points, loss computation, and explicit-Euler stepping during generation. This modular design allows mixing different interpolation strategies for different molecular properties while maintaining a unified training loop.
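A minimal sketch of these four operations (assumed names; a plain linear path stands in for the repository's SDEMetricInterpolant, which additionally applies noise scaling and centering) could look like:

```python
import torch


class LinearCoordInterpolant:
    """Illustrative sketch of the four interpolant operations:
    noise sampling, path creation, loss computation, and Euler stepping.
    Uses a simple linear path x_t = (1 - t) * noise + t * data."""

    def sample_noise(self, shape):
        # Noise sampling: draw a prior sample for the coordinates.
        return torch.randn(shape)

    def interpolate(self, data, noise, t):
        # Path creation: point on the path between noise (t=0) and data (t=1).
        return (1 - t) * noise + t * data

    def loss(self, pred_velocity, data, noise):
        # Loss computation: regress the constant velocity of the linear path.
        target = data - noise
        return ((pred_velocity - target) ** 2).mean()

    def euler_step(self, x_t, pred_velocity, dt):
        # Explicit-Euler stepping during generation.
        return x_t + dt * pred_velocity


interp = LinearCoordInterpolant()
data = torch.randn(4, 3)
noise = interp.sample_noise(data.shape)
x_half = interp.interpolate(data, noise, t=0.5)
loss = interp.loss(data - noise, data, noise)  # a perfect prediction gives zero loss
print(loss.item())  # 0.0
```

Keeping all four operations behind one interface is what lets a single training and sampling loop drive both the continuous coordinate flow and the discrete atom-type diffusion.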
```bibtex
@article{vonessen2025tabasco,
  title={TABASCO: A Fast, Simplified Model for Molecular Generation with Improved Physical Quality},
  author={Carlos Vonessen and Charles Harris and Miruna Cretu and Pietro Liò},
  year={2025},
  url={https://arxiv.org/abs/2507.00899},
}
```