CANDI (Continuous and Discrete Diffusion) is a hybrid diffusion model that combines continuous and discrete noise processes for text generation. It aims to bridge continuous diffusion models, as used in computer vision, and discrete token-based language modeling.
CANDI uses a hybrid kernel to coordinate both discrete and continuous corruption explicitly throughout training.
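To make the idea of a hybrid kernel concrete, here is a minimal sketch of one simple way discrete and continuous corruption could be coordinated. It is an illustrative assumption, not the kernel implemented in this repository: each position is first marked as corrupted or kept (the discrete part), and only corrupted positions receive Gaussian noise on their one-hot vector (the continuous part). The function name `hybrid_corrupt` and the parameters `keep_prob` and `sigma` are hypothetical.

```python
import random


def hybrid_corrupt(tokens, vocab_size, keep_prob, sigma, rng):
    """Toy hybrid corruption (hypothetical, not the repository's kernel).

    Returns (noisy_vectors, corrupted_mask) for a list of token ids.
    """
    noisy, mask = [], []
    for tok in tokens:
        one_hot = [1.0 if i == tok else 0.0 for i in range(vocab_size)]
        # Discrete part: decide whether this position is corrupted at all.
        corrupted = rng.random() >= keep_prob
        if corrupted:
            # Continuous part: add Gaussian noise to the one-hot vector.
            one_hot = [v + rng.gauss(0.0, sigma) for v in one_hot]
        noisy.append(one_hot)
        mask.append(corrupted)
    return noisy, mask


if __name__ == "__main__":
    rng = random.Random(0)
    vectors, mask = hybrid_corrupt([2, 0, 1, 2], vocab_size=3,
                                   keep_prob=0.5, sigma=1.0, rng=rng)
    for vec, is_corrupted in zip(vectors, mask):
        print(is_corrupted, [round(v, 2) for v in vec])
```

In this toy version, kept positions stay exactly one-hot, so the two noise sources can be scheduled independently through `keep_prob` and `sigma`. See `algo.py` and the paper for the actual formulation.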
This codebase is built on the DUO codebase, available at https://github.com/s-sahoo/duo?tab=readme-ov-file.
- Clone the repository:

      git clone https://github.com/patrickpynadath1/candi.git
      cd candi

- Install dependencies:

      pip install -r requirements.txt

- (Optional) Install Flash Attention for faster training:

      pip install flash-attn --no-build-isolation

- Download the OWT data:

      bash manual_download.sh

This codebase includes the code for running experiments on Text8 and OWT. We will integrate the QM9 experiments later. In general, we re-use the same experimental methodology and codebase from https://github.com/kuleshov-group/discrete-diffusion-guidance.
We include scripts for training models in scripts/slurm_scripts.
Run temperature sweeps for frontier analysis using the following scripts:
    # OpenWebText sweeps
    bash scripts/gen_ppl_owt_candi_sweep.sh
    # Text8 sweeps
    bash scripts/gen_text8_candi_sweep.sh

The project uses Hydra for configuration management. Key configuration files:
- Algorithm configs: configs/algo/ - Different diffusion algorithms (CANDI, MDLM, SEDD, etc.)
- Data configs: configs/data/ - Dataset configurations
- Model configs: configs/model/ - Model architecture settings
- Base config: configs/config.yaml - Main configuration file
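As a rough illustration of how this layout is typically used, Hydra lets you select a file from a config group on the command line with `group=option`, while dotted keys such as `a.b=value` override individual fields. The sketch below mimics only the group-selection step using the standard library; it does not call Hydra. It assumes the config root is `configs/`, and the option names `candi` and `text8` and the key `trainer.max_steps` are hypothetical placeholders, not names confirmed by this repository.

```python
from pathlib import PurePosixPath

CONFIG_ROOT = PurePosixPath("configs")
CONFIG_GROUPS = ("algo", "data", "model")  # groups listed in this README


def resolve_overrides(overrides):
    """Map Hydra-style `group=option` overrides to the YAML file each would select.

    Overrides whose key is not a known config group (for example dotted
    field overrides) are ignored here.
    """
    selected = {}
    for item in overrides:
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        if key in CONFIG_GROUPS:
            selected[key] = str(CONFIG_ROOT / key / f"{value}.yaml")
    return selected


if __name__ == "__main__":
    # Hypothetical option names, for illustration only.
    args = ["algo=candi", "data=text8", "trainer.max_steps=1000"]
    for group, path in resolve_overrides(args).items():
        print(f"{group:5s} -> {path}")
```

Under these assumptions, an invocation such as `python main.py algo=candi data=text8` would compose the base config with the selected group files. Check configs/ for the option names that actually exist.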
- algo.py: Core algorithm implementations (CANDI, MDLM, DUO, etc.)
- main.py: Main training and evaluation script
- dataloader.py: Data loading and preprocessing utilities
- models/: Model architectures (DiT)
- metrics.py: Evaluation metrics and utilities
- trainer_base.py: Base trainer class with common functionality
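For readers unfamiliar with the temperature sweeps mentioned earlier, the sketch below shows the general effect of sampling temperature on an output distribution: lower temperatures sharpen it (lower entropy), higher temperatures flatten it. This is a generic illustration with toy logits, not the code in metrics.py or the sweep scripts, and it assumes temperature is applied by dividing logits before the softmax.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5, -1.0]  # toy values, not model outputs
    for t in (0.5, 0.8, 1.0, 1.2):
        probs = softmax_with_temperature(logits, t)
        print(f"temperature={t:.1f}  entropy={entropy(probs):.3f} nats")
```

Sweeping the temperature and recording a quality metric alongside a diversity metric at each setting is one common way to trace out a trade-off frontier.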
If you use this code in your research, please cite:
    @article{pynadath2025candi,
      title={CANDI: Hybrid Discrete-Continuous Diffusion Models},
      author={Patrick Pynadath and Jiaxin Shi and Ruqi Zhang},
      journal={arXiv preprint},
      year={2025}
    }

For more details, visit our project page or check out the paper.

