Stephen Bates1,† | Tommi Jaakkola1,†
- [2025/09/18] HDLM is accepted to NeurIPS 2025!
- [2025/10/12] Paper is available on arXiv!
- [2025/10/12] Code is released!
We present the Hierarchical Diffusion Language Model (HDLM), a novel framework for training discrete diffusion models via time-varying next-semantic-scale prediction. HDLM extends the standard Masked Diffusion Model (MDM) by introducing intermediate hierarchies (termed cluster tokens) between clean tokens and masked tokens. In the forward process, each token is independently perturbed to its higher-level ancestor with more abstract semantics according to the scheduler, while in the reverse process the model progressively predicts the next, more detailed semantic scale. Taken together, HDLM provides a general time-varying next-semantic-scale prediction process for language modeling. We derive closed-form expressions for the diffusion Evidence Lower Bound (ELBO) and show that HDLM can be implemented flexibly while including the existing MDM as a special case. This repository contains all training and evaluation code needed to reproduce the results in the paper.
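To build intuition for the forward process described above, here is a minimal sketch of perturbing a single token toward its cluster ancestor and eventually the mask. This is an illustration, not the paper's exact scheduler: the two-stage thresholds, the `cluster_of` mapping, and the `mask_id` value are all made up for the example.

```python
import random

def forward_perturb(token_id, cluster_of, t, gamma=1.0, mask_id=50257, rng=random):
    """Illustrative two-stage forward process: a clean token is first
    coarsened to its cluster ancestor, and eventually to the mask token.
    The thresholds below follow a simple t**gamma schedule, which is an
    assumption, not the paper's exact formulation."""
    u = rng.random()
    if u < t ** gamma:                # token has been perturbed at least once
        if u < (t ** gamma) * t:      # perturbed again: ancestor -> mask
            return mask_id
        return cluster_of[token_id]   # clean token -> cluster ancestor
    return token_id                   # token is still clean
```

At `t=0` the token is always returned unchanged, and at `t=1` it is always fully masked; in between, the token passes through its more abstract cluster ancestor.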
Set up the environment:

```bash
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt && pip install -e .
```

You can download our precalculated files in `hdlm/clusters` for the existing cluster counts in [1, 2, 4, 8, 16, 32, 64, 128, 256] (GPT-2 tokenizer, OpenWebText dataset, GIDD pretrained models), or preprocess your own by running `hdlm/compute_cluster.py` for custom numbers of clusters, tokenizers, datasets, or pretrained models. Make sure the names/paths of these cluster files match `cluster_dict_path`, `cluster_embed_path`, and `pretrained_model_name` in your training configs, as in the examples.
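For intuition on what clustering tokens looks like, the toy pure-Python k-means below groups small 2-D "embeddings" into clusters. This is only a sketch of the general idea under our assumption of a k-means-style grouping; see `hdlm/compute_cluster.py` for the actual preprocessing over full-size pretrained embeddings.

```python
def kmeans(vectors, k, iters=20):
    """Cluster embedding vectors into k groups with plain k-means.
    Toy sketch only; the real pipeline works on pretrained model embeddings."""
    def d2(a, b):  # squared Euclidean distance
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Deterministic farthest-point initialization.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(max(vectors, key=lambda v: min(d2(v, c) for c in centroids)))

    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid for every vector.
        for i, v in enumerate(vectors):
            assign[i] = min(range(k), key=lambda c: d2(v, centroids[c]))
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors)) if assign[i] == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assign
```

For example, `kmeans([(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0)], 2)` assigns the two near-origin points to one cluster and the two far points to the other.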
To reproduce the training runs from the paper, you can use the following commands.
In this example we train on a single node with 8 GPUs; feel free to adjust the `--nnodes` and `--nproc_per_node` arguments to match your setup.
Whenever needed, change the checkpoint saving directory by adjusting `save_dir` in `hdlm/configs/logging/default.yaml`, and the data storage directory via `cache_dir` in `hdlm/configs/data/defaults.yaml`.
Key hyperparameters include:

- `cluster_size`: number of clusters ($n$ in the paper)
- `gamma`: forward process schedule ($\gamma$ in the paper)
- `p_perturb`: probability of stochastic perturbations ($1-\xi$ in the paper)
You are also welcome to try out other model / training / loss hyperparameters.
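As a rough picture of how these knobs appear in a training config, a fragment might look like the following; the values and key placement are illustrative, so check the files under `hdlm/configs/` for the real structure.

```yaml
# Illustrative fragment only; see hdlm/configs/ for the actual layout.
cluster_size: 64   # n in the paper
gamma: 1.0         # forward process schedule exponent
p_perturb: 0.5     # 1 - xi in the paper
```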
(Optional) Log into W&B with `wandb login` for experiment tracking, or disable it via `wandb disabled`.
```bash
# HDLM-small-64
torchrun --nnodes 1 --nproc_per_node 8 hdlm/train.py --config-name hdlm-small-cluster_64-gamma_1.0-xi_1.0 logging.run_name="'small-hdlm-cluster_64-gamma_1.0-xi_1.0-owt'"

# GIDD+ baseline
torchrun --nnodes 1 --nproc_per_node 8 hdlm/train.py --config-name gidd logging.run_name="'small-gidd+-owt-pu=0.0'"

# MDLM baseline
torchrun --nnodes 1 --nproc_per_node 8 hdlm/train.py --config-name mdlm logging.run_name="'small-mdlm-owt'"

# AR baseline
torchrun --nnodes 1 --nproc_per_node 8 hdlm/train.py --config-name ar logging.run_name="'small-ar-owt'"
```

There are also a couple of scripts to run inference and evaluate the trained models.
The following command will generate `num_samples=256` samples in `num_denoising_steps=512` iterations from the model checkpoint located at `path` and save them to `samples_dir=samples.pt`.

```bash
python hdlm/eval/generate_samples.py path=./outputs/path/to/checkpoint/ samples_dir=samples.pt num_samples=256 num_denoising_steps=512 batch_size=16
```

Given a file containing samples generated with the `generate_samples.py` script, the following command will compute the generative PPL.
Here we assume that the diffusion model used to generate the samples in `samples.pt` uses the `gpt2` tokenizer, and we compute generative PPL with `gpt2-large` as the reference model. The results will be saved to `metrics_path=metrics.json`.
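Conceptually, generative PPL is the exponentiated mean negative log-likelihood that the reference model assigns to the generated tokens; the helper name below is hypothetical and only sketches the formula.

```python
import math

def generative_ppl(token_logprobs):
    """exp of the mean negative log-likelihood over all generated tokens,
    as scored by a reference model such as gpt2-large (sketch only)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# If the reference model assigns probability 0.5 to every token,
# generative_ppl([math.log(0.5)] * 4) evaluates to 2.0.
```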
```bash
python hdlm/eval/generative_ppl.py samples_path=samples.pt model_tokenizer=gpt2 pretrained_model=gpt2-large batch_size=1 metrics_path=metrics.json
```

A simple helper script computes the loss of a trained model on the entire validation split.
```bash
python hdlm/eval/loss.py path=./outputs/path/to/checkpoint/ batch_size=32
```

If you find our work helpful, please consider giving a star ⭐ and citation 📝
```bibtex
@article{zhou2025next,
  title={Next Semantic Scale Prediction via Hierarchical Diffusion Language Models},
  author={Zhou, Cai and Wang, Chenyu and Zhang, Dinghuai and Tong, Shangyuan and Wang, Yifei and Bates, Stephen and Jaakkola, Tommi},
  journal={arXiv preprint arXiv:2510.08632},
  year={2025}
}
```

The code is built upon the repositories below; we thank all the contributors for open-sourcing their work.
