Orchestrating Dual-Boundaries: An Arithmetic Intensity Inspired Acceleration Framework for Diffusion Language Models
We propose ODB-dLLM, an arithmetic intensity inspired framework for accelerating diffusion-based large language models (dLLMs). By analyzing the interleaved compute- and memory-bound phases in existing dLLM inference frameworks, ODB-dLLM introduces an adaptive length prediction strategy and jump-share speculative decoding to match the compute and memory characteristics of the hardware platform, thereby maximizing inference efficiency.
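To make the compute- vs memory-bound distinction concrete, the sketch below applies the standard roofline rule: a kernel whose arithmetic intensity (FLOPs per byte of memory traffic) exceeds the machine balance (peak FLOP/s divided by peak bandwidth) is compute-bound, otherwise memory-bound. The hardware numbers and kernel sizes are illustrative placeholders, not taken from the paper.

```python
# Toy roofline check: classify a kernel as compute- or memory-bound.
# Hardware peaks below are illustrative placeholders, not a specific GPU.

def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

def bound_kind(flops, bytes_moved, peak_flops, peak_bw):
    """Compare kernel intensity against the machine balance point."""
    machine_balance = peak_flops / peak_bw  # FLOPs/byte at the roofline ridge
    ai = arithmetic_intensity(flops, bytes_moved)
    return "compute-bound" if ai >= machine_balance else "memory-bound"

# Large matmul-like step (many tokens at once): high intensity.
print(bound_kind(flops=2 * 4096 * 4096 * 4096,
                 bytes_moved=3 * 4096 * 4096 * 2,
                 peak_flops=300e12, peak_bw=2e12))  # compute-bound

# Single-token GEMV-like step: each weight byte is used once.
print(bound_kind(flops=2 * 4096 * 4096,
                 bytes_moved=4096 * 4096 * 2,
                 peak_flops=300e12, peak_bw=2e12))  # memory-bound
```

This is the intuition behind orchestrating the two boundaries: phases on opposite sides of the machine balance point benefit from different optimizations.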
Overall Performance:
ODB-dLLM achieves a 46×–162× speedup on LLaDA-Instruct and a 50×–182× speedup on LLaDA-1.5. Compared with Fast-dLLM, ODB-dLLM delivers 2.63×–6.30× and 2.60×–7.22× speedups on the two models, respectively.
Installation:
1. Clone the repository:
```shell
git clone https://github.com/PKU-SEC-Lab/ODB-dLLM.git
cd ODB-dLLM
```
2. Install dependencies:
```shell
conda create --name ODB-dLLM python=3.10
conda activate ODB-dLLM
pip install -r requirements.txt
```
Evaluation:
For the LLaDA-8B-Instruct model:
```shell
cd llada_instruct
./eval_llada_instruct.sh <GPU_ID> <Task_Name> 'GSAI-ML/LLaDA-8B-Instruct' <Settings>
```
For the LLaDA-1.5 model:
```shell
cd llada_1_5
./eval_llada_1_5.sh <GPU_ID> <Task_Name> 'GSAI-ML/LLaDA-1.5' <Settings>
```
Parameter descriptions:
- `<GPU_ID>`: the GPU to run on
- `<Task_Name>`: the benchmark for evaluation
  - Options: gsm8k, minerva_math, bbh, humaneval, mbpp
- `<Settings>`: the configuration for evaluation
  - Options:
    - `llada_baseline`: LLaDA baseline
    - `fast_dllm_baseline`: Fast-dLLM baseline (parallel + dualcache)
    - `alp`: parallel + dualcache + adaptive length prediction
    - `accept-jump`: parallel + dualcache + adaptive length prediction + accept-jump speculative
    - `jump-share`: parallel + dualcache + adaptive length prediction + jump-share speculative
Example:
```shell
cd llada_instruct
./eval_llada_instruct.sh 0 gsm8k 'GSAI-ML/LLaDA-8B-Instruct' jump-share
```
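To evaluate every supported benchmark in one go, a simple sweep over the task list can be scripted. This is a hypothetical helper, not part of the repo; it assumes the `eval_llada_instruct.sh` interface shown above and only prints the commands (a dry run) so it can be inspected before running.

```shell
# Dry-run sweep over all benchmarks with one setting (hypothetical helper).
# Remove the 'echo' to actually launch each evaluation.
GPU_ID=0
SETTING=jump-share
for TASK in gsm8k minerva_math bbh humaneval mbpp; do
  echo "./eval_llada_instruct.sh ${GPU_ID} ${TASK} 'GSAI-ML/LLaDA-8B-Instruct' ${SETTING}"
done
```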
If our work assists your research, feel free to give us a star ⭐ or cite us using:
```bibtex
@article{wei2025orchestrating,
  title={Orchestrating Dual-Boundaries: An Arithmetic Intensity Inspired Acceleration Framework for Diffusion Language Models},
  author={Wei, Linye and Chen, Wenjue and Tang, Pingzhi and Guo, Xiaotian and Ye, Le and Wang, Runsheng and Li, Meng},
  journal={arXiv preprint arXiv:2511.21759},
  year={2025}
}
```
This repo is largely based on Fast-dLLM. We would also like to thank the authors of LLaDA and LLaDA-1.5 for their excellent work and open-source contributions.
If you have any questions, please contact us via email: [email protected].


