Project Page | arXiv | Models
Example generation on AIME24. RCD increases parallelism by 4x while maintaining the baseline's peak accuracy.
This repository contains the code to replicate our study "Residual Context Diffusion Language Models". In this study, we point out that diffusion large language models (dLLMs) enable parallel decoding but often trail autoregressive models in accuracy. A key culprit is the inference-time remasking strategy, which commits only high-confidence tokens and discards the rest, wasting intermediate computation.
RCD introduces a residual denoising mechanism that turns discarded token distributions into contextual residuals and injects them into the next denoising step. With a two-stage training pipeline, RCD avoids backprop-through-time memory costs while preserving the benefits of residual feedback.
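The remasking-versus-residual idea can be sketched in a few lines of Python. This is a toy illustration only, not the released implementation: the function name `denoise_step`, the dict-based token distributions, and the commit logic below are all illustrative assumptions, shown solely to make the mechanism concrete.

```python
# Toy sketch (NOT the paper's implementation) of one denoising step with
# confidence-threshold remasking. Standard remasking commits high-confidence
# tokens and throws the rest away; RCD instead keeps the discarded
# distributions as residual context for the next step.

def denoise_step(probs, committed, threshold=0.85):
    """probs: per-position token distributions (dicts of token -> prob).
    committed: per-position committed token (None while still masked).
    Returns updated commitments and the residual context RCD would reuse."""
    residual_context = [None] * len(probs)
    for i, dist in enumerate(probs):
        if committed[i] is not None:
            continue  # this position was already decoded in an earlier step
        token, conf = max(dist.items(), key=lambda kv: kv[1])
        if conf >= threshold:
            committed[i] = token        # confident: commit the token
        else:
            residual_context[i] = dist  # RCD: keep the full distribution
    return committed, residual_context

probs = [{"a": 0.9, "b": 0.1}, {"c": 0.5, "d": 0.5}]
committed, residual = denoise_step(probs, [None, None])
# position 0 commits "a"; position 1 stays masked, but its distribution
# survives as residual context instead of being discarded
```

In the actual model the residual is injected into the next denoising step's input rather than stored in a Python list, but the bookkeeping shown here is the core idea.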
- [2025/02] Project page, arXiv paper, and models are released.
TL;DR: RCD consistently improves diffusion reasoning accuracy over Sequential Denoising (SeqD) across both SDAR and LLaDA, with the biggest gains on harder competition-style benchmarks (AIME24/25) and MinervaMath.
- Models: SDAR 4B / 8B, block size b=32 / 64 (KV cache reuse)
- Eval: SeqD/RCD use a 16,384-token sequence length; Chat uses 512 tokens (1,024 for AIME); confidence threshold = 0.85
| Model | Variant | GSM8K | MATH500 | AIME24 | AIME25 |
|---|---|---|---|---|---|
| SDAR-4B-b32 | Chat | 86.13 | 50.20 | 5.83 | 2.50 |
| SDAR-4B-b32 | SeqD | 81.73 | 61.20 | 6.04 | 11.88 |
| SDAR-4B-b32 | RCD | 85.67 | 70.80 | 11.04 | 17.50 |
| SDAR-4B-b64 | Chat | 85.90 | 49.80 | 6.25 | 1.67 |
| SDAR-4B-b64 | SeqD | 78.85 | 56.80 | 4.17 | 7.29 |
| SDAR-4B-b64 | RCD | 84.76 | 67.80 | 13.75 | 15.83 |
| SDAR-8B-b32 | Chat | 88.40 | 50.00 | 6.46 | 4.17 |
| SDAR-8B-b32 | SeqD | 86.50 | 65.80 | 11.67 | 14.79 |
| SDAR-8B-b32 | RCD | 89.76 | 77.60 | 21.46 | 20.00 |
| SDAR-8B-b64 | Chat | 88.32 | 51.60 | 5.20 | 2.50 |
| SDAR-8B-b64 | SeqD | 82.87 | 64.20 | 7.08 | 9.79 |
| SDAR-8B-b64 | RCD | 88.70 | 73.60 | 15.00 | 19.79 |
- Eval: sequence length 512, single-token-per-step decoding
| Model | Variant | GSM8K | MinervaMath |
|---|---|---|---|
| LLaDA | Base | 70.30 | 31.40 |
| LLaDA | SeqD | 75.74 | 31.10 |
| LLaDA | RCD | 78.09 | 37.00 |
We provide all checkpoints of our models!
For sequential denoising dLLMs (standard SFT from base models):
| Name | URL |
|---|---|
| SeqD-SDAR-4B-b32-Thinking | model |
| SeqD-SDAR-4B-b64-Thinking | model |
| SeqD-SDAR-8B-b32-Thinking | model |
| SeqD-SDAR-8B-b64-Thinking | model |
| SeqD-LLaDA-8B-Instruct | model |
For residual denoising dLLMs (a SeqD reference model is required to warm-start the generation):
| Name | URL | Ref Model | URL |
|---|---|---|---|
| RCD-SDAR-4B-b32-Thinking | model | SeqD-SDAR-1.7B-b32-Thinking | model |
| RCD-SDAR-4B-b64-Thinking | model | SeqD-SDAR-1.7B-b64-Thinking | model |
| RCD-SDAR-8B-b32-Thinking | model | SeqD-SDAR-1.7B-b32-Thinking | model |
| RCD-SDAR-8B-b64-Thinking | model | SeqD-SDAR-1.7B-b64-Thinking | model |
| RCD-LLaDA-8B-Instruct | model | SeqD-LLaDA-8B-Instruct | model |
A minimal implementation of text generation can be found in the generate*.py scripts. These scripts run with only the standard transformers library as a dependency:
```shell
pip install transformers==4.52.3
```
```shell
# Run sequential denoising
CUDA_VISIBLE_DEVICES=0 python SDAR-ref/generate_seqd.py \
    --model_dir yuezhouhu/SeqD-SDAR-4B-b64-Thinking \
    --trust_remote_code \
    --block_length 64 \
    --denoising_steps 64 \
    --temperature 0 \
    --dtype bfloat16 \
    --confidence_threshold 0.85
```
```shell
# Run residual denoising
CUDA_VISIBLE_DEVICES=0 python SDAR-target/generate_rcd.py \
    --model_dir yuezhouhu/RCD-SDAR-4B-b64-Thinking \
    --ref_model_dir yuezhouhu/SeqD-SDAR-1.7B-b64-Thinking \
    --trust_remote_code \
    --block_length 64 \
    --denoising_steps 64 \
    --temperature 0 \
    --dtype bfloat16 \
    --confidence_threshold 0.85
```

We provide the full training and evaluation code to reproduce our results.
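If you prefer to drive the generation scripts from Python, the shell invocations above translate directly to a subprocess call. The helper below is a sketch under our assumptions: `rcd_command` is an illustrative name, and the flags and paths are copied from the commands in this README; adjust them to your checkout.

```python
# Hedged sketch: build the argv for SDAR-target/generate_rcd.py, mirroring
# the shell command above. Flag names come from this README's examples.
import subprocess
import sys

def rcd_command(model_dir, ref_model_dir, block_length=64, steps=64,
                threshold=0.85, temperature=0, dtype="bfloat16"):
    """Assemble the residual-denoising generation command as a list of args."""
    return [
        sys.executable, "SDAR-target/generate_rcd.py",
        "--model_dir", model_dir,
        "--ref_model_dir", ref_model_dir,
        "--trust_remote_code",
        "--block_length", str(block_length),
        "--denoising_steps", str(steps),
        "--temperature", str(temperature),
        "--dtype", dtype,
        "--confidence_threshold", str(threshold),
    ]

cmd = rcd_command("yuezhouhu/RCD-SDAR-4B-b64-Thinking",
                  "yuezhouhu/SeqD-SDAR-1.7B-b64-Thinking")
# subprocess.run(cmd, check=True)  # uncomment to run (needs a CUDA GPU)
```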
- `LLaDA-ref/`: Reference Model (and baseline Sequential Denoising LLaDA model) code and configs.
- `LLaDA-target/`: Target Model code and configs.
- `SDAR-ref/`: Reference Model (and baseline Sequential Denoising SDAR models) code and configs.
- `SDAR-target/`: Target Model code and configs.
Each sub-project is self-contained and has its own environment:
- LLaDA reference: `./LLaDA-ref/README.md`
- LLaDA target: `./LLaDA-target/README.md`
- SDAR reference: `./SDAR-ref/README.md`
- SDAR target: `./SDAR-target/README.md`
- LLaDA:
  - Eval script: `LLaDA-*/examples/llada/eval_openmathinstruct.sh`
- SDAR:
  - Eval scripts: `SDAR-*/eval_simple.sh`, `SDAR-*/eval_aime.sh`
Training recipes live in each sub-project:
- LLaDA: `LLaDA-*/examples/llada/run.sh`
- SDAR: `SDAR-*/run.sh`
```bibtex
@misc{hu2026residualcontextdiffusionlanguage,
  title={Residual Context Diffusion Language Models},
  author={Yuezhou Hu and Harman Singh and Monishwaran Maheswaran and Haocheng Xi and Coleman Hooper and Jintao Zhang and Aditya Tomar and Michael W. Mahoney and Sewon Min and Mehrdad Farajtabar and Kurt Keutzer and Amir Gholami and Chenfeng Xu},
  year={2026},
  eprint={2601.22954},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2601.22954},
}
```

