# Residual Context Diffusion (RCD)

**Unleashing the Potential of Diffusion LLMs via Residual Denoising**

Project Page | arXiv | Models


*Figure: Example generation on AIME24. RCD increases parallelism by 4x while maintaining the baseline's peak accuracy.*

## Introduction

This repository contains the code to reproduce the experiments in "Residual Context Diffusion Language Models". In that work, we point out that diffusion large language models (dLLMs) enable parallel decoding but often trail autoregressive models in accuracy. A key culprit is the inference-time remasking strategy, which commits only high-confidence tokens and discards the rest, wasting the computation spent on the discarded predictions.

RCD introduces a residual denoising mechanism that turns the discarded token distributions into contextual residuals and injects them into the next denoising step. Combined with a two-stage training pipeline, RCD avoids the memory cost of backpropagation through time while preserving the benefits of residual feedback.
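The commit-versus-residual split can be sketched as follows. This is a toy illustration of our reading of the mechanism, not the repository's implementation: positions whose top probability clears the confidence threshold are committed, and every other position's full distribution is retained as context for the next step rather than thrown away.

```python
def denoise_step(distributions, threshold=0.85):
    """One remasking step (illustrative sketch): commit tokens whose top
    probability clears the confidence threshold; keep the full distribution
    of every other position as a contextual residual for the next step."""
    committed, residuals = {}, {}
    for pos, dist in enumerate(distributions):
        token, prob = max(dist.items(), key=lambda kv: kv[1])
        if prob >= threshold:
            committed[pos] = token   # high confidence: commit the token
        else:
            residuals[pos] = dist    # low confidence: carry as residual context
    return committed, residuals

# Toy distributions over a three-token vocabulary at two positions
dists = [
    {"a": 0.90, "b": 0.05, "c": 0.05},  # confident -> committed
    {"a": 0.50, "b": 0.30, "c": 0.20},  # uncertain -> kept as residual
]
committed, residuals = denoise_step(dists)
```

In a plain remasking scheme the second position's distribution would be discarded; here it survives as `residuals[1]` and can inform the next denoising step.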

## News

- **[2025/02]** Project page, arXiv preprint, and models are now available.

## Performance

**TL;DR:** RCD consistently improves diffusion reasoning accuracy over Sequential Denoising (SeqD) across both SDAR and LLaDA, with the largest gains on harder competition-style benchmarks (AIME24/25) and MinervaMath.

### SDAR

- Models: SDAR 4B / 8B, block size b=32 / 64 (KV cache reuse)
- Eval: SeqD/RCD use a 16,384-token sequence length; Chat uses 512 tokens (1,024 for AIME); confidence threshold = 0.85

| Model | Variant | GSM8K¹ | MATH500 | AIME24 | AIME25 |
|---|---|---|---|---|---|
| SDAR-4B-b32 | Chat² | 86.13 | 50.20 | 5.83 | 2.50 |
| SDAR-4B-b32 | SeqD | 81.73 | 61.20 | 6.04 | 11.88 |
| SDAR-4B-b32 | RCD | 85.67 | 70.80 | 11.04 | 17.50 |
| SDAR-4B-b64 | Chat² | 85.90 | 49.80 | 6.25 | 1.67 |
| SDAR-4B-b64 | SeqD | 78.85 | 56.80 | 4.17 | 7.29 |
| SDAR-4B-b64 | RCD | 84.76 | 67.80 | 13.75 | 15.83 |
| SDAR-8B-b32 | Chat² | 88.40 | 50.00 | 6.46 | 4.17 |
| SDAR-8B-b32 | SeqD | 86.50 | 65.80 | 11.67 | 14.79 |
| SDAR-8B-b32 | RCD | 89.76 | 77.60 | 21.46 | 20.00 |
| SDAR-8B-b64 | Chat² | 88.32 | 51.60 | 5.20 | 2.50 |
| SDAR-8B-b64 | SeqD | 82.87 | 64.20 | 7.08 | 9.79 |
| SDAR-8B-b64 | RCD | 88.70 | 73.60 | 15.00 | 19.79 |

### LLaDA

- Eval: sequence length 512, single-token-per-step decoding

| Model | Variant | GSM8K | MinervaMath |
|---|---|---|---|
| LLaDA | Base³ | 70.30 | 31.40 |
| LLaDA | SeqD | 75.74 | 31.10 |
| LLaDA | RCD | 78.09 | 37.00 |

## Model Zoo

We provide all checkpoints of our models!

For sequential denoising dLLMs (standard SFT from base models):

| Name | URL |
|---|---|
| SeqD-SDAR-4B-b32-Thinking | model |
| SeqD-SDAR-4B-b64-Thinking | model |
| SeqD-SDAR-8B-b32-Thinking | model |
| SeqD-SDAR-8B-b64-Thinking | model |
| SeqD-LLaDA-8B-Instruct | model |

For residual denoising dLLMs (a SeqD reference model is required to warm-start the generation):

| Name | URL | Ref Model | URL |
|---|---|---|---|
| RCD-SDAR-4B-b32-Thinking | model | SeqD-SDAR-1.7B-b32-Thinking | model |
| RCD-SDAR-4B-b64-Thinking | model | SeqD-SDAR-1.7B-b64-Thinking | model |
| RCD-SDAR-8B-b32-Thinking | model | SeqD-SDAR-1.7B-b32-Thinking | model |
| RCD-SDAR-8B-b64-Thinking | model | SeqD-SDAR-1.7B-b64-Thinking | model |
| RCD-LLaDA-8B-Instruct | model | SeqD-LLaDA-8B-Instruct | model |
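The two-model pairing above implies a simple control flow at inference time: the SeqD reference produces a warm start, and the RCD target then refines it while threading residual context between steps. The sketch below shows only that flow; `ref_step`, `target_step`, and the residual handling are hypothetical stand-ins, not the repository's API.

```python
def rcd_generate(ref_step, target_step, prompt, refine_steps=4):
    """Illustrative RCD inference flow (function names are hypothetical):
    the SeqD reference model warm-starts a draft, then the RCD target
    model refines it, carrying residual context across steps."""
    draft = ref_step(prompt)      # warm start from the SeqD reference
    residual = None               # no residual before the first target step
    for _ in range(refine_steps):
        draft, residual = target_step(draft, residual)
    return draft

# Mock models: the reference appends a draft marker, the target adds a
# refinement marker per step (placeholders for real model calls).
ref_step = lambda p: p + " [draft]"
target_step = lambda d, r: (d + "*", r)
out = rcd_generate(ref_step, target_step, "prompt", refine_steps=3)
# out == "prompt [draft]***"
```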

## Run RCD

A minimal implementation of text generation can be found in the generate*.py scripts. They run with only the standard transformers library as a dependency:

```shell
pip install transformers==4.52.3
```

```shell
# Run sequential denoising
CUDA_VISIBLE_DEVICES=0 python SDAR-ref/generate_seqd.py \
  --model_dir yuezhouhu/SeqD-SDAR-4B-b64-Thinking \
  --trust_remote_code \
  --block_length 64 \
  --denoising_steps 64 \
  --temperature 0 \
  --dtype bfloat16 \
  --confidence_threshold 0.85
```

```shell
# Run residual denoising
CUDA_VISIBLE_DEVICES=0 python SDAR-target/generate_rcd.py \
  --model_dir yuezhouhu/RCD-SDAR-4B-b64-Thinking \
  --ref_model_dir yuezhouhu/SeqD-SDAR-1.7B-b64-Thinking \
  --trust_remote_code \
  --block_length 64 \
  --denoising_steps 64 \
  --temperature 0 \
  --dtype bfloat16 \
  --confidence_threshold 0.85
```
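To build intuition for how `--block_length`, `--denoising_steps`, and `--confidence_threshold` interact, here is a toy simulation. It is not the repository's decoding code: confidences are mocked with random draws, and the fallback of committing the single most confident position when nothing clears the threshold is an assumption made so the block always finishes.

```python
import random

def decode_block(block_length=64, denoising_steps=64, threshold=0.85, seed=0):
    """Toy model of threshold-based block decoding (mock confidences):
    each denoising step commits every masked position whose confidence
    reaches the threshold; if none qualify, the single most confident
    position is committed, so the block finishes within block_length steps."""
    rng = random.Random(seed)
    masked = set(range(block_length))
    steps = 0
    while masked and steps < denoising_steps:
        confidences = {pos: rng.random() for pos in masked}
        committed = {p for p, c in confidences.items() if c >= threshold}
        if not committed:  # fallback: commit the most confident position
            committed = {max(confidences, key=confidences.get)}
        masked -= committed
        steps += 1
    return steps

steps_used = decode_block()
```

A lower threshold commits more positions per step (higher parallelism) at the risk of locking in low-confidence tokens; RCD's contribution is that the uncommitted distributions are reused rather than discarded.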

## Reproducing Results

We provide the full training and evaluation code to reproduce our results.

### Repository Layout

- `LLaDA-ref/`: Reference model (and baseline sequential denoising LLaDA model) code and configs.
- `LLaDA-target/`: Target model code and configs.
- `SDAR-ref/`: Reference model (and baseline sequential denoising SDAR models) code and configs.
- `SDAR-target/`: Target model code and configs.

### Installation

Each sub-project is self-contained and has its own environment:

- LLaDA reference: `./LLaDA-ref/README.md`
- LLaDA target: `./LLaDA-target/README.md`
- SDAR reference: `./SDAR-ref/README.md`
- SDAR target: `./SDAR-target/README.md`

### Evaluation

- LLaDA: `LLaDA-*/examples/llada/eval_openmathinstruct.sh`
- SDAR: `SDAR-*/eval_simple.sh`, `SDAR-*/eval_aime.sh`

### Training

Training recipes live in each sub-project:

- LLaDA: `LLaDA-*/examples/llada/run.sh`
- SDAR: `SDAR-*/run.sh`

## Citation

```bibtex
@misc{hu2026residualcontextdiffusionlanguage,
  title={Residual Context Diffusion Language Models},
  author={Yuezhou Hu and Harman Singh and Monishwaran Maheswaran and Haocheng Xi and Coleman Hooper and Jintao Zhang and Aditya Tomar and Michael W. Mahoney and Sewon Min and Mehrdad Farajtabar and Kurt Keutzer and Amir Gholami and Chenfeng Xu},
  year={2026},
  eprint={2601.22954},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2601.22954},
}
```

## Footnotes

1. We observed potential data contamination in the original Chat models on GSM8K, which may inflate the Chat baseline.
2. Chat variants are instruction-following models; SeqD/RCD are further adapted for mathematical reasoning.
3. LLaDA-Base results are taken from the original LLaDA paper.
