Focus-dLLM: Accelerating Long-Context Diffusion LLM Inference via Confidence-Guided Context Focusing
1 Beihang University, 2 Hong Kong University of Science and Technology, 3 SenseTime Research
[Paper] | [Code]
- (02/2025) Focus-dLLM is officially announced. The core implementation and framework details will be released in this repository.
- Supported Models: LLaDA-8B, UltraLLaDA, Dream-7B.
Focus-dLLM is a training-free attention sparsification framework tailored for accurate and efficient long-context inference in Diffusion Large Language Models (dLLMs).
While dLLMs introduce a compelling non-autoregressive paradigm via iterative denoising, the considerable computational cost of bidirectional full attention limits their inference efficiency. Existing sparse attention methods remain ineffective for dLLMs because they require estimating attention importance for tokens yet to be decoded, yet the positions that will be unmasked are unknown during the diffusion process.
To address this, Focus-dLLM introduces a novel pipeline that accurately predicts unmasked regions and retains only the necessary computation. Our approach delivers a lossless speedup of more than 29× at 32K context length, offering a superior Pareto frontier between throughput and generation quality.
Figure 1: Overview of Focus-dLLM. We employ a past confidence-guided indicator to predict unmasked positions and leverage a sink-aware pruning strategy to dynamically identify and reuse attention sinks.
We discover that token confidence exhibits a strong positive correlation across adjacent denoising steps. Based on this, we design an indicator that uses confidence scores from the previous denoising step to predict which positions will be unmasked at the current step.
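As a rough illustration of such a confidence-guided indicator, the sketch below ranks still-masked positions by their previous-step confidence and predicts the top-k as the next positions to be unmasked. The function name, arguments, and the top-k selection rule are our own assumptions for illustration, not the exact formulation used by Focus-dLLM:

```python
import numpy as np

def predict_unmask_positions(prev_confidence, mask, k):
    """Hypothetical sketch: exploit the correlation of token confidence
    across adjacent denoising steps by picking the k masked positions
    whose confidence at the previous step was highest."""
    # Only still-masked positions are candidates for unmasking.
    masked_idx = np.flatnonzero(mask)
    # Rank candidates by previous-step confidence, descending.
    order = np.argsort(prev_confidence[masked_idx])[::-1]
    return masked_idx[order[:k]]
```

At inference time, the predicted positions would tell the sparse-attention kernel which query rows actually need full-quality computation at this step.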
To eliminate redundant computation, we propose a sink-aware pruning strategy. This mechanism:
- Dynamic Attention Sink Identification: Explicitly identifies and retains attention sinks to preserve generation quality.
- Cross-Layer Consistency: Leverages the observation that attention-sink locations are largely consistent across layers. We identify sinks at an intermediate layer and reuse their locations for subsequent sparse layers, avoiding repeated re-identification overhead.
- Block-wise Pruning: Applies dynamic pruning to Key/Value states of the prompt tokens to keep the most relevant history while retaining all response tokens.
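The three mechanisms above can be sketched together as a single KV-selection step. In this illustrative sketch (all names, the per-key mass heuristic, and the block-scoring rule are our assumptions, not the official implementation), sinks are the prompt keys receiving the most attention mass at an intermediate layer, and prompt KV states are then kept block-wise by total mass; response tokens are assumed to be retained in full by the caller:

```python
import numpy as np

def select_kv_blocks(attn, prompt_len, block_size, n_sinks, n_blocks):
    """Hypothetical sketch of sink-aware, block-wise KV pruning.
    attn: [num_queries, seq_len] attention weights from an intermediate layer.
    Returns sorted prompt positions whose Key/Value states are kept;
    the selection can be reused by subsequent sparse layers."""
    # Per-key attention mass, restricted to the prompt region.
    key_mass = attn[:, :prompt_len].sum(axis=0)
    # Attention sinks: the few keys that absorb the most mass.
    sinks = np.argsort(key_mass)[::-1][:n_sinks]
    # Score whole prompt blocks by their total mass and keep the top ones.
    n_full = prompt_len // block_size
    block_scores = key_mass[: n_full * block_size].reshape(n_full, block_size).sum(axis=1)
    top_blocks = np.argsort(block_scores)[::-1][:n_blocks]
    keep = set(sinks.tolist())
    for b in top_blocks:
        keep.update(range(b * block_size, (b + 1) * block_size))
    return sorted(keep)
```

Identifying sinks once and reusing the returned indices across later layers is what avoids the repeated re-identification overhead described above.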
Focus-dLLM achieves robust performance across LongBench. On UltraLLaDA, our method achieves the highest average score, outperforming vanilla baselines and existing acceleration frameworks.
Focus-dLLM demonstrates superior scalability. As context length grows, the speedup ratio expands significantly, reaching 29.6× at 32K context.
- Refactor and release the official implementation code.
- Support more dLLM architectures.
Our framework is built upon the open-source project Fast-dLLM. We also incorporate ideas and implementation details inspired by several pioneering works in the dLLM and sparse attention community:
Furthermore, our experiments leverage diffusion language models from:
We sincerely thank the authors for their open-source contributions, which greatly facilitated the development of Focus-dLLM.

