This repository (updated daily) provides a curated list of papers on Diffusion Large Language Models (dLLMs), a rapidly emerging field in generative AI. The collection is organized to track advancements from foundational theory to state-of-the-art applications.
The field is evolving quickly, and this list is a living document. We welcome community contributions. If you know of a relevant paper we've missed, please feel free to submit a pull request.
- Theoretical Basis & Discussions
- Foundation Model
- Inference Method
- Training Method
- Multimodal Model
- Variable Length
- Others
| Date | Title | Abstract | Link | Remark |
|---|---|---|---|---|
| 2025-02-13 | Theoretical Benefit and Limitation of Diffusion Language Model | - | Paper | NeurIPS 2025 |
| 2015-03-12 | Deep Unsupervised Learning using Nonequilibrium Thermodynamics | - | Paper | - |
| 2021-07-07 | Structured Denoising Diffusion Models in Discrete State-Spaces | - | Paper | - |
| 2023-10-15 | On the Reasoning Abilities of Masked Diffusion Language Models | <details><summary>Full Abstract</summary>Masked diffusion models (MDMs) for text offer a compelling alternative to traditional autoregressive language models. Parallel generation makes them efficient, but their computational capabilities and the limitations inherent to their parallelism remain largely unexplored. To this end, we characterize what types of reasoning problems MDMs can provably solve and how efficiently. We do this by connecting MDMs to the well-understood reasoning frameworks of chain of thought (CoT) and padded looped transformers (PLTs) in the finite-precision log-width setting: We show that MDMs and polynomially-padded PLTs are, in fact, equivalent in this setting, and that MDMs can solve all problems that CoT-augmented transformers can. Moreover, we showcase classes of problems (including regular languages) for which MDMs are inherently more efficient than CoT transformers, where parallel generation allows for substantially faster reasoning.</details> | Paper | Under review in ICLR'26 |
| 2023-10-25 | Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution | - | Paper | - |
| 2024-06-06 | Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data | - | Paper | - |
| 2024-06-06 | Simplified and Generalized Masked Diffusion for Discrete Data | - | Paper | - |
| 2024-06-11 | Simple and Effective Masked Diffusion Language Models | - | Paper | - |
| 2025-09-19 | Breaking AR’s Sampling Bottleneck: Provable Acceleration via Diffusion Language Models | - | Paper | NeurIPS 2025 |
| 2025-09-20 | Diffusion Language Models are Provably Optimal Parallel Samplers | <details><summary>Full Abstract</summary>Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive models for faster inference via parallel token generation. We provide a rigorous foundation for this advantage by formalizing a model of parallel sampling and showing that DLMs augmented with polynomial-length chain-of-thought (CoT) can simulate any parallel sampling algorithm using an optimal number of sequential steps. Consequently, whenever a target distribution can be generated using a small number of sequential steps, a DLM can be used to generate the distribution using the same number of optimal sequential steps. However, without the ability to modify previously revealed tokens, DLMs with CoT can still incur large intermediate footprints. We prove that enabling remasking (converting unmasked tokens to masks) or revision (converting unmasked tokens to other unmasked tokens) together with CoT further allows DLMs to simulate any parallel sampling algorithm with optimal space complexity. We further justify the advantage of revision by establishing a strict expressivity gap: DLMs with revision or remasking are strictly more powerful than those without. Our results not only provide a theoretical justification for the promise of DLMs as the most efficient sampler, but also advocate for why revisions should be enabled in DLMs.</details> | Paper | Under review in ICLR'26 |
| 2025-10-13 | Next Semantic Scale Prediction via Hierarchical Diffusion Language Models | - | Paper | NeurIPS 2025 |
| 2025-10-16 | Efficient Parallel Samplers for Recurrent-Depth Models and Their Connection to Diffusion Language Models | <details><summary>Full Abstract</summary>Language models with recurrent depth, also referred to as universal or looped when considering transformers, are defined by the capacity to increase their computation through the repetition of layers. Recent efforts in pretraining have demonstrated that these architectures can scale to modern language modeling tasks while exhibiting advantages in reasoning tasks. In this work, we examine the relationship between recurrent-depth models and diffusion language models. Building on their similarities, we develop a new diffusion forcing sampler for these models to accelerate generation. The sampler advances by decoding new tokens at every forward pass of the model, while the latent states of these tokens can be further refined in parallel through recurrence. Theoretically, generation with our sampler is strictly more expressive than the baseline autoregressive generation using the same time budget on modern hardware. Moreover, this sampler, based on principles from diffusion literature, can be directly applied to existing 3.5B recurrent-depth transformers without any tuning, leading to up to a 5x speedup. Consequently, our findings not only provide an efficient mechanism for parallelizing the extra computation in recurrent-depth models at inference, but also suggest that such models can be naturally viewed as strong continuous, though causal, diffusion language models.</details> | Paper | Under review in ICLR'26 |
| 2025-10-29 | Error Bounds and Optimal Schedules for Masked Diffusions with Factorized Approximations | - | Paper | - |
| 2025-09-20 | Scaling Behavior of Discrete Diffusion Language Models | <details><summary>Full Abstract</summary>Modern LLM pre-training consumes vast amounts of compute and training data, making the scaling behavior, or scaling laws, of different models a key distinguishing factor. Discrete diffusion language models (DLMs) have been proposed as an alternative to autoregressive language models (ALMs). However, their scaling behavior has not yet been fully explored, with prior work suggesting that they require more data and compute to match the performance of ALMs. We study the scaling behavior of DLMs on different noise types by smoothly interpolating between masked and uniform diffusion while paying close attention to crucial hyperparameters such as batch size and learning rate. Our experiments reveal that the scaling behavior of DLMs strongly depends on the noise type and is considerably different from ALMs. While all noise types converge to similar loss values in compute-bound scaling, we find that uniform diffusion requires more parameters and less data for compute-efficient training compared to masked diffusion, making them a promising candidate in data-constrained training environments. We scale our uniform diffusion model up to 10B parameters trained for FLOPs, confirming the predicted scaling behavior and making it the largest publicly known uniform diffusion model to date. In the process of deriving the scaling laws, we reformulate the discrete diffusion ELBO in terms of signal-to-noise ratio, closing the gap to continuous diffusion theory and simplifying both theory and implementation. Training code and models are open-sourced: upon acceptance.</details> | Paper | Under review in ICLR'26 |
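
Several of the entries above analyze the evidence lower bound (ELBO) of masked (absorbing-state) diffusion. For orientation only, here is a minimal sketch of the continuous-time weighted masked cross-entropy bound in the style of *Simple and Effective Masked Diffusion Language Models*; the notation (masking schedule $\alpha_t$ decreasing from $\alpha_0 \approx 1$ to $\alpha_1 \approx 0$, mask token $\mathbf{m}$) is ours, and exact formulations differ across the papers listed.

```math
-\log p_\theta(\mathbf{x}) \;\le\; \int_0^1 \frac{\alpha_t'}{1-\alpha_t}\;
\mathbb{E}_{q(\mathbf{z}_t \mid \mathbf{x})}\Bigg[\sum_{\ell:\, \mathbf{z}_t^{\ell}=\mathbf{m}} \log p_\theta\!\big(\mathbf{x}^{\ell} \mid \mathbf{z}_t\big)\Bigg]\,\mathrm{d}t
```

Because $\alpha_t' \le 0$, each term is a non-negative, schedule-weighted cross-entropy over the currently masked positions, which is why masked diffusion training reduces in practice to a reweighted BERT-style masked prediction objective.
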
| Date | Title | Abstract | Link | Remark |
|---|---|---|---|---|
| 2025-12-05 | Understanding the Limitations of Diffusion LLMs through a Probabilistic Perspective | - | Notion | - |
| 2025-12-05 | Any-Order GPT as Masked Diffusion Model: Decoupling Formulation and Architecture | <details><summary>Full Abstract</summary>Large language models (LLMs) predominantly use autoregressive (AR) approaches, but masked diffusion models (MDMs) are emerging as viable alternatives. A key challenge in comparing AR and MDM paradigms is their typical architectural difference: AR models are often decoder-only, while MDMs have largely been encoder-only. This practice of changing both the modeling paradigm and architecture simultaneously makes direct comparisons unfair, as it's hard to distinguish whether observed differences stem from the paradigm itself or the architectural shift. This research evaluates MDMs within a decoder-only framework to: (1) equitably compare MDM (as Any-Order AR, or AO-AR) and standard AR paradigms. Our investigation suggests that the standard AO-AR objective, which averages over all token permutations, may benefit from refinement, as many permutations appear less informative compared to the language's inherent left-to-right structure. (2) Investigate architectural influences (decoder-only vs. encoder-only) within MDMs. We demonstrate that while encoder-only MDMs model a simpler conditional probability space, decoder-only MDMs can achieve dramatic generation speedups () and comparable perplexity with temperature annealing despite modeling a vastly larger space, highlighting key trade-offs. This work thus decouples core paradigm differences from architectural influences, offering insights for future model design. Code is available at this https URL.</details> | Paper | - |
| 2025-10-10 | Closing the Data-Efficiency Gap Between Autoregressive and Masked Diffusion LLMs | - | Paper | - |
| 2025-12-07 | From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs | <details><summary>Full Abstract</summary>Large language models (LLMs) excel at generation but dominant autoregressive (AR) decoding is inherently sequential, creating a throughput bottleneck. Diffusion Language Models (DLMs)--especially block-wise variants--enable parallel generation and intra-block bidirectional reasoning, yet training large DLMs from scratch is costly and wastes the knowledge in mature AR checkpoints. Prior "adaptation" attempts either modify logits or randomly grow attention masks to full-sequence diffusion, or simply transplant AR weights into a block-diffusion recipe, leaving a fundamental mismatch between AR causality and block-wise bidirectionality unaddressed. We reframe adaptation as a intra-paradigm path from AR to Block-Diffusion by viewing AR as Block-Diffusion with blocksize=1. Concretely, we design the pathway of adaptation as follows: we use a context-causal attention mask (causal in context, bidirectional only within the active block), an efficient parallel adaptation procedure, an auxiliary AR loss to maximize data utilization and retain pretrained knowledge, and gradual increment of the generation block size. The recipe integrates cleanly with masked block-diffusion and maintains train-inference consistency. Built on these components, NBDiff-7B (Base and Instruct) could inherit the long-context modeling and reasoning capabilities, and achieve state-of-the-art performance among the 7B-class DLMs, delivering strong gains on general-knowledge, math, and code benchmarks over strong baselines. These results demonstrate that principled AR-to-block-diffusion adaptation is an effective and compute-efficient alternative to training DLMs from scratch. Codes: https://github.com/YuchuanTian/NBDiff.</details> | Paper | - |
| 2025-12-11 | Scaling Behavior of Discrete Diffusion Language Models | <details><summary>Full Abstract</summary>Modern LLM pre-training consumes vast amounts of compute and training data, making the scaling behavior, or scaling laws, of different models a key distinguishing factor. Discrete diffusion language models (DLMs) have been proposed as an alternative to autoregressive language models (ALMs). However, their scaling behavior has not yet been fully explored, with prior work suggesting that they require more data and compute to match the performance of ALMs. We study the scaling behavior of DLMs on different noise types by smoothly interpolating between masked and uniform diffusion while paying close attention to crucial hyperparameters such as batch size and learning rate. Our experiments reveal that the scaling behavior of DLMs strongly depends on the noise type and is considerably different from ALMs. While all noise types converge to similar loss values in compute-bound scaling, we find that uniform diffusion requires more parameters and less data for compute-efficient training compared to masked diffusion, making them a promising candidate in data-bound settings. We scale our uniform diffusion model up to 10B parameters trained for</details> | Paper | - |
| 2025-12-27 | On the Role of Discreteness in Diffusion LLMs | <details><summary>Full Abstract</summary>Diffusion models offer appealing properties for language generation, such as parallel decoding and iterative refinement, but the discrete and highly structured nature of text challenges the direct application of diffusion principles. In this paper, we revisit diffusion language modeling from the view of diffusion process and language modeling, and outline five properties that separate diffusion mechanics from language-specific requirements. We first categorize existing approaches into continuous diffusion in embedding space and discrete diffusion over tokens. We then show that each satisfies only part of the five essential properties and therefore reflects a structural trade-off. Through analyses of recent large diffusion language models, we identify two central issues: (i) uniform corruption does not respect how information is distributed across positions, and (ii) token-wise marginal training cannot capture multi-token dependencies during parallel decoding. These observations motivate diffusion processes that align more closely with the structure of text, and encourage future work toward more coherent diffusion language models.</details> | Paper | - |
| Date | Title | Abstract | Link | Remark |
|---|---|---|---|---|
| 2025-02-13 | Theoretical Benefit and Limitation of Diffusion Language Model | - | Paper | NeurIPS 2025 |
| 2015-03-12 | Deep Unsupervised Learning using Nonequilibrium Thermodynamics | - | Paper | - |
| 2021-07-07 | Structured Denoising Diffusion Models in Discrete State-Spaces | - | Paper | - |
| 2023-10-25 | Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution | - | Paper | - |
| 2024-06-06 | Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data | - | Paper | - |
| 2024-06-06 | Simplified and Generalized Masked Diffusion for Discrete Data | - | Paper | - |
| 2024-06-11 | Simple and Effective Masked Diffusion Language Models | - | Paper | - |
| 2025-09-19 | Breaking AR’s Sampling Bottleneck: Provable Acceleration via Diffusion Language Models | - | Paper | NeurIPS 2025 |
| 2025-10-13 | Next Semantic Scale Prediction via Hierarchical Diffusion Language Models | - | Paper | NeurIPS 2025 |
| 2025-12-10 | 🚀🔥 LLaDA2.0: Scaling Up Diffusion Language Models to 100B | <details><summary>Full Abstract</summary>This paper presents LLaDA2.0 -- a tuple of discrete diffusion large language models (dLLM) scaling up to 100B total parameters through systematic conversion from auto-regressive (AR) models -- establishing a new paradigm for frontier-scale deployment. Instead of costly training from scratch, LLaDA2.0 upholds knowledge inheritance, progressive adaption and efficiency-aware design principle, and seamless converts a pre-trained AR model into dLLM with a novel 3-phase block-level WSD based training scheme: progressive increasing block-size in block diffusion (warm-up), large-scale full-sequence diffusion (stable) and reverting back to compact-size block diffusion (decay). Along with post-training alignment with SFT and DPO, we obtain LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B), two instruction-tuned Mixture-of-Experts (MoE) variants optimized for practical deployment. By preserving the advantages of parallel decoding, these models deliver superior performance and efficiency at the frontier scale. Both models were open-sourced.</details> | Paper | Huggingface, Code |
| Date | Title | Abstract | Link | Remark |
|---|---|---|---|---|
| 2025-09-24 | FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models | <details><summary>Full Abstract</summary>Autoregressive language models (ARMs) deliver strong likelihoods, but are inherently serial: they generate one token per forward pass, which limits throughput and inflates latency for long sequences. Diffusion Language Models (DLMs) parallelize across positions and thus appear promising for language generation, yet standard discrete diffusion typically needs hundreds to thousands of model evaluations to reach high quality, trading serial depth for iterative breadth. We introduce FS-DFM, Few-Step Discrete Flow-Matching. A discrete flow-matching model designed for speed without sacrificing quality. The core idea is simple: make the number of sampling steps an explicit parameter and train the model to be consistent across step budgets, so one big move lands where many small moves would. We pair this with a reliable update rule that moves probability in the right direction without overshooting, and with strong teacher guidance distilled from long-run trajectories. Together, these choices make few-step sampling stable, accurate, and easy to control. On language modeling benchmarks, FS-DFM with 8 sampling steps achieves perplexity parity with a 1,024-step discrete-flow baseline for generating 1,024 tokens using a similar-size model, delivering up to 128 times faster sampling and corresponding latency/throughput gains.</details> | Paper | Under review in ICLR'26 |
| 2025-11-23 | Breaking the Bottleneck with DiffuApriel: High-Throughput Diffusion LMs with Mamba Backbone | - | Paper | - |
| 2025-12-28 | WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference | <details><summary>Full Abstract</summary>Autoregressive (AR) generation is the standard decoding paradigm for Large Language Models (LLMs), but its token-by-token nature limits parallelism at inference time. Diffusion Language Models (DLLMs) offer parallel decoding by recovering multiple masked tokens per step; however, in practice they often fail to translate this parallelism into deployment speed gains over optimized AR engines (e.g., vLLM). A key reason is that many DLLMs rely on bidirectional attention, which breaks standard prefix KV caching and forces repeated contextualization, undermining efficiency. We propose WeDLM, a diffusion decoding framework built entirely on standard causal attention to make parallel generation prefix-cache friendly. The core idea is to let each masked position condition on all currently observed tokens while keeping a strict causal mask, achieved by Topological Reordering that moves observed tokens to the physical prefix while preserving their logical positions. Building on this property, we introduce a streaming decoding procedure that continuously commits confident tokens into a growing left-to-right prefix and maintains a fixed parallel workload, avoiding the stop-and-wait behavior common in block diffusion methods. Experiments show that WeDLM preserves the quality of strong AR backbones while delivering substantial speedups, approaching 3x on challenging reasoning benchmarks and up to 10x in low-entropy generation regimes; critically, our comparisons are against AR baselines served by vLLM under matched deployment settings, demonstrating that diffusion-style decoding can outperform an optimized AR engine in practice.</details> | Paper | - |
| Date | Title | Abstract | Link | Remark |
|---|---|---|---|---|
| 2022-05-27 | Diffusion-LM Improves Controllable Text Generation | - | Paper | - |
| 2025-05-30 | DLM-One: Diffusion Language Models for One-Step Sequence Generation | - | Paper | - |
| 2025-10-03 | Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner | <details><summary>Full Abstract</summary>Diffusion language models, especially masked discrete diffusion models, have achieved great success recently. While there are some theoretical and primary empirical results showing the advantages of latent reasoning with looped transformers or continuous chain-of-thoughts, continuous diffusion models typically underperform their discrete counterparts. In this paper, we argue that diffusion language models do not necessarily need to be in the discrete space. In particular, we prove that continuous diffusion models have stronger expressivity than discrete diffusions and looped transformers. We attribute the contradiction between the theoretical expressiveness and empirical performance to their practical trainability: while continuous diffusion provides intermediate supervision that looped transformers lack, they introduce additional difficulty decoding tokens into the discrete token space from the continuous representation space. We therefore propose Coevolutionary Continuous Discrete Diffusion (CCDD), which defines a joint multimodal diffusion process on the union of a continuous representation space and a discrete token space, leveraging a single model to simultaneously denoise in the joint space. By combining two modalities, CCDD is expressive with rich semantics in the latent space, as well as good trainability and sample quality with the help of explicit discrete tokens. We also propose effective architectures and advanced training/sampling techniques for CCDD, which reveals strong empirical performance in extensive language modeling experiments on real-world tasks.</details> | Paper | Under review in ICLR'26 |
| Date | Title | Abstract | Link | Remark |
|---|---|---|---|---|
| 2025-09-20 | LATTS: LAtent space Test Time Scaling for diffusion language models | <details><summary>Full Abstract</summary>Test-time scaling (TTS) improves the performance of autoregressive (AR) large language models by adding computation at inference. While the prominent sequential TTS enhances accuracy by inducing models to generate longer chain-of-thought (CoT) reasoning, its computational overhead emerges as a drawback. Meanwhile, diffusion large language models (DLLMs) have emerged as a promising alternative that offers parallel decoding and self-correction capabilities. However, existing sequential TTS methods are incompatible with modern masked DLLMs. This incompatibility arises from two fundamental constraints: (1) standard DLLMs operate holistically on fixed-length sequences, preventing the dynamic token-level expansion required for CoT without specific training, and (2) the intrinsic coupling between refinement (i.e., denoising) steps and sequence length in standard DLLM formulations, restricting effective extension without delicate designs. We introduce LATTS, a novel sequential TTS method for DLLMs that addresses the above challenges by operating in the latent embedding space. LATTS reframes CoT reasoning from a *spatial* process of extending sequence length to a *temporal* process that uses additional computation to extend the iterative self-refinement steps over the entire sequence's latent representation. Our evaluation on the LLaDA-Instruct model shows that: with a brief post-training phase, LATTS achieves notable improvements over SFT baselines on reasoning and code generation benchmarks with gains of +4.1% on GSM8K, +4.8% on MATH, +3.2% on MBPP, and an average of +4.6% on commonsense reasoning tasks with minimal additional inference computation. These results establish sequential TTS as a promising technique for optimizing DLLMs.</details> | Paper | Under review in ICLR'26 |
| 2025-10-17 | Planner and Executor: Collaboration between Discrete Diffusion And Autoregressive Models in Reasoning | - | Paper | - |
| 2025-10-31 | Diffuse Thinking: Exploring Diffusion Language Models as Efficient Thought Proposers for Reasoning | - | Paper | - |
| 2025-11-04 | Effective Test-Time Scaling of Discrete Diffusion through Iterative Refinement | - | Paper | - |
| 2025-11-04 | Lookahead Unmasking Elicits Accurate Decoding in Diffusion Language Models | <details><summary>Full Abstract</summary>Masked Diffusion Models (MDMs) as language models generate by iteratively unmasking tokens, yet their performance crucially depends on the inference time order of unmasking. Prevailing heuristics, such as confidence based sampling, are myopic: they optimize locally, fail to leverage extra test-time compute, and let early decoding mistakes cascade. We propose Lookahead Unmasking (LookUM), which addresses these concerns by reformulating sampling as path selection over all possible unmasking orders without the need for an external reward model. Our framework couples (i) a path generator that proposes paths by sampling from pools of unmasking sets with (ii) a verifier that computes the uncertainty of the proposed paths and performs importance sampling to subsequently select the final paths. Empirically, erroneous unmasking measurably inflates sequence level uncertainty, and our method exploits this to avoid error-prone trajectories. We validate our framework across six benchmarks, such as mathematics, planning, and coding, and demonstrate consistent performance improvements. LookUM requires only two to three paths to achieve peak performance, demonstrating remarkably efficient path selection. The consistent improvements on both LLaDA and post-trained LLaDA 1.5 are particularly striking: base LLaDA with LookUM rivals the performance of RL-tuned LLaDA 1.5, while LookUM further enhances LLaDA 1.5 itself showing that uncertainty based verification provides orthogonal benefits to reinforcement learning and underscoring the versatility of our framework. Code will be publicly released.</details> | Paper | Under review in ICLR'26 |
| 2025-11-12 | TiDAR: Think in Diffusion, Talk in Autoregression | - | Paper | - |
| Date | Title | Abstract | Link | Remark |
|---|---|---|---|---|
| 2025-05-21 | dKV-Cache: The Cache for Diffusion Language Models | - | Paper | NeurIPS 2025 |
| 2025-05-22 | dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching | - | GitHub | - |
| 2025-05-27 | FlashDLM: Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion | <details><summary>Full Abstract</summary>Diffusion language models offer parallel token generation and inherent bidirectionality, promising more efficient and powerful sequence modeling compared to autoregressive approaches. However, state-of-the-art diffusion models (e.g., Dream 7B, LLaDA 8B) suffer from slow inference. While they match the quality of similarly sized autoregressive (AR) models (e.g., Qwen2.5 7B, Llama3 8B), their iterative denoising requires multiple full-sequence forward passes, resulting in high computational costs and latency, particularly for long input prompts and long-context scenarios. Furthermore, parallel token generation introduces token incoherence problems, and current sampling heuristics suffer from significant quality drops with decreasing denoising steps. We address these limitations with two training-free techniques. First, we propose FreeCache, a Key-Value (KV) approximation caching technique that reuses stable KV projections across denoising steps, effectively reducing the computational cost of DLM inference. Second, we introduce Guided Diffusion, a training-free method that uses a lightweight pretrained autoregressive model to supervise token unmasking, dramatically reducing the total number of denoising iterations without sacrificing quality. We conduct extensive evaluations on open-source reasoning benchmarks, and our combined methods deliver an average of 12.14x end-to-end speedup across various tasks with negligible accuracy degradation. For the first time, diffusion language models achieve a comparable and even faster latency as the widely adopted autoregressive models. Our work successfully paved the way for scaling up the diffusion language model to a broader scope of applications across different domains.</details> | - | Under review in ICLR'26 |
| 2025-05-28 | Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding | - | - | - |
| 2025-06-02 | Esoteric Language Models | - | Paper | - |
| 2025-08-04 | Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction | - | Paper | - |
| 2025-10-13 | dInfer: An Efficient Inference Framework for Diffusion Language Models | - | Paper | - |
| 2025-10-16 | Attention Is All You Need for KV Cache in Diffusion LLMs | - | Paper | - |
| 2025-11-24 | Orchestrating Dual-Boundaries: An Arithmetic Intensity Inspired Acceleration Framework for Diffusion Language Models | - | Paper | - |
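
The caching entries above (dKV-Cache, dLLM-Cache, Fast-dLLM, FlashDLM/FreeCache, Sparse-dLLM, and others) share one observation: in block-wise dLLM decoding, the key/value projections of tokens outside the active block change slowly across denoising steps, so they can be computed once and refreshed only occasionally. The toy sketch below is only meant to make that shared pattern concrete; it is not the algorithm of any single paper, and all names (`compute_kv`, `denoise_step`, `REFRESH_EVERY`) are hypothetical stubs rather than a real API.

```python
# Illustrative sketch only, NOT the algorithm of any specific paper above.
# Pattern: cache context K/V once, reuse it across denoising steps of the
# active block, and refresh periodically to bound approximation drift.

from typing import Dict, List

MASK = -1  # placeholder id for a masked position


def compute_kv(tokens: List[int]) -> Dict[str, List[int]]:
    """Stub: pretend to run the transformer over `tokens` and cache their K/V."""
    return {"tokens": list(tokens)}


def denoise_step(block: List[int], cached_kv: Dict[str, List[int]], step: int) -> List[int]:
    """Stub: one denoising pass over the active block, attending to cached context."""
    # A real model would predict token distributions for the masked positions here.
    return [tok if tok != MASK else 1000 + step for tok in block]


prompt = [101, 7, 42, 9]           # known context tokens
BLOCK_SIZE, NUM_STEPS = 4, 8
REFRESH_EVERY = 4                  # recompute context K/V only every few steps

context = list(prompt)
kv = compute_kv(context)           # computed once, then reused across steps

for _block_id in range(2):         # decode two blocks left to right
    block = [MASK] * BLOCK_SIZE
    for step in range(NUM_STEPS):
        if step > 0 and step % REFRESH_EVERY == 0:
            kv = compute_kv(context)           # periodic refresh bounds the drift
        block = denoise_step(block, kv, step)  # only the active block is recomputed
    context += block                           # commit the block, extend the cache
    kv = compute_kv(context)

print(context)
```
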
| Date | Title | Abstract | Link | Remark |
|---|---|---|---|---|
| 2025-05-22 | Path Planning for Masked Diffusion Model Sampling | <details><summary>Full Abstract</summary>Any order generation of discrete data using masked diffusion models (MDMs) offers a compelling alternative to traditional autoregressive models, especially in domains that lack a natural causal ordering of data. However, current popular MDMs depart from their successful continuous diffusion model counterparts with simplified masked inference wherein unmasked tokens cannot be iteratively refined -- even if there is a mistake. In this paper, we extract the full power of MDMs by introducing a novel inference sampling strategy termed Path Planning (P2) that decomposes each generation step into two sub-stages: planning and denoising. Under P2, the planner at every step selects appropriate tokens that are marked to be updated, which can then be sampled using the denoiser. We demonstrate that P2 generalizes all existing sampling strategies for MDMs and critically enhances generative quality through the new capability of refining and updating existing unmasked tokens. We theoretically prove that P2 establishes a (new) expanded evidence lower bound (ELBO) on the log marginal likelihood of data. We instantiate P2 with a family of planners including: 1.) Self-Planning, 2.) BERT-Planning, and 3.) Trained-Planning with a learned planner leading to SOTA generative performance for MDMs on a suite of domains. Specifically, solely using P2 inference, we observe relative improvements of 22% in protein sequence foldability, 8% in RNA sequence pLDDT, 4% in math reasoning, 68% in story generation (ROUGE score), and 33% in code generation for the challenging pass@1 metric.</details> | Paper | Under review in ICLR'26 |
| 2025-05-22 | Remasking Discrete Diffusion Models with Inference-Time Scaling | - | Paper | - |
| 2025-05-22 | Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding | - | Paper | - |
| 2025-05-23 | Variational Autoencoding Discrete Diffusion with Enhanced Dimensional Correlations Modeling | - | Paper | - |
| 2025-05-27 | Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion | - | - | - |
| 2025-05-28 | Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding | - | - | - |
| 2025-05-30 | Accelerated Sampling from Masked Diffusion Models via Entropy Bounded Unmasking | - | Paper | - |
| 2025-05-31 | Accelerating Diffusion LLMs via Adaptive Parallel Decoding | - | Paper | - |
| 2025-06-12 | Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles | - | Paper | - |
| 2025-06-23 | Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models | <details><summary>Full Abstract</summary>Masked diffusion language models (MDLMs) promise fast, non-autoregressive text generation, yet existing samplers, which pick tokens to unmask based on model confidence, ignore interactions when unmasking multiple positions in parallel and effectively reduce to slow, autoregressive behavior. We propose the Dilated Unmasking Scheduler (DUS), an inference-only, planner-model-free method that partitions sequence positions into non-adjacent dilated groups and unmasked them in parallel so as to minimize an upper bound on joint entropy gain at each denoising step. By explicitly trading off the number of network calls against generation quality, DUS recovers most of the performance lost under traditional parallel unmasking strategies. Across math (GSM8K, MATH500), code (HumanEval, MBPP), general-knowledge (BBH, MMLU-Pro), and instruction following (IFEval) benchmarks, DUS outperforms confidence-based planners, without modifying the underlying denoiser, and reveals the true speed-quality frontier of MDLMs.</details> | Paper | Under review in ICLR'26 |
| 2025-07-11 | Inference-Time Scaling of Diffusion Language Models with Particle Gibbs Sampling | <details><summary>Full Abstract</summary>Discrete diffusion models have recently emerged as strong alternatives to autoregressive language models, matching their performance through large-scale training. However, inference-time control remains relatively underexplored. In this work, we study how to steer generation toward desired rewards without retraining the models. Prior methods typically resample or filter within a single denoising trajectory, optimizing rewards step-by-step without trajectory-level refinement. We introduce particle Gibbs sampling for diffusion language models (PG-DLM), a novel inference-time algorithm enabling trajectory-level refinement while preserving generation perplexity under reward optimization. PG-DLM constructs a Markov chain over full denoising trajectories and applies a conditional sequential Monte Carlo kernel to resample them. We derive theoretical guarantees for convergence, including asymptotic consistency and variance bounds. Within this framework, we further analyze trade-offs across four key axes for inference-time scaling under fixed budgets: iterations, samples, denoising steps, and reward estimation. Our analysis shows scaling iterations achieves the best reward-perplexity trade-off. Empirically, PG-DLM consistently outperforms prior methods using MDLM and LLaDA-8B as base models across a wide range of compute budgets for reward-guided generation tasks including toxicity and sentiment control as well as linguistic acceptability.</details> | Paper | Under review in ICLR'26 |
| 2025-07-24 | Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs | - | Paper | - |
| 2025-08-19 | DPad: Efficient Diffusion Language Models with Suffix Dropout | <details><summary>Full Abstract</summary>Diffusion-based Large Language Models (dLLMs) parallelize text generation by framing decoding as a denoising process, but suffer from high computational overhead since they predict all future suffix tokens at each step while retaining only a small fraction. We propose Diffusion Scratchpad (DPad), a training-free method that restricts attention to a small set of nearby suffix tokens, preserving fidelity while eliminating redundancy. DPad integrates two strategies: (i) a sliding window, which maintains a fixed-length suffix window, and (ii) distance-decay dropout, which deterministically removes distant suffix tokens before attention computation. This simple design is compatible with existing optimizations such as prefix caching and can be implemented with only a few lines of code. Comprehensive evaluations across multiple benchmarks on LLaDA-1.5 and Dream models demonstrate that DPad delivers up to 61.4× speedup over vanilla dLLMs while maintaining comparable accuracy, highlighting its potential for efficient and scalable long-sequence inference. Our code is available at this https URL.</details> | Paper | Under review in ICLR'26 |
| 2025-08-27 | Diffusion Language Models Know the Answer Before Decoding | <details><summary>Full Abstract</summary>Diffusion language models (DLMs) have recently emerged as an alternative to autoregressive approaches, offering parallel sequence generation and flexible token orders. However, their inference remains slower than that of autoregressive models, primarily due to the cost of bidirectional attention and the large number of refinement steps required for high quality outputs. In this work, we highlight and leverage an overlooked property of DLMs, early answer convergence: in many cases, the correct answer can be internally identified by half steps before the final decoding step, both under semi-autoregressive and random remasking schedules. For example, on GSM8K and MMLU, up to 97% and 99% of instances, respectively, can be decoded correctly using only half of the refinement steps. Building on this observation, we introduce Prophet, a training-free fast decoding paradigm that enables early commit decoding. Specifically, Prophet dynamically decides whether to continue refinement or to go "all-in" (i.e., decode all remaining tokens in one step), using the confidence gap between the top-2 prediction candidates as the criterion. It integrates seamlessly into existing DLM implementations, incurs negligible overhead, and requires no additional training. Empirical evaluations of LLaDA-8B and Dream-7B across multiple tasks show that Prophet reduces the number of decoding steps by up to 3.4x while preserving high generation quality. These results recast DLM decoding as a problem of when to stop sampling, and demonstrate that early decode convergence provides a simple yet powerful mechanism for accelerating DLM inference, complementary to existing speedup techniques. Our code is publicly available at this https URL.</details> | Paper | Under review in ICLR'26 |
| 2025-09-04 | Improving Diffusion Language Model Reasoning through Joint Search in Generation Order and Token Space | <details><summary>Full Abstract</summary>The order-agnostic generation of Diffusion Language Models (DLMs) presents a promising alternative to autoregressive models for complex reasoning. We model reasoning as traversals of a problem-specific graph of logical dependencies, and view DLM decoding as sampling trajectories from a joint space over generation orders and token values. We show that standard decoding heuristics such as low-confidence remasking collapse this reasoning space. To address this, we introduce Order-Token Search, an algorithm that jointly searches over token content and generation order. Its core is a likelihood estimation function that scores block-level denoising actions, enabling stable path pruning. This allows for efficient exploration of diverse reasoning trajectories. Extensive experiments on mathematical reasoning and planning benchmarks show that our method consistently outperforms baselines, matching or surpassing the gains of fully post-trained d1-LLaDA with diffu-GRPO on Countdown, GSM8K, and MATH500 (e.g. achieving a 13.7% absolute gain on Countdown). Our work establishes structured search as a key missing component for advancing reasoning in DLMs.</details> | Paper | Under review in ICLR'26 |
| 2025-10-08 | Accelerating Diffusion LLM Inference via Local Determinism Propagation | <details><summary>Full Abstract</summary>Diffusion large language models (dLLMs) represent a significant advancement in text generation, offering parallel token decoding capabilities. However, existing open-source implementations suffer from quality-speed trade-offs that impede their practical deployment. Conservative sampling strategies typically decode only the most confident token per step to ensure quality (i.e., greedy decoding), at the cost of inference efficiency due to repeated redundant refinement iterations--a phenomenon we term delayed decoding. Through systematic analysis of dLLM decoding dynamics, we characterize this delayed decoding behavior and propose a training-free adaptive parallel decoding strategy, named LocalLeap, to address these inefficiencies. LocalLeap is built on two fundamental empirical principles: local determinism propagation centered on high-confidence anchors and progressive spatial consistency decay. By applying these principles, LocalLeap identifies anchors and performs localized relaxed parallel decoding within bounded neighborhoods, achieving substantial inference step reduction through early commitment of already-determined tokens without compromising output quality. Comprehensive evaluation on various benchmarks demonstrates that LocalLeap achieves 6.94× throughput improvements and reduces decoding steps to just 14.2% of the original requirement, achieving these gains with negligible performance impact. The source codes are available at: this https URL.</details> | Paper | Under review in ICLR'26 |
| 2025-10-13 | Mask Tokens as Prophet: Fine-Grained Cache Eviction for Efficient dLLM Inference | - | Paper | - |
| 2025-10-09 | Guided Star-Shaped Masked Diffusion | - | Paper | - |
| 2025-10-21 | Planned Diffusion | - | Paper | - |
| 2025-10-21 | How Efficient Are Diffusion Language Models? A Critical Examination of Efficiency Evaluation Practices | - | Paper | - |
| 2025-10-20 | Saber: An Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model | - | Paper | - |
| 2025-10-22 | Beyond Masked and Unmasked: Discrete Diffusion Models via Partial Masking | <details><summary>Full Abstract</summary>Masked diffusion models (MDM) are powerful generative models for discrete data that generate samples by progressively unmasking tokens in a sequence. Each token can take one of two states: masked or unmasked. We observe that token sequences often remain unchanged between consecutive sampling steps; consequently, the model repeatedly processes identical inputs, leading to redundant computation. To address this inefficiency, we propose the Partial masking scheme (Prime), which augments MDM by allowing tokens to take intermediate states interpolated between the masked and unmasked states. This design enables the model to make predictions based on partially observed token information, and facilitates a fine-grained denoising process. We derive a variational training objective and introduce a simple architectural design to accommodate intermediate-state inputs. Our method demonstrates superior performance across a diverse set of generative modeling tasks. On text data, it achieves a perplexity of 15.36 on OpenWebText, outperforming previous MDM (21.52), autoregressive models (17.54), and their hybrid variants (17.58), without relying on an autoregressive formulation. On image data, it attains competitive FID scores of 3.26 on CIFAR-10 and 6.98 on ImageNet-32, comparable to leading continuous generative models.</details> | Paper | Under review in ICLR'26 |
| 2025-11-03 | Beyond Static Cutoffs: One-Shot Dynamic Thresholding for Diffusion Language Models | - | Paper | - |
| 2025-11-07 | KLASS: KL-Guided Fast Inference in Masked Diffusion Models | - | Paper | - |
| 2025-11-22 | WavefrontDiffusion: Dynamic Decoding Schedule for Improved Reasoning | <details><summary>Full Abstract</summary>Diffusion Language Models (DLMs) have shown strong potential for text generation and are becoming a competitive alternative to autoregressive models. The denoising strategy plays an important role in determining the quality of their outputs. Mainstream denoising strategies include Standard Diffusion and BlockDiffusion. Standard Diffusion performs global denoising without restricting the update range, often finalizing incomplete context and causing premature end-of-sequence predictions. BlockDiffusion updates fixed-size blocks in a preset order, but its rigid structure can break apart coherent semantic units and disrupt reasoning. We present WavefrontDiffusion, a dynamic decoding approach that expands a wavefront of active tokens outward from finalized positions. This adaptive process follows the natural flow of semantic structure while keeping computational cost equal to block-based methods. Across four benchmarks in reasoning and code generation, WavefrontDiffusion achieves state-of-the-art performance while producing outputs with higher semantic fidelity, showing the value of adaptive scheduling for more coherent and efficient generation.</details> | Paper | Under review in ICLR'26 |
| 2025-11-26 | From Bits to Rounds: Parallel Decoding with Exploration for Diffusion Language Models | - | Paper | - |
| 2025-12-03 | Decoding Large Language Diffusion Models with Foreseeing Movement | <details><summary>Full Abstract</summary>Large Language Diffusion Models (LLDMs) benefit from a flexible decoding mechanism that enables parallelized inference and controllable generations over autoregressive models. Yet such flexibility introduces a critical challenge: inference performance becomes highly sensitive to the decoding order of tokens. Existing heuristic methods, however, focus mainly on local effects while overlooking long-term impacts. To address this limitation, we propose the Foreseeing Decoding Method (FDM), a novel approach that integrates both local and global considerations to unlock the full potential, employing a search-based strategy to enable effective optimization in discrete spaces. Furthermore, by analyzing the consistency of chosen tokens in the full decoding process, we develop a variant, FDM with Acceleration (FDM-A), which restricts deep exploration to critical steps identified as the exploration and balance circumstances. Extensive experiments across diverse benchmarks and model architectures validate the scalability of FDM and demonstrate the superior efficiency-performance trade-off achieved by FDM-A. Our work might potentially provide a principled step toward more powerful decoding methods for LLDMs.</details> | Paper | - |
| 2025-12-08 | Improving the Throughput of Diffusion-based Large Language Models via a Training-Free Confidence-Aware Calibration | <details><summary>Full Abstract</summary>We present CadLLM, a training-free method to accelerate the inference throughput of diffusion-based LLMs (dLLMs). We first investigate the dynamic nature of token unmasking confidence across blocks and steps. Based on this observation, we present a lightweight adaptive approach that controls the generation block size, step size, and threshold based on the average confidence of unmasked tokens. We further reduce softmax overhead by dynamically leveraging a subset of the vocabulary to regulate sampling breadth. CadLLM is a plug-and-play, model-agnostic method compatible with KV-cache-based dLLMs. Extensive experiments on four popular tasks demonstrate that CadLLM yields up to 2.28x throughput improvement over the state-of-the-art baseline with competitive accuracy.</details> | Paper | - |
| 2025-12-13 | Diffusion Language Model Inference with Monte Carlo Tree Search | <details><summary>Full Abstract</summary>Diffusion language models (DLMs) have recently emerged as a compelling alternative to autoregressive generation, offering parallel generation and improved global coherence. During inference, DLMs generate text by iteratively denoising masked sequences in parallel; however, determining which positions to unmask and which tokens to commit forms a large combinatorial search problem. Existing inference methods approximate this search using heuristics, which often yield suboptimal decoding paths; other approaches instead rely on additional training to guide token selection. To introduce a principled search mechanism for DLMs inference, we introduce MEDAL, a framework that integrates Monte Carlo Tree SEarch initialization for Diffusion LAnguage Model inference. We employ Monte Carlo Tree Search at the initialization stage to explore promising unmasking trajectories, providing a robust starting point for subsequent refinement. This integration is enabled by restricting the search space to high-confidence actions and prioritizing token choices that improve model confidence over remaining masked positions. Across multiple benchmarks, MEDAL achieves up to 22.0% improvement over existing inference strategies, establishing a new paradigm for search-based inference in diffusion language models.</details> | Paper | - |
| 2025-12-18 | LoPA: Scaling dLLM Inference via Lookahead Parallel Decoding | <details><summary>Full Abstract</summary>Diffusion Large Language Models (dLLMs) have demonstrated significant potential for high-speed inference. However, current confidence-driven decoding strategies are constrained by limited parallelism, typically achieving only 1--3 tokens per forward pass (TPF). In this work, we identify that the degree of parallelism during dLLM inference is highly sensitive to the Token Filling Order (TFO). Then, we introduce Lookahead PArallel Decoding (LoPA), a training-free, plug-and-play algorithm, to identify a superior TFO and hence accelerate inference. LoPA concurrently explores distinct candidate TFOs via parallel branches, and selects the one with the highest potential for future parallelism based on branch confidence. We apply LoPA to the state-of-the-art D2F model and observe a substantial enhancement in decoding efficiency. Notably, LoPA increases the TPF of D2F-Dream to 10.1 on the GSM8K while maintaining performance superior to the Dream baseline. Furthermore, to facilitate this unprecedented degree of parallelism, we develop a specialized multi-device inference system featuring Branch Parallelism (BP), which achieves a single-sample throughput of 1073.9 tokens per second under multi-GPU deployment. The code is available at https://github.com/zhijie-group/LoPA.</details> | Paper | - |
| 2025-12-22 | Context-Aware Initialization for Reducing Generative Path Length in Diffusion Language Models | <details><summary>Full Abstract</summary>Diffusion Large Language Models (DLLMs) enable fully parallel token decoding but often remain impractical at inference time due to the many denoising iterations required to refine an information-free, fully masked initialization into coherent text. Most existing acceleration methods focus on traversing this generative trajectory more efficiently via improved solvers or sampling strategies. We advance a complementary perspective: shorten the trajectory itself by starting closer to the target distribution through context-aware initialization. We propose a training-free interface that injects prompt-conditioned priors from a lightweight auxiliary model into the diffusion initialization, and instantiate it with two mechanisms: discrete token injection and representation-level embedding interpolation. Because injected priors can be imperfect and unmask-only decoding can over-commit early, we also introduce a simple confidence-based remasking mechanism as a form of prior skepticism. Preliminary evidence on GSM8K suggests that context-aware initialization can substantially reduce denoising iterations (about 35% fewer function evaluations in our setting), while also exposing a key open challenge: naive warm-starting can degrade final accuracy relative to strong diffusion baselines. We use these findings to motivate a research agenda around calibration, revision mechanisms, and representation alignment for reliable warm-started diffusion decoding.</details> | Paper | - |
| 2026-01-05 | Deferred Commitment Decoding for Diffusion Language Models with Confidence-Aware Sliding Windows | <details><summary>Full Abstract</summary>Diffusion language models (DLMs) have recently emerged as a strong alternative to autoregressive models by enabling parallel text generation. To improve inference efficiency and KV-cache compatibility, prior work commonly adopts block-based diffusion, decoding tokens block by block. However, this paradigm suffers from a structural limitation that we term Boundary-Induced Context Truncation (BICT): undecoded tokens near block boundaries are forced to commit without access to nearby future context, even when such context could substantially reduce uncertainty. This limitation degrades decoding confidence and generation quality, especially for tasks requiring precise reasoning, such as mathematical problem solving and code generation. We propose Deferred Commitment Decoding (DCD), a novel, training-free decoding strategy that mitigates this issue. DCD maintains a confidence-aware sliding window over masked tokens, resolving low-uncertainty tokens early while deferring high-uncertainty tokens until sufficient contextual evidence becomes available. This design enables effective bidirectional information flow within the decoding window without sacrificing efficiency. Extensive experiments across multiple diffusion language models, benchmarks, and caching configurations show that DCD improves generation accuracy by 1.39% with comparable time on average compared to fixed block-based diffusion methods, with the most significant improvement reaching 9.0%. These results demonstrate that deferring token commitment based on uncertainty is a simple yet effective principle for improving both the quality and efficiency of diffusion language model decoding.</details> | Paper | - |
| 2025-12-07 | STDD: Spatio-Temporal Dynamics-Driven Token Refinement in Diffusion Language Models | <details><summary>Full Abstract</summary>Unlike autoregressive language models, diffusion language models (DLMs) generate text by iteratively denoising all token positions in parallel. At each timestep, the remasking strategy of a DLM selects low-priority tokens to defer their decoding, thereby improving both efficiency and output quality. However, mainstream remasking strategies rely on a single global confidence threshold, overlooking the temporal and spatial dynamics of individual tokens. Motivated by the redundant iterations and constrained parallelism introduced by fixed-threshold remasking, we propose a novel remasking approach that dynamically detects Temporal Variance and Spatial Deviance of each token, which reflect its convergence status and inter-token correlations. Using these signals, our method adaptively adjusts the confidence threshold for every token at every step. Empirical results show that our approach significantly improves the operational efficiency of DLMs across mainstream datasets, achieving speedups of up to 8.9 times while faithfully preserving generation quality.</details> | Paper | - |
| 2026-01-15 | Discrete Feynman-Kac Correctors | <details><summary>Full Abstract</summary>Discrete diffusion models have recently emerged as a promising alternative to the autoregressive approach for generating discrete sequences. Sample generation via gradual denoising or demasking processes allows them to capture hierarchical non-sequential interdependencies in the data. These custom processes, however, do not assume a flexible control over the distribution of generated samples. We propose Discrete Feynman-Kac Correctors, a framework that allows for controlling the generated distribution of discrete masked diffusion models at inference time. We derive Sequential Monte Carlo (SMC) algorithms that, given a trained discrete diffusion model, control the temperature of the sampled distribution (i.e. perform annealing), sample from the product of marginals of several diffusion processes (e.g. differently conditioned processes), and sample from the product of the marginal with an external reward function, producing likely samples from the target distribution that also have high reward. Notably, our framework does not require any training of additional models or fine-tuning of the original model. We illustrate the utility of our framework in several applications including: efficient sampling from the annealed Boltzmann distribution of the Ising model, improving the performance of language models for code generation and amortized learning, as well as reward-tilted protein sequence generation.</details> | Paper | - |
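
Most of the samplers in this table answer the same question: how many masked positions can be committed per forward pass without hurting quality, usually by thresholding per-token confidence. The toy sketch below is only meant to make that shared decoding pattern concrete; it is a generic loop in the spirit of several entries above (e.g., Fast-dLLM-style parallel decoding, LocalLeap-style anchoring, Prophet-style early commitment), not a faithful implementation of any one paper, and the `predict` stub and the 0.7 threshold are our own illustrative choices.

```python
# Illustrative sketch only: generic confidence-thresholded parallel unmasking.
# `predict` is a stub standing in for a real masked diffusion model.

import random

MASK = None
VOCAB = list(range(10))


def predict(tokens):
    """Stub model: return {position: (predicted_token, confidence)} for masked positions."""
    return {i: (random.choice(VOCAB), random.random())
            for i, tok in enumerate(tokens) if tok is MASK}


def parallel_unmask(length=16, threshold=0.7, max_steps=50):
    tokens = [MASK] * length
    for step in range(1, max_steps + 1):
        preds = predict(tokens)
        if not preds:                       # nothing left to unmask
            return tokens, step - 1
        # Commit every position whose confidence clears the threshold ...
        committed = [i for i, (_, conf) in preds.items() if conf >= threshold]
        # ... but always commit at least the single most confident position,
        # so the loop makes progress even when nothing clears the bar.
        if not committed:
            committed = [max(preds, key=lambda i: preds[i][1])]
        for i in committed:
            tokens[i] = preds[i][0]
    return tokens, max_steps


if __name__ == "__main__":
    random.seed(0)
    seq, steps = parallel_unmask()
    print(f"decoded {len(seq)} tokens in {steps} parallel steps")
```

Raising the threshold recovers near-sequential (one token per step) decoding, while lowering it trades quality for fewer forward passes, which is the speed-quality frontier many of the papers above study.
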
| Date | Title | Abstract | Link | Remark |
|---|---|---|---|---|
| 2025-03-10 | Discrete Diffusion Language Model for Efficient Text Summarization | <details><summary>Full Abstract</summary>While diffusion models excel at conditional generating high-quality images, prior works in discrete diffusion models were not evaluated on conditional long-text generation. In this work, we address the limitations of prior discrete diffusion models for conditional long-text generation, particularly in long sequence-to-sequence tasks such as abstractive summarization. Despite fast decoding speeds compared to autoregressive methods, previous diffusion models failed on the abstractive summarization task due to the incompatibility between the backbone architectures and the random noising process. To overcome these challenges, we introduce a novel semantic-aware noising process that enables Transformer backbones to handle long sequences effectively. Additionally, we propose CrossMamba, an adaptation of the Mamba model to the encoder-decoder paradigm, which integrates seamlessly with the random absorbing noising process. Our approaches achieve state-of-the-art performance on three benchmark summarization datasets: Gigaword, CNN/DailyMail, and Arxiv, outperforming existing discrete diffusion models on ROUGE metrics as well as possessing much faster speed in inference compared to autoregressive models.</details> | Paper | NAACL 2025 |
| 2025-04-16 | d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning | - | Paper | - |
| 2025-05-15 | Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models | - | Paper | NeurIPS 2025 |
| 2025-05-24 | Anchored Diffusion Language Model | - | Paper | NeurIPS 2025 |
| 2025-05-25 | LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models | - | Paper | - |
| 2025-07-07 | wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models | - | Paper | Under review in ICLR'26 |
| 2025-07-25 | DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation | - | Paper | - |
| 2025-08-18 | MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models | - | - | - |
| 2025-08-27 | Blockwise SFT for Diffusion Language Models: Reconciling Bidirectional Attention and Autoregressive Decoding | Full AbstractDiscrete diffusion language models have shown strong potential for text generation, yet standard supervised fine-tuning (SFT) misaligns with their semi-autoregressive inference: training randomly masks tokens across the entire response, while inference generates fixed-size blocks sequentially. This mismatch introduces noisy prefixes and leaky suffixes, biasing gradients away from the desired blockwise likelihood. We propose Blockwise SFT, which partitions responses into fixed-size blocks, selects one active block per step for stochastic masking, freezes all preceding tokens, and fully hides future ones. Loss is computed only over the active block, directly mirroring the blockwise decoding process. Experiments on GSM8K, MATH, and MetaMathQA show consistent gains over classical SFT under equal compute or token budgets. Block size consistency studies and ablations confirm that improvements stem from faithful training-inference alignment rather than incidental masking effects. Our results highlight the importance of matching supervision granularity to the decoding procedure in diffusion-based language models. |
Paper | Under review in ICLR'26 (see the Blockwise SFT sketch after this table) |
| 2025-09-08 | Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models | - | - | - |
| 2025-09-12 | Inpainting-Guided Policy Optimization for Diffusion Large Language Models | - | - | - |
| 2025-09-20 | Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed | Full AbstractThe token-by-token decoding nature of autoregressive (AR) language models limits their generation throughput, especially in common memory-constrained scenarios. To address this, diffusion language models (dLMs) have emerged as a promising paradigm to enable parallel, non-autoregressive generation for higher throughput. However, existing dLMs have either failed to deliver faster speeds than AR models or have been restricted to small model scales due to high training costs, resulting in limited capability. To this end, we build on pretrained AR models and develop a training framework to convert them into dLMs that excel in speed. First, we introduce a continuous pretraining scheme with a block-wise attention pattern that remains causal across blocks while enabling bidirectional modeling within each block, which we find to better preserve pretrained models' abilities than the fully bidirectional modeling used in prior work such as Dream. Second, to mitigate the training–test gap in mask token distributions, we propose a position-dependent token masking strategy that assigns higher masking probabilities to later tokens. Leveraging this framework, we conduct extensive studies of dLMs’ attention patterns, training dynamics, and other design choices, providing actionable insights into scalable AR-to-dLM conversion. We also deliver the Efficient-DLM model family, which outperforms state-of-the-art AR models and dLMs with better accuracy–throughput trade-offs, e.g., Efficient-DLM 4B achieves +1.88% higher accuracy with 4.63x throughput compared to Dream 7B, and +7.79% accuracy with 1.82x throughput compared to Qwen3 1.7B. |
- | Under review in ICLR'26 |
| 2025-09-27 | Planner Aware Path Learning in Diffusion Language Models Training | Full AbstractDiffusion language models have emerged as a powerful alternative to autoregressive models, enabling fast inference through flexible and parallel generation paths. This flexibility is enabled by new sampling strategies, or planners, that iteratively choose where to denoise along the sequence rather than sampling uniformly at random. However, by modifying reverse paths, planners introduce a mismatch between the uniformly random denoising paths used during training and the planning-based paths used at inference. In this work, we systematically investigate this mismatch and theoretically show that the standard discrete diffusion training evidence lower bound (ELBO) does not accurately describe a denoiser under non-uniform planning. To bridge this gap, we derive a new Planned Evidence Lower Bound (P-ELBO) that directly incorporates planner-based reverse dynamics into the training objective. Building on this, we propose Planner Aware Path Learning (PAPL), a simple and effective modification of the standard masked discrete diffusion loss that aligns training and inference under planned denoisers. Empirically, PAPL delivers consistent improvements across domains, including a 40% relative gain in protein sequence modeling, up to a 4x improvement in MAUVE for text generation, and a 23% relative gain in HumanEval pass@10 for code generation. |
Paper | Under review in ICLR'26 |
| 2025-09-20 | Consistent Diffusion Language Models | Full AbstractDiffusion-based language models (DLMs) have emerged as compelling alternatives to sequential autoregressive generation, offering the promise of parallel decoding. Yet existing discrete diffusion models require hundreds of refinement steps for high-quality text, undermining the efficiency gains of parallelism. We introduce the Consistent Diffusion Language Model (CDLM), a new family of generative models that brings the benefits of consistency training---enforcing agreement across noise levels to enable one- or few-step generation---to the discrete domain. Our approach leverages an exact closed-form formulation of discrete posteriors, providing a rigorous analogue to the missing probability-flow ODE in discrete space. This yields a multi-path consistency objective that, as we show, unifies and generalizes popular diffusion, consistency, and distillation methods in a single view. To ensure stability at scale, we introduce a set of principled design choices that prevent training pathologies like mode collapse. On conditional and unconditional text-generation benchmarks, CDLM establishes new state of the art as a single-stage model, consistently outperforming both base and distilled DLMs across sampling budgets. These results position CDLM as a new paradigm for efficient, scalable, and high-fidelity discrete generative modeling. We will be updating the code base under https://anonymous.4open.science/r/dlm-135B |
Paper | Under review in ICLR'26 |
| 2025-09-28 | d2: Improved Techniques for Training Reasoning Diffusion Language Models | - | Paper | Under review in ICLR'26 |
| 2025-09-28 | SparseD: Sparse Attention for Diffusion Language Models | Full AbstractWhile diffusion language models (DLMs) offer a promising alternative to autoregressive models (ARs), existing open-source DLMs suffer from high inference latency. This bottleneck is mainly due to the attention's quadratic complexity with respect to context length in computing all query-key pairs. Intuitively, to reduce this complexity, a natural strategy is to restrict attention to sparse patterns that retain only the most relevant connections. Such approaches are well-established in ARs, where attention follows fixed and clearly defined sparse patterns. However, in DLMs, we observe distinct sparsity behaviors: (1) attention patterns vary across heads, (2) attention patterns in each head remain highly similar across denoising steps, and (3) early denoising steps are critical for generation. These findings render sparse attention methods designed for ARs largely incompatible with DLMs, as they fail to capture head-specific structures and risk degrading generation when applied in early denoising steps. To address these challenges, we propose SparseD, a novel sparse attention method for DLMs. Leveraging the observations, SparseD only requires pre-computing head-specific sparse patterns one time, and reuses them across all steps. This prevents recomputing sparse patterns at each denoising step. Meanwhile, SparseD uses full attention in the early steps, then switches to sparse attention later to maintain generation quality. Together, these establish SparseD as a practical and efficient solution for deploying DLMs in long-context applications. Experimental results demonstrate that SparseD achieves lossless acceleration, delivering up to speedup over FlashAttention at a 64k context length with 1,024 denoising steps. |
Paper | Under review in ICLR'26 |
| 2025-09-28 | Don't Settle Too Early: Self-Reflective Remasking for Diffusion Language Models | Full AbstractMask-based Diffusion Language Models (DLMs) struggle to revise incorrect tokens: once a token is generated, it typically remains fixed. The key challenge is to identify potential errors in the inputs. In this paper, we propose Remasking-enabled Diffusion Language Model (RemeDi), a mask-based DLM that introduces remasking as another fundamental mechanism, enabling more flexible text refinement in diffusion-based text generation. To achieve this, RemeDi jointly predicts token distributions and per-token confidence scores at each step. The confidence scores determine which tokens to be unmasked after the current step, allowing the model to identify tokens with low quality and remask them. These remasked tokens can be resampled with richer context in subsequent steps. We design a remask-aware pipeline to train this ability, including supervised fine-tuning which teaches the model to detect and remask incorrect tokens in addition to predict mask tokens, and reinforcement learning which optimizes full generation trajectories toward higher rewards. Experiments show that RemeDi achieves the state-of-the-art results among open-source DLMs on multiple datasets. |
Paper | Under review in ICLR'26 |
| 2025-10-09 | Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization | Full AbstractDiffusion language models (DLMs) enable parallel, order-agnostic generation with iterative refinement, offering a flexible alternative to autoregressive large language models (LLMs). However, adapting reinforcement learning (RL) fine-tuning to DLMs remains an open challenge because of the intractable likelihood. Pioneering work such as diffu-GRPO estimated token-level likelihoods via one-step unmasking. While computationally efficient, this approach is severely biased. A more principled foundation lies in sequence-level likelihoods, where the evidence lower bound (ELBO) serves as a surrogate. Yet, despite this clean mathematical connection, ELBO-based methods have seen limited adoption due to the prohibitive cost of likelihood evaluation. In this work, we revisit ELBO estimation and disentangle its sources of variance. This decomposition motivates reducing variance through fast, deterministic integral approximations along a few pivotal dimensions. Building on this insight, we introduce Group Diffusion Policy Optimization (GDPO), a new RL algorithm tailored for DLMs. GDPO leverages simple yet effective Semi-deterministic Monte Carlo schemes to mitigate the variance explosion of ELBO estimators under vanilla double Monte Carlo sampling, yielding a provably lower-variance estimator under tight evaluation budgets. Empirically, GDPO achieves consistent gains over pretrained checkpoints and outperforms diffu-GRPO, one of the state-of-the-art baselines, on the majority of math, reasoning, and coding benchmarks. |
Paper | Under review in ICLR'26 |
| 2025-10-13 | SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models | - | Paper | - |
| 2025-10-14 | Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models | - | Paper | - |
| 2025-10-20 | Soft-Masked Diffusion Language Models | Full AbstractDiffusion models have demonstrated strong potential in language modeling, offering various advantages over traditional autoregressive approaches. Their ability to generate and revise entire responses in parallel enables faster generation and built-in self-correction mechanisms. Most modern diffusion-based language models employ masked diffusion, where decoding involves iteratively processing masked tokens based on a binary decision: either retaining the mask or replacing it with the predicted token. However, this binary choice discards valuable predictive information when the mask is retained. To address this limitation, we introduce soft-masking (SM), a novel method that dynamically blends the embedding of the mask token with the embeddings of the top- predicted tokens from the previous decoding step, for each retained mask. This provides the model with a more informative prior, preserving context from earlier computations and allowing partial information about masked tokens to propagate beyond a single step. We propose a training methodology that adapts a pretrained masked diffusion language model to incorporate SM. We demonstrate that continuing pretraining a 169M parameter model with SM leads to improved perplexity and MAUVE scores. Furthermore, we finetune two state-of-the-art diffusion models, Dream-7B and Dream-Coder-7B, with SM. SM consistently improves performance across multiple coding benchmarks, particularly in high-throughput settings. |
Paper | Under review in ICLR'26 |
| 2025-10-23 | Blockwise SFT for Diffusion Language Models: Reconciling Bidirectional Attention and Autoregressive Decoding | - | Paper | - |
| 2025-10-24 | MRO: Enhancing Reasoning in Diffusion Language Models via Multi-Reward Optimization | - | Paper | NeurIPS 2025 |
| 2025-10-26 | Encoder-Decoder Diffusion Language Models for Efficient Training and Inference | - | Paper | NeurIPS 2025 |
| 2025-11-24 | CDLM: Consistency Diffusion Language Models For Faster Sampling | - | Paper | - |
| 2025-11-26 | Masks Can Be Distracting: On Context Comprehension in Diffusion Language Models | Full AbstractMasked Diffusion Language Models (MDLMs) have recently emerged as a promising alternative to Autoregressive Language Models (ARLMs), leveraging a denoising objective that, in principle, should enable more uniform context utilisation. In this work, we examine the context comprehension abilities of MDLMs and uncover two key limitations. First, despite their more global training objective, similarly to ARLMS, MDLMs exhibit a strong locality bias: performance is highly sensitive to the position of relevant information within the input, favouring local over distant context. Second, we show that appending a large number of mask tokens--required for generation--can significantly degrade context comprehension. Through systematic ablations, we find that these masks act as distractors, reducing the model's ability to process relevant information. To address this, we introduce a mask-agnostic loss function that encourages predictions to remain invariant to the number of appended masks. Fine-tuning with this objective substantially mitigates the distracting effect of masks, improving robustness of MDLMs. Overall, our findings reveal critical limitations of the current MDLM training paradigm and provide actionable insights for building diffusion-based language models with stronger context comprehension. |
Paper | Under review in ICLR'26 |
| 2025-11-27 | C^2DLM: Causal Concept-Guided Diffusion Large Language Models | - | Paper | - |
| 2025-11-29 | EDIT: Early Diffusion Inference Termination for dLLMs Based on Dynamics of Training Gradients | - | Paper | - |
| 2025-11-26 | Beyond Confidence: Adaptive and Coherent Decoding for Diffusion Language Models | - | Paper | - |
| 2025-12-02 | Fast-Decoding Diffusion Language Models via Progress-Aware Confidence Schedules | - | Paper | - |
| 2025-12-03 | Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective | Full AbstractReinforcement Learning (RL) has proven highly effective for autoregressive language models, but adapting these methods to diffusion large language models (dLLMs) presents fundamental challenges. The core difficulty lies in likelihood approximation: while autoregressive models naturally provide token-level conditional probabilities essential for token-level RL objectives (e.g., GRPO), dLLMs generate sequences through iterative non-autoregressive denoising steps that lack this factorization. To address this fundamental mismatch, we propose ELBO-based Sequence-level Policy Optimization (ESPO), a principled RL framework that treats entire sequence generation as a single action and uses the ELBO as a tractable sequence-level likelihood proxy. Our method incorporates per-token normalization of importance ratios and robust KL-divergence estimation to ensure stable large-scale training. Extensive experiments on mathematical reasoning, coding, and planning tasks demonstrate that ESPO significantly outperforms token-level baselines, achieving dramatic improvements of 20-40 points on the Countdown task, while maintaining consistent gains on math and coding benchmarks. Our approach establishes sequence-level optimization as a principled and empirically effective paradigm for RL in dLLMs. Our code is available at https://github.com/ML-GSAI/ESPO. |
Paper | - |
| 2025-12-09 | Learning Unmasking Policies for Diffusion Language Models | Full AbstractDiffusion (Large) Language Models (dLLMs) now match the downstream performance of their autoregressive counterparts on many tasks, while holding the promise of being more efficient during inference. One particularly successful variant is masked discrete diffusion, in which a buffer filled with special mask tokens is progressively replaced with tokens sampled from the model's vocabulary. Efficiency can be gained by unmasking several tokens in parallel, but doing too many at once risks degrading the generation quality. Thus, one critical design aspect of dLLMs is the sampling procedure that selects, at each step of the diffusion process, which tokens to replace. Indeed, recent work has found that heuristic strategies such as confidence thresholding lead to both higher quality and token throughput compared to random unmasking. However, such heuristics have downsides: they require manual tuning, and we observe that their performance degrades with larger buffer sizes. In this work, we instead propose to train sampling procedures using reinforcement learning. Specifically, we formalize masked diffusion sampling as a Markov decision process in which the dLLM serves as the environment, and propose a lightweight policy architecture based on a single-layer transformer that maps dLLM token confidences to unmasking decisions. Our experiments show that these trained policies match the performance of state-of-the-art heuristics when combined with semi-autoregressive generation, while outperforming them in the full diffusion setting. We also examine the transferability of these policies, finding that they can generalize to new underlying dLLMs and longer sequence lengths. However, we also observe that their performance degrades when applied to out-of-domain data, and that fine-grained tuning of the accuracy-efficiency trade-off can be challenging with our approach. |
Paper | - |
| 2025-12-10 | d-TreeRPO: Towards More Reliable Policy Optimization for Diffusion Language Models | Full AbstractReliable reinforcement learning (RL) for diffusion large language models (dLLMs) requires both accurate advantage estimation and precise estimation of prediction probabilities. Existing RL methods for dLLMs fall short in both aspects: they rely on coarse or unverifiable reward signals, and they estimate prediction probabilities without accounting for the bias relative to the true, unbiased expected prediction probability that properly integrates over all possible decoding orders. To mitigate these issues, we propose \emph{d}-TreeRPO, a reliable RL framework for dLLMs that leverages tree-structured rollouts and bottom-up advantage computation based on verifiable outcome rewards to provide fine-grained and verifiable step-wise reward signals. When estimating the conditional transition probability from a parent node to a child node, we theoretically analyze the estimation error between the unbiased expected prediction probability and the estimate obtained via a single forward pass, and find that higher prediction confidence leads to lower estimation error. Guided by this analysis, we introduce a time-scheduled self-distillation loss during training that enhances prediction confidence in later training stages, thereby enabling more accurate probability estimation and improved convergence. Experiments show that \emph{d}-TreeRPO outperforms existing baselines and achieves significant gains on multiple reasoning benchmarks, including +86.2 on Sudoku, +51.6 on Countdown, +4.5 on GSM8K, and +5.3 on Math500. Ablation studies and computational cost analyses further demonstrate the effectiveness and practicality of our design choices. |
Paper | - |
| 2025-12-15 | ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding | Full AbstractAutoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency by elevating parallel decoding from the token level to a higher slot level, where each slot is a fixed-length, contiguous sub-sequence. This is achieved through an iterative ``plan-and-infill'' decoding process: a diffusion-based planning step first identifies a set of weakly dependent slots, and an autoregressive infilling step then decodes these selected slots in parallel. The slot-based design simultaneously unlocks full KV cache reuse with a unified causal framework and reduces the learning complexity from the token combination space to a manageable slot-level permutation space. Extensive experiments on seven diverse benchmarks show that ReFusion not only overwhelmingly surpasses prior MDMs with 34% performance gains and an over 18$\times$ speedup on average, but also bridges the performance gap to strong ARMs while maintaining a 2.33$\times$ average speedup. |
Paper | - |
| 2025-12-17 | Corrective Diffusion Language Models | Full AbstractDiffusion language models are structurally well-suited for iterative error correction, as their non-causal denoising dynamics allow arbitrary positions in a sequence to be revised. However, standard masked diffusion language model (MDLM) training fails to reliably induce this behavior, as models often cannot identify unreliable tokens in a complete input, rendering confidence-guided refinement ineffective. We study corrective behavior in diffusion language models, defined as the ability to assign lower confidence to incorrect tokens and iteratively refine them while preserving correct content. We show that this capability is not induced by conventional masked diffusion objectives and propose a correction-oriented post-training principle that explicitly supervises visible incorrect tokens, enabling error-aware confidence and targeted refinement. To evaluate corrective behavior, we introduce the Code Revision Benchmark (CRB), a controllable and executable benchmark for assessing error localization and in-place correction. Experiments on code revision tasks and controlled settings demonstrate that models trained with our approach substantially outperform standard MDLMs in correction scenarios, while also improving pure completion performance. Our code is publicly available at https://github.com/zhangshuibai/CDLM. |
Paper | - |
| 2025-12-24 | dUltra: Ultra-Fast Diffusion Language Models via Reinforcement Learning | Full AbstractMasked diffusion language models (MDLMs) offer the potential for parallel token generation, but most open-source MDLMs decode fewer than 5 tokens per model forward pass even with sophisticated sampling strategies. As a result, their sampling speeds are often comparable to AR + speculative decoding schemes, limiting their advantage over mainstream autoregressive approaches. Existing distillation-based accelerators (dParallel, d3LLM) finetune MDLMs on trajectories generated by a base model, which can become off-policy during finetuning and restrict performance to the quality of the base model's samples. We propose \texttt{dUltra}, an on-policy reinforcement learning framework based on Group Relative Policy Optimization (GRPO) that learns unmasking strategies for efficient parallel decoding. dUltra introduces an unmasking planner head that predicts per-token unmasking likelihoods under independent Bernoulli distributions. We jointly optimize the base diffusion LLM and the unmasking order planner using reward signals combining verifiable reward, distillation reward, and the number of unmasking steps. Across mathematical reasoning and code generation tasks, dUltra improves the accuracy--efficiency trade-off over state-of-the-art heuristic and distillation baselines, moving towards achieving ``diffusion supremacy'' over autoregressive models. |
Paper | - |
| 2025-12-25 | Co-GRPO: Co-Optimized Group Relative Policy Optimization for Masked Diffusion Model | Full AbstractRecently, Masked Diffusion Models (MDMs) have shown promising potential across vision, language, and cross-modal generation. However, a notable discrepancy exists between their training and inference procedures. In particular, MDM inference is a multi-step, iterative process governed not only by the model itself but also by various schedules that dictate the token-decoding trajectory (e.g., how many tokens to decode at each step). In contrast, MDMs are typically trained using a simplified, single-step BERT-style objective that masks a subset of tokens and predicts all of them simultaneously. This step-level simplification fundamentally disconnects the training paradigm from the trajectory-level nature of inference, leaving the inference schedules never optimized during training. In this paper, we introduce Co-GRPO, which reformulates MDM generation as a unified Markov Decision Process (MDP) that jointly incorporates both the model and the inference schedule. By applying Group Relative Policy Optimization at the trajectory level, Co-GRPO cooperatively optimizes model parameters and schedule parameters under a shared reward, without requiring costly backpropagation through the multi-step generation process. This holistic optimization aligns training with inference more thoroughly and substantially improves generation quality. Empirical results across four benchmarks-ImageReward, HPS, GenEval, and DPG-Bench-demonstrate the effectiveness of our approach. For more details, please refer to our project page: https://co-grpo.github.io/ . |
Paper | - |
| 2025-12-23 | DiRL: An Efficient Post-Training Framework for Diffusion Language Models | Full AbstractDiffusion Language Models (dLLMs) have emerged as promising alternatives to Auto-Regressive (AR) models. While recent efforts have validated their pre-training potential and accelerated inference speeds, the post-training landscape for dLLMs remains underdeveloped. Existing methods suffer from computational inefficiency and objective mismatches between training and inference, severely limiting performance on complex reasoning tasks such as mathematics. To address this, we introduce DiRL, an efficient post-training framework that tightly integrates FlexAttention-accelerated blockwise training with LMDeploy-optimized inference. This architecture enables a streamlined online model update loop, facilitating efficient two-stage post-training (Supervised Fine-Tuning followed by Reinforcement Learning). Building on this framework, we propose DiPO, the first unbiased Group Relative Policy Optimization (GRPO) implementation tailored for dLLMs. We validate our approach by training DiRL-8B-Instruct on high-quality math data. Our model achieves state-of-the-art math performance among dLLMs and surpasses comparable models in the Qwen2.5 series on several benchmarks. |
Paper | - |
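
A minimal sketch of the supervision described in the Blockwise SFT entry above (2025-08-27): the response is split into fixed-size blocks, one active block is stochastically masked, the preceding tokens stay visible and frozen, future blocks are hidden, and only the masked positions of the active block are supervised. The token ids, the `MASK` id, and the choice to hide future blocks by dropping them from the input are illustrative assumptions, not the paper's implementation.

```python
import random

MASK = 0  # hypothetical [MASK] token id

def blockwise_sft_example(prompt, response, block_size=4, mask_rate=None):
    """Build (input_ids, loss_positions, targets) for one training step."""
    blocks = [response[i:i + block_size] for i in range(0, len(response), block_size)]
    b = random.randrange(len(blocks))                 # pick one active block
    prefix = [t for blk in blocks[:b] for t in blk]   # frozen, fully visible prefix
    active = list(blocks[b])
    rate = mask_rate if mask_rate is not None else random.random()

    offset = len(prompt) + len(prefix)
    noisy_active, loss_pos = [], []
    for j, tok in enumerate(active):
        if random.random() < rate:                    # stochastic masking
            noisy_active.append(MASK)
            loss_pos.append(offset + j)               # supervise only masked slots
        else:
            noisy_active.append(tok)
    if not loss_pos:                                  # keep at least one supervised slot
        noisy_active[0] = MASK
        loss_pos.append(offset)

    # Future blocks are fully hidden: here they are simply left out of the input.
    input_ids = list(prompt) + prefix + noisy_active
    targets = {p: active[p - offset] for p in loss_pos}
    return input_ids, loss_pos, targets

if __name__ == "__main__":
    print(blockwise_sft_example(prompt=[101, 102], response=list(range(200, 212))))
```
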
| Date | Title | Abstract | Link | Remark |
|---|---|---|---|---|
| 2025-05-30 | DLM-One: Diffusion Language Models for One-Step Sequence Generation | Full AbstractThis paper introduces DLM-One, a score-distillation-based framework for one-step sequence generation with continuous diffusion language models (DLMs). DLM-One eliminates the need for iterative refinement by aligning the scores of a student model's outputs in the continuous token embedding space with the score function of a pretrained teacher DLM. We investigate whether DLM-One can achieve substantial gains in sampling efficiency for language modeling. Through comprehensive experiments on DiffuSeq -- a representative continuous DLM -- we show that DLM-One achieves up to ~500x speedup in inference time while maintaining competitive performance on benchmark text generation tasks used to evaluate the teacher models. We further analyze the method's empirical behavior across multiple datasets, providing initial insights into its generality and practical applicability. Our findings position one-step diffusion as a promising direction for efficient, high-quality language generation and broader adoption of continuous diffusion models operating in embedding space for natural language processing. |
Paper | Under review in ICLR'26 |
| 2025-09-02 | OneFlowSeq: Achieving One-Step Generation for Diffusion Language Models via Lightweight Distillation | Full AbstractAutoregressive models dominate Seq2Seq generation but suffer from slow, error-prone token-by-token decoding. Diffusion language models (DLMs) enable parallel refinement and global coherence, yet their iterative denoising requires hundreds of steps, limiting practicality. We propose OneFlowSeq, a novel framework that distills a powerful multi-step diffusion teacher (LLaDA-8B-Instruct) into a one-step generator via MeanFlow-based supervision and parameter-efficient prompt tuning. Our OneFlowSeq introduces a Jacobian-vector product signal that provides richer guidance than conventional distillation, allowing the student to not only match the 128-step teacher in terms of one-step generation quality. Experiments on paraphrasing, text simplification, and question generation benchmarks show that OneFlowSeq achieves state-of-the-art performance, while reducing trainable parameters by 1600x and delivering inference speeds orders of magnitude faster than both autoregressive and multi-step diffusion baselines. This work establishes one-step diffusion as a practical and scalable paradigm for Seq2Seq generation. |
Paper | Under review in ICLR'26 |
| 2025-09-20 | Dual Distillation of Trajectory and Guidance Knowledge for Faster Inference in Conditional Masked Diffusion Language Models | Full AbstractMasked diffusion language models (MDLMs) have emerged as a promising generative framework for natural language, owing to parallel non-autoregressive generation capabilities with iterative unmasking/denoising. However, typical MDLMs require a very large number of neural network function evaluations for effective inference, making them computationally expensive in many real-world NLP applications that rely on conditional sequence-to-sequence generation. In this work, we propose a two-stage distillation method for conditional MDLMs that distills knowledge of (i) classifier-free guidance as well as (ii) unmasking trajectory from the existing teacher MDLM into a student MDLM. This allows the student MDLM, during inference, to (i) reduce two forward passes, required by a classifier-free guided (teacher) MDLM, to a single pass, and (ii) drastically reduce the number of unmasking steps. In this way, by dual distillation of guidance and trajectory knowledge, our MDLM achieves speedups of up to 16x while virtually retaining the quality of generation. |
Paper | Under review in ICLR'26 |
| 2026-01-05 | CD4LM: Consistency Distillation and aDaptive Decoding for Diffusion Language Models | Full AbstractAutoregressive large language models achieve strong results on many benchmarks, but decoding remains fundamentally latency-limited by sequential dependence on previously generated tokens. Diffusion language models (DLMs) promise parallel generation but suffer from a fundamental static-to-dynamic misalignment: Training optimizes local transitions under fixed schedules, whereas efficient inference requires adaptive "long-jump" refinements through unseen states. Our goal is to enable highly parallel decoding for DLMs with low number of function evaluations while preserving generation quality. To achieve this, we propose CD4LM, a framework that decouples training from inference via Discrete-Space Consistency Distillation (DSCD) and Confidence-Adaptive Decoding (CAD). Unlike standard objectives, DSCD trains a student to be trajectory-invariant, mapping diverse noisy states directly to the clean distribution. This intrinsic robustness enables CAD to dynamically allocate compute resources based on token confidence, aggressively skipping steps without the quality collapse typical of heuristic acceleration. On GSM8K, CD4LM matches the LLaDA baseline with a 5.18x wall-clock speedup; across code and math benchmarks, it strictly dominates the accuracy-efficiency Pareto frontier, achieving a 3.62x mean speedup while improving average accuracy. Code is available at https://github.com/yihao-liang/CDLM |
Paper | - |
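
The CD4LM entry directly above pairs consistency distillation with confidence-adaptive decoding. The sketch below illustrates only the decoding side: at each refinement step, every masked position whose predicted confidence clears a threshold is committed in parallel, so easy spans finish in very few steps. The toy `predict` model and the threshold value are assumptions, not the paper's implementation.

```python
import random

MASK = None
VOCAB = ["the", "cat", "sat", "on", "a", "mat"]

def predict(seq):
    """Toy denoiser stand-in: return (token, confidence) for every position."""
    return [(random.choice(VOCAB), random.random()) for _ in seq]

def confidence_adaptive_decode(length=12, threshold=0.7, max_steps=50):
    seq = [MASK] * length
    for step in range(1, max_steps + 1):
        preds = predict(seq)
        masked = [i for i, t in enumerate(seq) if t is MASK]
        # Commit every masked position whose confidence clears the threshold.
        committed = [i for i in masked if preds[i][1] >= threshold]
        if not committed:
            # Always commit the single most confident position to make progress.
            committed = [max(masked, key=lambda i: preds[i][1])]
        for i in committed:
            seq[i] = preds[i][0]
        if MASK not in seq:
            return seq, step
    return seq, max_steps

if __name__ == "__main__":
    text, steps = confidence_adaptive_decode()
    print(steps, "steps:", " ".join(text))
```
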
| Date | Title | Abstract | Link | Remark |
|---|---|---|---|---|
| 2025-05-22 | LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning | - | Paper | - |
| 2025-05-22 | LaViDa: A Large Diffusion Language Model for Multimodal Understanding | - | Paper | - |
| 2025-05-22 | Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding | - | Paper | - |
| 2025-10-30 | Masked Diffusion Captioning for Visual Feature Learning | - | Paper | - |
| 2025-11-12 | MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation | Full AbstractWhile thinking-aware generation aims to improve performance on complex tasks, we identify a critical failure mode where existing sequential, autoregressive approaches can paradoxically degrade performance due to error propagation. To systematically analyze this issue, we propose ParaBench, a new benchmark designed to evaluate both text and image output modalities. Our analysis using ParaBench reveals that this performance degradation is strongly correlated with poor alignment between the generated reasoning and the final image. To resolve this, we propose a parallel multimodal diffusion framework that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory. The model is trained with supervised finetuning and then further optimized by Parallel Reinforcement Learning (ParaRL), a novel strategy that applies semantic rewards along the trajectory to enforce cross-modal consistency. Experiments validate that our approach significantly improves cross-modal alignment and semantic consistency, achieving a 6.9% improvement in Output Alignment on ParaBench compared to the state-of-the-art model, Bagel, establishing a more robust paradigm for thinking-aware image synthesis. |
Paper | Under review in ICLR'26 |
| 2025-12-04 | dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning | Full AbstractThe autonomous driving community is increasingly focused on addressing the challenges posed by out-of-distribution (OOD) driving scenarios. A dominant research trend seeks to enhance end-to-end (E2E) driving systems by integrating vision-language models (VLMs), leveraging their rich world knowledge and reasoning abilities to improve generalization across diverse environments. However, most existing VLMs or vision-language agents (VLAs) are built upon autoregressive (AR) models. In this paper, we observe that existing AR-based VLMs -- limited by causal attention and sequential token generation -- often fail to maintain consistency and controllability between high-level reasoning and low-level planning. In contrast, recent discrete diffusion VLMs equipped with bidirectional attention exhibit superior controllability and reliability through iterative denoising. Building on these observations, we introduce dVLM-AD, a diffusion-based vision-language model that unifies perception, structured reasoning, and low-level planning for end-to-end driving. Evaluated on nuScenes and WOD-E2E, dVLM-AD yields more consistent reasoning-action pairs and achieves planning performance comparable to existing driving VLM/VLA systems despite a modest backbone, outperforming AR-based baselines with a 9 percent improvement in behavior-trajectory consistency and a 6 percent increase in RFS on long-tail WOD-E2E scenarios. These results suggest a controllable and reliable pathway for scalable end-to-end driving. |
Paper | - |
| 2025-12-17 | DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models | Full AbstractIn recent multimodal research, the diffusion paradigm has emerged as a promising alternative to the autoregressive paradigm (AR), owing to its unique decoding advantages. However, due to the capability limitations of the base diffusion language model, the performance of the diffusion vision language model (dVLM) still lags significantly behind that of mainstream models. This leads to a simple yet fundamental question: Is it possible to construct dVLMs based on existing powerful AR models? In response, we propose DiffusionVL, a dVLM family that could be translated from any powerful AR models. Through simple fine-tuning, we successfully adapt AR pre-trained models into the diffusion paradigm. This approach yields two key observations: (1) The paradigm shift from AR-based multimodal models to diffusion is remarkably effective. (2) Direct conversion of an AR language model to a dVLM is also feasible, achieving performance competitive with LLaVA-style visual-instruction-tuning. Further, we introduce a block-decoding design into dVLMs that supports arbitrary-length generation and KV cache reuse, achieving a significant inference speedup. We conduct a large number of experiments. Despite training with less than 5% of the data required by prior methods, DiffusionVL achieves a comprehensive performance improvement-a 34.4% gain on the MMMU-Pro (vision) bench and 37.5% gain on the MME (Cog.) bench-alongside a 2x inference speedup. The model and code are released at https://github.com/hustvl/DiffusionVL. |
Paper | - (see the block-attention sketch after this table) |
| 2025-12-27 | Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone | Full AbstractWhile autoregressive Large Vision-Language Models (VLMs) have achieved remarkable success, their sequential generation often limits their efficacy in complex visual planning and dynamic robotic control. In this work, we investigate the potential of constructing Vision-Language Models upon diffusion-based large language models (dLLMs) to overcome these limitations. We introduce Dream-VL, an open diffusion-based VLM (dVLM) that achieves state-of-the-art performance among previous dVLMs. Dream-VL is comparable to top-tier AR-based VLMs trained on open data on various benchmarks but exhibits superior potential when applied to visual planning tasks. Building upon Dream-VL, we introduce Dream-VLA, a dLLM-based Vision-Language-Action model (dVLA) developed through continuous pre-training on open robotic datasets. We demonstrate that the natively bidirectional nature of this diffusion backbone serves as a superior foundation for VLA tasks, inherently suited for action chunking and parallel generation, leading to significantly faster convergence in downstream fine-tuning. Dream-VLA achieves top-tier performance of 97.2% average success rate on LIBERO, 71.4% overall average on SimplerEnv-Bridge, and 60.5% overall average on SimplerEnv-Fractal, surpassing leading models such as |
Paper | - |
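
The DiffusionVL entry above notes a block-decoding design that supports KV-cache reuse (see the note in that row). A common way to obtain this behavior in block-style diffusion decoders is an attention mask that is causal across fixed-size blocks but fully bidirectional within each block; the sketch below builds such a mask. The block size, sequence length, and the specific mask construction are illustrative assumptions rather than the paper's exact design.

```python
import numpy as np

def block_causal_mask(seq_len: int, block_size: int) -> np.ndarray:
    """True where position i may attend to position j: causal across blocks,
    bidirectional inside each block."""
    blk = np.arange(seq_len) // block_size      # block index of each position
    return blk[:, None] >= blk[None, :]         # i sees j iff j's block is not later

if __name__ == "__main__":
    print(block_causal_mask(seq_len=8, block_size=4).astype(int))
    # Rows 0-3 see only block 0 (bidirectional within it); rows 4-7 see blocks 0 and 1,
    # so the KV entries of completed blocks can be cached and reused.
```
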
| Date | Title | Abstract | Link | Remark |
|---|---|---|---|---|
| 2025-05-21 | MMaDA: Multimodal Large Diffusion Language Models | - | Paper | NeurIPS 2025 |
| Date | Title | Abstract | Link | Remark |
|---|---|---|---|---|
| 2025-08-09 | Whisfusion: Parallel ASR Decoding via a Diffusion Transformer | - | Paper | - |
| Date | Title | Abstract | Link | Remark |
|---|---|---|---|---|
| 2023-05-24 | David helps Goliath: Inference-Time Collaboration Between Small Specialized and Large General Diffusion LMs | - | Paper | - |
| 2025-06-10 | Edit Flows: Flow Matching with Edit Operations | - | Paper | - |
| 2025-07-15 | DreamOn: Diffusion Language Models For Code Infilling Beyond Fixed-Size Canvas | - | Paper | Under review in ICLR'26 |
| 2025-08-04 | Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models | - | Paper | - |
| 2025-08-31 | Any-Order Flexible Length Masked Diffusion | - | - | - |
| 2025-10-28 | Diffusion LLM with Native Variable Generation Lengths: Let [EOS] Lead the Way | - | Paper | - |
| 2025-09-28 | Sequential Diffusion Language Models | Full AbstractDiffusion language models (DLMs) have strong theoretical efficiency but are limited by fixed-length decoding and incompatibility with key-value (KV) caches. Block diffusion mitigates these issues, yet still enforces a fixed block size and requires expensive training. We introduce Next Sequence Prediction (NSP), which unifies next-token and next-block prediction, enabling the model to adaptively determine the generation length at each step. When the length is fixed to 1, NSP reduces to standard next-token prediction. Building on NSP, we propose Sequential Diffusion Language Model (SDLM), which can retrofit pre-trained autoregressive language models (ALMs) at minimal cost. Specifically, SDLM performs diffusion inference within fixed-size mask blocks, but dynamically decodes consecutive subsequences based on model confidence, thereby preserving KV-cache compatibility and improving robustness to varying uncertainty and semantics across the sequence. Experiments show that SDLM matches or surpasses strong autoregressive baselines using only 3.5M training samples, while achieving 2.1 higher throughput than Qwen-2.5. Notably, the SDLM-32B model delivers even more pronounced efficiency gains, demonstrating the strong scalability potential of our modeling paradigm. Project page and codes: this https URL |
Paper | Under review in ICLR'26 |
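
A minimal sketch of the confidence-gated, variable-length commit described in the SDLM entry directly above: the model denoises a fixed-size block of masks, but only the longest confident prefix of that block is accepted, so the number of tokens produced per step adapts to model confidence while left-to-right order (and hence KV-cache compatibility) is preserved. The toy `predict_block` function and the threshold are stand-ins, not the paper's code.

```python
import random

def predict_block(prefix, block_size):
    """Toy stand-in: propose (token, confidence) for each slot of the next block."""
    return [(f"tok{len(prefix) + i}", random.random()) for i in range(block_size)]

def sdlm_style_decode(target_len=20, block_size=4, threshold=0.5):
    out, steps = [], 0
    while len(out) < target_len:
        steps += 1
        proposal = predict_block(out, block_size)
        # Accept the longest prefix of the block whose confidence clears the bar.
        k = 0
        while k < block_size and proposal[k][1] >= threshold:
            k += 1
        k = max(k, 1)   # always commit at least one token so decoding advances
        out.extend(tok for tok, _ in proposal[:k])
    return out[:target_len], steps

if __name__ == "__main__":
    tokens, steps = sdlm_style_decode()
    print(f"{len(tokens)} tokens committed in {steps} block steps")
```
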
| Date | Title | Abstract | Link | Remark |
|---|---|---|---|---|
| 2025-08-12 | Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models | Full AbstractDiffusion large language models (dLLMs) generate text through iterative denoising, yet current decoding strategies discard rich intermediate predictions in favor of the final output. Our work here reveals a critical phenomenon, temporal oscillation, where correct answers often emerge in the middle process, but are overwritten in later denoising steps. To address this issue, we introduce two complementary methods that exploit temporal consistency: 1) Temporal Self-Consistency Voting, a training-free, test-time decoding strategy that aggregates predictions across denoising steps to select the most consistent output; and 2) a post-training method termed Temporal Consistency Reinforcement, which uses Temporal Semantic Entropy (TSE), a measure of semantic stability across intermediate predictions, as a reward signal to encourage stable generations. Empirical results across multiple benchmarks demonstrate the effectiveness of our approach. Using the negative TSE reward alone, we observe a remarkable average improvement of 24.7% on the Countdown dataset over an existing dLLM. Combined with the accuracy reward, we achieve absolute gains of 2.0% on GSM8K, 4.3% on MATH500, 6.6% on SVAMP, and 25.3% on Countdown, respectively. Our findings underscore the untapped potential of temporal dynamics in dLLMs and offer two simple yet effective tools to harness them. |
Paper | Under review in ICLR'26 (see the voting sketch after this table) |
| 2025-08-14 | Thinking Inside the Mask: In-Place Prompting in Diffusion LLMs | - | - | - |
| 2025-10-08 | Symbolic-Diffusion: Deep Learning Based Symbolic Regression with D3PM Discrete Token Diffusion | - | Paper | - |
| 2025-09-26 | Unveiling the Potential of Diffusion Large Language Model in Controllable Generation | - | Paper | - |
| 2025-10-17 | Attention Sinks in Diffusion Language Models | - | Paper | - |
| 2025-10-30 | Don't Let It Fade: Preserving Edits in Diffusion Language Models via Token Timestep Allocation | - | Paper | NeurIPS 2025 |
| 2025-10-31 | Diffusion LLMs are Natural Adversaries for any LLM | - | Paper | - |
| 2025-11-11 | DiffuGR: Generative Document Retrieval with Diffusion Language Models | - | Paper | - |
| 2025-11-12 | Branching Flows: Discrete, Continuous, and Manifold Flow Matching with Splits and Deletions | - | Paper | - |
| 2025-11-26 | Closed-Loop Transformers: Autoregressive Modeling as Iterative Latent Equilibrium | - | Paper | - |
| 2025-09-19 | STEAD: Robust Provably Secure Linguistic Steganography with Diffusion Language Model | - | Paper | NeurIPS 2025 |
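
A minimal sketch of temporal self-consistency voting as described in the "Time Is a Feature" entry above (see the note in that row): instead of keeping only the final denoising step, the answer extracted at every intermediate step gets a vote and the most consistent answer wins, which protects against correct answers being overwritten late in the trajectory. The toy trajectory and the `extract_answer` parser are illustrative assumptions.

```python
from collections import Counter

def extract_answer(decoded_text: str) -> str:
    """Hypothetical answer parser: take the last integer in the decoded text."""
    nums = [tok for tok in decoded_text.split() if tok.lstrip("-").isdigit()]
    return nums[-1] if nums else ""

def temporal_self_consistency_vote(intermediate_decodes):
    """`intermediate_decodes`: one fully decoded string per denoising step."""
    votes = Counter(extract_answer(t) for t in intermediate_decodes)
    votes.pop("", None)          # ignore steps with no parsable answer
    if not votes:
        return ""
    answer, _ = votes.most_common(1)[0]
    return answer

if __name__ == "__main__":
    trajectory = [
        "partial ... the total is 12",   # correct answer appears early
        "the total is 12",
        "the total is 7",                # later step overwrites it (temporal oscillation)
    ]
    print(temporal_self_consistency_vote(trajectory))  # -> "12"
```
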
| Date | Title | Abstract | Link | Remark |
|---|---|---|---|---|
| 2025-09-20 | Membership Inference Attacks Against Fine-tuned Diffusion Language Models | Full AbstractDiffusion Language Models (DLMs) represent a promising alternative to autoregressive language models, using bidirectional masked token prediction. Yet their susceptibility to privacy leakage via Membership Inference Attacks (MIA) remains critically underexplored. This paper presents the first systematic investigation of MIA vulnerabilities in DLMs. Unlike the autoregressive models' single fixed prediction pattern, DLMs' multiple maskable configurations exponentially increase attack opportunities. This ability to probe many independent masks dramatically improves detection chances. To exploit this, we introduce SAMA (Subset-Aggregated Membership Attack), which addresses the sparse signal challenge through robust aggregation. SAMA samples masked subsets across progressive densities and applies sign-based statistics that remain effective despite heavy-tailed noise. Through inverse-weighted aggregation prioritizing sparse masks' cleaner signals, SAMA transforms sparse memorization detection into a robust voting mechanism. Experiments on nine datasets show SAMA achieves 30% relative AUC improvement over the best baseline, with up to 8x improvement at low false positive rates. These findings reveal significant, previously unknown vulnerabilities in DLMs, necessitating the development of tailored privacy defenses. |
Paper | Under review in ICLR'26 (see the membership-score sketch after this table) |
| 2025-09-29 | Watermarking Diffusion Language Models | - | Paper | - |
| 2025-10-01 | Toward Safer Diffusion Language Models: Discovery and Mitigation of Priming Vulnerability | Full AbstractDiffusion language models (DLMs) generate tokens in parallel through iterative denoising, which can reduce latency and enable bidirectional conditioning. However, the safety risks posed by jailbreak attacks that exploit this inference mechanism are not well understood. In this paper, we reveal that DLMs have a critical vulnerability stemming from their iterative denoising process and propose a countermeasure. Specifically, our investigation shows that if an affirmative token for a harmful query appears at an intermediate step, subsequent denoising can be steered toward a harmful response even in aligned models. As a result, simply injecting such affirmative tokens can readily bypass the safety guardrails. Furthermore, we demonstrate that the vulnerability allows existing optimization-based jailbreak attacks to succeed on DLMs. Building on this analysis, we propose a novel safety alignment method tailored to DLMs that trains models to generate safe responses from contaminated intermediate states that contain affirmative tokens. Our experiments indicate that the proposed method significantly mitigates the vulnerability with minimal impact on task performance. Furthermore, our method improves robustness against conventional jailbreak attacks. Our work underscores the need for DLM-specific safety research. |
Paper | Under review in ICLR'26 |
| 2025-11-03 | Watermarking Discrete Diffusion Language Models | - | Paper | - |
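
A heavily simplified sketch of the subset-aggregated membership signal in the SAMA entry above (see the note in that row): masked subsets are drawn at several masking densities, each subset contributes only the sign of a calibrated per-subset score, and sparser subsets receive larger weights. The toy `masked_loss`, the calibration constant, and the density grid are assumptions; a real attack would query the fine-tuned model and a calibrated reference.

```python
import random

def masked_loss(sample_tokens, masked_positions):
    """Hypothetical per-subset loss of the fine-tuned model on the masked slots."""
    return random.uniform(0.0, 2.0)

def sama_style_score(sample_tokens, densities=(0.1, 0.25, 0.5),
                     subsets_per_density=8, calibration=1.0):
    n = len(sample_tokens)
    score = 0.0
    for rho in densities:
        k = max(1, int(rho * n))
        weight = 1.0 / rho                   # sparser masks carry cleaner signal
        for _ in range(subsets_per_density):
            positions = random.sample(range(n), k)
            loss = masked_loss(sample_tokens, positions)
            vote = 1.0 if loss < calibration else -1.0   # sign-based statistic
            score += weight * vote
    return score   # higher score -> more likely a training member

if __name__ == "__main__":
    print(sama_style_score(list(range(40))))
```
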
| Date | Title | Abstract | Link | Remark |
|---|---|---|---|---|
| 2025-06-17 | LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs | - | Paper | - |
| 2025-09-18 | Fast and Fluent Diffusion Language Models via Convolutional Decoding and Rejective Fine-tuning | - | Paper | NeurIPS 2025 |
| 2025-12-23 | MoE-DiffuSeq: Enhancing Long-Document Diffusion Models with Sparse Attention and Mixture of Experts | Full AbstractWe present MoE-DiffuSeq, a mixture of experts based framework for enhancing diffusion models in long document generation. Existing diffusion based text generation models, such as DiffuSeq, suffer from high computational cost and memory overhead when applied to extended sequences. To address these challenges, MoE-DiffuSeq integrates sparse attention with a mixture of experts architecture, enabling efficient and scalable long sequence modeling. Our approach introduces a customized sparse attention mechanism designed to reduce computational complexity while preserving text quality and coherence. In addition, we incorporate a soft absorbing state within the diffusion process to accelerate sequence reconstruction and improve generation precision. Extensive experiments demonstrate that MoE-DiffuSeq significantly improves training efficiency and sampling speed compared to existing diffusion models. These advantages are particularly effective for long document scenarios, including scientific article generation, code repository modeling, and long form dialogue generation. Benchmark results further show that MoE-DiffuSeq improves efficiency, speed, accuracy, and expressiveness, advancing the practical applicability of diffusion models for high quality long form text generation. |
Paper | - |
| Date | Title | Abstract | Link | Remark |
|---|---|---|---|---|
| 2025-09-27 | A2D: Any-Order, Any-Step Safety Alignment for Diffusion Language Models | Full AbstractDiffusion large language models (dLLMs) enable any-order generation, but this flexibility enlarges the attack surface: harmful spans may appear at arbitrary positions, and template-based prefilling attacks such as DIJA bypass response-level refusals. We introduce A2D (Any-Order, Any-Step Defense), a token-level alignment method that aligns dLLMs to emit an [EOS] refusal signal whenever harmful content arises. By aligning safety directly at the token-level under randomized masking, A2D achieves robustness to both any-decoding-order and any-step prefilling attacks under various conditions. It also enables real-time monitoring: dLLMs may begin a response but automatically terminate if unsafe continuation emerges. On safety benchmarks, A2D consistently prevents the generation of harmful outputs, slashing DIJA success rates from over 80% to near-zero (1.3% on LLaDA-8B-Instruct, 0.0% on Dream-v0-Instruct-7B), and thresholded [EOS] probabilities allow early rejection, yielding up to 19.3x faster safe termination. |
Paper | Under review in ICLR'26 (see the monitoring sketch after this table) |
| 2025-10-26 | Aligning Diffusion Language Models via Unpaired Preference Optimization | - | Paper | - |
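
A minimal sketch of the early-termination monitoring suggested by the A2D entry above (see the note in that row): at every denoising step the probability assigned to the aligned refusal [EOS] signal is checked, and generation terminates as soon as it crosses a threshold. The toy `denoise_step`, the `eos_probability` probe, and the threshold are illustrative assumptions, not the paper's alignment method itself.

```python
import random

def denoise_step(state):
    """Hypothetical one-step denoiser returning the updated partial generation."""
    return state + [f"tok{len(state)}"]

def eos_probability(state) -> float:
    """Hypothetical probability of the aligned refusal [EOS] signal at this step."""
    return random.random() * 0.2

def monitored_generate(max_steps=64, refuse_threshold=0.9):
    state = []
    for step in range(1, max_steps + 1):
        state = denoise_step(state)
        if eos_probability(state) >= refuse_threshold:
            return "[REFUSED]", step      # stop early instead of finishing the response
    return " ".join(state), max_steps

if __name__ == "__main__":
    print(monitored_generate())
```
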
| Date | Title | Abstract | Link | Remark |
|---|---|---|---|---|
| 2025-11-28 | Masked Diffusion for Generative Recommendation | - | Paper | - |
| Date | Title | Abstract | Link | Remark |
|---|---|---|---|---|
| 2025-12-17 | DEER: Draft with Diffusion, Verify with Autoregressive Models | Full AbstractEfficiency, as a critical practical challenge for LLM-driven agentic and reasoning systems, is increasingly constrained by the inherent latency of autoregressive (AR) decoding. Speculative decoding mitigates this cost through a draft-verify scheme, yet existing approaches rely on AR draft models (a.k.a., drafters), which introduce two fundamental issues: (1) step-wise uncertainty accumulation leads to a progressive collapse of trust between the target model and the drafter, and (2) inherently sequential decoding of AR drafters. Together, these factors cause limited speedups. In this paper, we show that a diffusion large language model (dLLM) drafters can naturally overcome these issues through its fundamentally different probabilistic modeling and efficient parallel decoding strategy. Building on this insight, we introduce DEER, an efficient speculative decoding framework that drafts with diffusion and verifies with AR models. To enable high-quality drafting, DEER employs a two-stage training pipeline to align the dLLM-based drafters with the target AR model, and further adopts single-step decoding to generate long draft segments. Experiments show DEER reaches draft acceptance lengths of up to 32 tokens, far surpassing the 10 tokens achieved by EAGLE-3. Moreover, on HumanEval with Qwen3-30B-A3B, DEER attains a 5.54x speedup, while EAGLE-3 achieves only 2.41x. Code, model, demo, etc, will be available at https://czc726.github.io/DEER/ |
Paper | - |
| 2025-12-23 | Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs | Full AbstractDiffusion Large Language Models (dLLMs) offer fast, parallel token generation, but their standalone use is plagued by an inherent efficiency-quality tradeoff. We show that, if carefully applied, the attributes of dLLMs can actually be a strength for drafters in speculative decoding with autoregressive (AR) verifiers. Our core insight is that dLLM's speed from parallel decoding drastically lowers the risk of costly rejections, providing a practical mechanism to effectively realize the (elusive) lengthy drafts that lead to large speedups with speculative decoding. We present FailFast, a dLLM-based speculative decoding framework that realizes this approach by dynamically adapting its speculation length. It "fails fast" by spending minimal compute in hard-to-speculate regions to shrink speculation latency and "wins big" by aggressively extending draft lengths in easier regions to reduce verification latency (in many cases, speculating and accepting 70 tokens at a time!). Without any fine-tuning, FailFast delivers lossless acceleration of AR LLMs and achieves up to 4.9$\times$ speedup over vanilla decoding, 1.7$\times$ over the best naive dLLM drafter, and 1.4$\times$ over EAGLE-3 across diverse models and workloads. We open-source FailFast at https://github.com/ruipeterpan/failfast. |
Paper | - |
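
A minimal sketch of the draft-with-diffusion, verify-with-AR loop shared by the DEER and FailFast entries above: a dLLM drafts a long block of tokens in one parallel pass, the autoregressive target model verifies them, and the longest agreed prefix is accepted plus one corrected token at the first mismatch. The toy `diffusion_draft` and `ar_next_token` functions are stand-ins, and the token-by-token verification loop stands in for what would be a single batched verifier pass in practice.

```python
import random

VOCAB = ["a", "b", "c", "d"]

def diffusion_draft(context, k):
    """Hypothetical dLLM drafter: propose k tokens in one parallel pass."""
    return [random.choice(VOCAB) for _ in range(k)]

def ar_next_token(context):
    """Hypothetical AR verifier/target model: greedy next token given context."""
    return random.choice(VOCAB)

def speculative_decode(prompt, new_tokens=16, draft_len=8):
    out = list(prompt)
    target_total = len(prompt) + new_tokens
    while len(out) < target_total:
        draft = diffusion_draft(out, draft_len)
        accepted, correction = [], None
        for tok in draft:                          # verify the draft left to right
            target = ar_next_token(out + accepted)
            if tok == target:
                accepted.append(tok)
            else:
                correction = target                # keep the verifier's token
                break
        out.extend(accepted)
        if correction is not None:
            out.append(correction)
    return out[:target_total]

if __name__ == "__main__":
    print(speculative_decode(["<s>"]))
```
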
| Date | Title | Abstract | Link | Remark |
|---|---|---|---|---|
| 2025-11-14 | LiteAttention: A Temporal Sparse Attention for Diffusion Transformers | - | Paper | - |
| 2025-11-18 | Diffusion As Self-Distillation: End-to-End Latent Diffusion In One Model | - | Paper | - |
| 2025-11-19 | Masked Auto-Regressive Variational Acceleration: Fast Inference Makes Practical Reinforcement Learning | - | Paper | - |
| 2025-11-24 | DiP: Taming Diffusion Models in Pixel Space | - | Paper | - |
| 2025-11-27 | Test-time scaling of diffusions with flow maps | - | Paper | - |
| 2025-12-01 | Efficient Training of Diffusion Mixture-of-Experts Models: A Practical Recipe | - | Paper | - |
| 2025-12-03 | Highly Efficient Test-Time Scaling for T2I Diffusion Models with Text Embedding Perturbation | Full Abstract: Test-time scaling (TTS) aims to achieve better results by increasing random sampling and evaluating samples based on rules and metrics. However, in text-to-image (T2I) diffusion models, most related works focus on search strategies and reward models, while the impact of the stochastic characteristics of noise on the method's performance remains unexplored. In this work, we analyze the effects of randomness in T2I diffusion models and explore a new format of randomness for TTS: text embedding perturbation, which couples with existing randomness like SDE-injected noise to enhance generative diversity and quality. We start with a frequency-domain analysis of these formats of randomness and their impact on generation, and find that the two sources of randomness exhibit complementary behavior in the frequency domain: spatial noise favors low-frequency components (early steps), while text embedding perturbation enhances high-frequency details (later steps), thereby compensating for the potential limitations of spatial noise randomness in high-frequency manipulation. Concurrently, text embeddings demonstrate varying levels of tolerance to perturbation across different dimensions of the generation process. Specifically, our method consists of two key designs: (1) introducing step-based text embedding perturbation, combining frequency-guided noise schedules with spatial noise perturbation; and (2) adapting the perturbation intensity selectively based on its frequency-specific contributions to generation and tolerance to perturbation. Our approach can be seamlessly integrated into existing TTS methods and demonstrates significant improvements on multiple benchmarks with almost no additional computation. Code is available at https://github.com/xuhang07/TEP-Diffusion. | Paper | - |
| 2025-12-15 | Few-Step Distillation for Text-to-Image Generation: A Practical Guide | Full Abstract: Diffusion distillation has dramatically accelerated class-conditional image synthesis, but its applicability to open-ended text-to-image (T2I) generation is still unclear. We present the first systematic study that adapts and compares state-of-the-art distillation techniques on a strong T2I teacher model, FLUX.1-lite. By casting existing methods into a unified framework, we identify the key obstacles that arise when moving from discrete class labels to free-form language prompts. Beyond a thorough methodological analysis, we offer practical guidelines on input scaling, network architecture, and hyperparameters, accompanied by an open-source implementation and pretrained student models. Our findings establish a solid foundation for deploying fast, high-fidelity, and resource-efficient diffusion generators in real-world T2I applications. Code is available on github.com/alibaba-damo-academy/T2I-Distill. | Paper | - |
| 2025-12-29 | ThinkGen: Generalized Thinking for Visual Generation | Full Abstract: Recent progress in Multimodal Large Language Models (MLLMs) demonstrates that Chain-of-Thought (CoT) reasoning enables systematic solutions to complex understanding tasks. However, its extension to generation tasks remains nascent and limited by scenario-specific mechanisms that hinder generalization and adaptation. In this work, we present ThinkGen, the first think-driven visual generation framework that explicitly leverages MLLM's CoT reasoning in various generation scenarios. ThinkGen employs a decoupled architecture comprising a pretrained MLLM and a Diffusion Transformer (DiT), wherein the MLLM generates tailored instructions based on user intent, and DiT produces high-quality images guided by these instructions. We further propose a separable GRPO-based training paradigm (SepGRPO), alternating reinforcement learning between the MLLM and DiT modules. This flexible design enables joint training across diverse datasets, facilitating effective CoT reasoning for a wide range of generative scenarios. Extensive experiments demonstrate that ThinkGen achieves robust, state-of-the-art performance across multiple generation benchmarks. Code is available: https://github.com/jiaosiyuu/ThinkGen | Paper | - |
| 2025-12-29 | Direct Diffusion Score Preference Optimization via Stepwise Contrastive Policy-Pair Supervision | Full Abstract: Diffusion models have achieved impressive results in generative tasks such as text-to-image synthesis, yet they often struggle to fully align outputs with nuanced user intent and maintain consistent aesthetic quality. Existing preference-based training methods like Diffusion Direct Preference Optimization help address these issues but rely on costly and potentially noisy human-labeled datasets. In this work, we introduce Direct Diffusion Score Preference Optimization (DDSPO), which directly derives per-timestep supervision from winning and losing policies when such policies are available. Unlike prior methods that operate solely on final samples, DDSPO provides dense, transition-level signals across the denoising trajectory. In practice, we avoid reliance on labeled data by automatically generating preference signals using a pretrained reference model: we contrast its outputs when conditioned on original prompts versus semantically degraded variants. This practical strategy enables effective score-space preference supervision without explicit reward modeling or manual annotations. Empirical results demonstrate that DDSPO improves text-image alignment and visual quality, outperforming or matching existing preference-based methods while requiring significantly less supervision. Our implementation is available at: https://dohyun-as.github.io/DDSPO | Paper | - |
| 2025-12-28 | Guided Path Sampling: Steering Diffusion Models Back on Track with Principled Path Guidance | Full Abstract: Iterative refinement methods based on a denoising-inversion cycle are powerful tools for enhancing the quality and control of diffusion models. However, their effectiveness is critically limited when combined with standard Classifier-Free Guidance (CFG). We identify a fundamental limitation: CFG's extrapolative nature systematically pushes the sampling path off the data manifold, causing the approximation error to diverge and undermining the refinement process. To address this, we propose Guided Path Sampling (GPS), a new paradigm for iterative refinement. GPS replaces unstable extrapolation with a principled, manifold-constrained interpolation, ensuring the sampling path remains on the data manifold. We theoretically prove that this correction transforms the error series from unbounded amplification to strictly bounded, guaranteeing stability. Furthermore, we devise an optimal scheduling strategy that dynamically adjusts guidance strength, aligning semantic injection with the model's natural coarse-to-fine generation process. Extensive experiments on modern backbones like SDXL and Hunyuan-DiT show that GPS outperforms existing methods in both perceptual quality and complex prompt adherence. For instance, GPS achieves a superior ImageReward of 0.79 and HPS v2 of 0.2995 on SDXL, while improving overall semantic alignment accuracy on GenEval to 57.45%. Our work establishes that path stability is a prerequisite for effective iterative refinement, and GPS provides a robust framework to achieve it. | Paper | - |
| 2026-01-04 | Guiding Token-Sparse Diffusion Models | Full Abstract: Diffusion models deliver high quality in image synthesis but remain expensive during training and inference. Recent works have leveraged the inherent redundancy in visual content to make training more affordable by training only on a subset of visual information. While these methods were successful in providing cheaper and more effective training, sparsely trained diffusion models struggle at inference. This is due to their weak response to Classifier-free Guidance (CFG), which leads to underwhelming performance during inference. To overcome this, we propose Sparse Guidance (SG). Instead of using conditional dropout as a signal to guide diffusion models, SG uses token-level sparsity. As a result, SG better preserves the high variance of the conditional prediction, achieving good-quality, high-variance outputs. Leveraging token-level sparsity at inference, SG improves fidelity at lower compute, achieving 1.58 FID on the commonly used ImageNet-256 benchmark with 25% fewer FLOPs, and yielding up to 58% FLOP savings at matched baseline quality. To demonstrate the effectiveness of Sparse Guidance, we train a 2.5B text-to-image diffusion model using training-time sparsity and leverage SG during inference. SG improves composition and human preference scores while increasing throughput at the same time. | Paper | - |
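Two of the entries above (Guided Path Sampling and Guiding Token-Sparse Diffusion Models) revolve around the standard classifier-free-guidance extrapolation and what to use as its "weak" branch. The sketch below shows only that shared update with a swappable weak prediction; the token-masking toy denoiser and the 25% keep rate are our own illustrative assumptions, not the formulation from either paper.

```python
# Minimal sketch of CFG-style guidance with a swappable weak branch.
# The toy denoiser and sparsity rule are hypothetical, for illustration only.
import numpy as np

def guided_prediction(strong: np.ndarray, weak: np.ndarray, scale: float) -> np.ndarray:
    """Standard classifier-free-guidance extrapolation: weak + scale * (strong - weak)."""
    return weak + scale * (strong - weak)

def toy_denoiser(tokens: np.ndarray, keep_mask: np.ndarray) -> np.ndarray:
    """Stand-in for a diffusion transformer that only 'sees' the kept tokens."""
    kept = np.where(keep_mask[:, None], tokens, 0.0)
    pooled = kept.sum(axis=0, keepdims=True) / max(int(keep_mask.sum()), 1)
    return np.repeat(pooled, len(tokens), axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 4))  # 16 tokens with 4-dim features at one denoising step

# Conditional-dropout CFG uses an unconditional pass as the weak branch;
# a token-sparsity variant instead uses a prediction computed from a token subset.
dense_pred = toy_denoiser(x, np.ones(16, dtype=bool))
sparse_pred = toy_denoiser(x, rng.random(16) < 0.25)  # keep roughly 25% of tokens

guided = guided_prediction(dense_pred, sparse_pred, scale=3.0)
print(guided.shape)  # (16, 4)
```

With conditional dropout the weak branch is an unconditional pass; a token-sparsity variant instead extrapolates away from a prediction computed on a subset of tokens, which is the contrast the Sparse Guidance abstract draws.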