Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing

Paper · arXiv 2508.09192 · Published August 8, 2025
Diffusion-Based LLMs

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, none of the existing open-source dLLMs have achieved superior inference speed over AR LLMs of similar size. This paper breaks this barrier with a simple and effective strategy named discrete diffusion forcing (D2F). D2F equips dLLMs with two key capabilities: (1) block-wise autoregressive generation, which enables KV cache utilization; and (2) prediction of following tokens without requiring completion of prior blocks, which enables inter-block parallel decoding. In this way, vanilla dLLMs are refurbished into an AR-diffusion hybrid paradigm for efficient inference. D2F can be implemented with an asymmetric distillation process based on pre-trained dLLMs. We further propose a pipelined parallel decoding algorithm, which enables a trade-off between efficiency and efficacy. Empirically, D2F dLLMs achieve more than 2.5× the inference speed of LLaMA3 and Qwen2.5 on GSM8K. Compared to vanilla dLLMs like LLaDA and Dream, the acceleration can be more than 50× while maintaining comparable output quality. The code is available at https://github.com/zhijie-group/Discrete-Diffusion-Forcing.
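To make capability (1) concrete, one common way to obtain block-wise autoregressive behavior is a block-causal attention mask: positions attend freely within their own block and to all earlier blocks, but never to later ones. The sketch below is an illustrative assumption, not code from the paper or its repository; the function name `block_causal_mask` and the parameters `seq_len` and `block_size` are hypothetical.

```python
def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Return mask where mask[i][j] is True if position i may attend to j.

    Assumed rule (illustrative, not taken from the paper): a position sees
    every position in its own block and in all earlier blocks.
    """
    if seq_len < 0:
        raise ValueError("seq_len must be non-negative")
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    mask = []
    for i in range(seq_len):
        block_i = i // block_size  # index of the block containing position i
        # Allow attention to any position whose block index is not later.
        mask.append([(j // block_size) <= block_i for j in range(seq_len)])
    return mask


mask = block_causal_mask(seq_len=6, block_size=2)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Under a mask of this shape, the keys and values of a finished block do not depend on any later block, which is the property that lets them be cached and reused.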

Introduction. Large Language Models (LLMs) have long held a dominant position in text generation (Achiam et al., 2023; Touvron et al., 2023a; Yang et al., 2025; Touvron et al., 2023b; Grattafiori et al., 2024). Recently, Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to LLMs (Ye et al., 2025; Nie et al., 2025; Zhu et al., 2025), noted for their potential to generate multiple text tokens in parallel. For example, closed-source dLLMs such as Gemini Diffusion (Google DeepMind, 2025) and Mercury (Inception et al., 2025) can yield thousands of tokens per second, 5-10 times faster than traditional autoregressive (AR) LLMs of similar size. However, these speed advantages have not yet been demonstrated within the open-source community. Approaches to bridge the gap include designing KV cache strategies (Arriola et al., 2025; Liu et al., 2025; Ma et al., 2025) and improving parallel sampling algorithms (Wu et al., 2025; Wei et al., 2025; Hu et al., 2025).

Discussion / Conclusion. In this work, we introduce Discrete Diffusion Forcing (D2F), a novel training paradigm for dLLMs. D2F employs a generation scheme that conditions on partially predicted tokens from previous blocks to predict the next block, thereby supporting KV cache and enabling parallel generation across multiple blocks, resulting in significantly faster inference. Extensive experiments demonstrate that D2F dLLMs are the first open-source dLLMs to achieve faster-than-AR inference.
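The scheduling idea behind starting a block before its predecessor is finished can be illustrated with a toy step counter. This is only a sketch under loudly stated assumptions, not the paper's decoding algorithm: every unfinished block is assumed to decode a fixed number of tokens per step, and a new block is started once the newest block reaches a hypothetical completion fraction `add_threshold`. The function `pipelined_steps` and all of its parameters are invented for illustration, and the toy does not model output quality at all.

```python
def pipelined_steps(num_blocks: int, block_size: int,
                    tokens_per_step: int, add_threshold: float) -> int:
    """Count steps needed to finish all blocks in a toy block pipeline.

    Hypothetical rule: the next block starts once the newest block has
    decoded at least add_threshold * block_size tokens.
    """
    if num_blocks <= 0 or block_size <= 0 or tokens_per_step <= 0:
        raise ValueError("counts and sizes must be positive")
    if not 0.0 <= add_threshold <= 1.0:
        raise ValueError("add_threshold must lie in [0, 1]")
    decoded = [0]  # tokens decoded so far in each started block
    steps = 0
    while len(decoded) < num_blocks or any(d < block_size for d in decoded):
        # Start the next block if the newest one is far enough along.
        if len(decoded) < num_blocks and decoded[-1] >= add_threshold * block_size:
            decoded.append(0)
        # All started, unfinished blocks advance in the same step.
        decoded = [min(block_size, d + tokens_per_step) for d in decoded]
        steps += 1
    return steps


# add_threshold=1.0 waits for each block to finish (strictly block-by-block).
sequential = pipelined_steps(num_blocks=4, block_size=8,
                             tokens_per_step=2, add_threshold=1.0)
# add_threshold=0.5 starts the next block when the newest is half decoded.
pipelined = pipelined_steps(num_blocks=4, block_size=8,
                            tokens_per_step=2, add_threshold=0.5)
print(sequential, pipelined)  # prints: 16 10
```

In this toy, lowering the threshold only ever reduces the step count; in a real model, starting blocks earlier means conditioning on less complete context, which is where the efficiency versus efficacy trade-off mentioned above would come from.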