SYNTHESIS NOTE

Can diffusion language models match autoregressive inference speed?

Diffusion LLMs promised faster decoding through parallel token generation, but open-source implementations never outpaced autoregressive models in practice. What architectural barriers prevent diffusion from realizing its speed potential?

Synthesis note · 2026-05-03 · sourced from Diffusion LLM

Diffusion LLMs (dLLMs) were proposed in part for inference speed: because they can decode multiple tokens per iteration, they should in principle outpace autoregressive (AR) models. In practice, no open-source dLLM had achieved faster inference than an AR LLM of similar size. The paradox is that bidirectional attention, while it enables parallel generation within a step, costs more compute per step and prevents the KV-cache reuse that makes AR inference cheap, because updating any position can change the keys and values of every other position.
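The gap can be made concrete with a toy count of query-key pairs, the dominant term in attention cost. This is a minimal sketch under stated assumptions, not a measurement: the function names are hypothetical, the prompt and feed-forward layers are ignored, and the dLLM is assumed to recompute full bidirectional attention at every denoising step with no caching.

```python
def ar_cached_attention_pairs(n_tokens: int) -> int:
    """Query-key pairs for AR decoding with a KV cache.

    At step t, one new query attends to the t keys seen so far
    (cached keys plus its own), so the total is 1 + 2 + ... + n.
    """
    return n_tokens * (n_tokens + 1) // 2


def bidirectional_attention_pairs(n_tokens: int, n_steps: int) -> int:
    """Query-key pairs for a dLLM that recomputes full attention each step.

    Assumption: no caching, so every denoising step pays n * n pairs.
    """
    return n_steps * n_tokens * n_tokens


if __name__ == "__main__":
    n = 4
    print(ar_cached_attention_pairs(n))         # 10
    print(bidirectional_attention_pairs(n, 2))  # 32
```

Under this toy model the dLLM comes out ahead only when the number of denoising steps is well below half the sequence length, which is why per-step parallelism alone does not guarantee a speedup.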

Discrete Diffusion Forcing (D2F) breaks this barrier through a hybrid design that takes the speed advantage from each paradigm. The first capability is block-wise autoregressive generation: tokens are generated in blocks rather than as a flat sequence, which permits KV-cache reuse across blocks just as in AR models and avoids the recomputation over completed context that bidirectional attention otherwise imposes. The second capability is prediction of following tokens without requiring completion of prior blocks, which enables inter-block parallel decoding and recovers the parallelism that pure AR cannot offer.
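One common way to realize block-wise autoregression is a block-causal attention mask: positions attend bidirectionally inside their own block and causally to all earlier blocks. The sketch below assumes that mechanism for illustration; the note does not specify D2F's exact mask, and `block_causal_mask` is a hypothetical helper name.

```python
from typing import List


def block_causal_mask(seq_len: int, block_size: int) -> List[List[int]]:
    """Build a block-causal mask: mask[q][k] == 1 means query q may attend to key k.

    A query sees every key in its own block (bidirectional within the block)
    and every key in earlier blocks (causal across blocks), but nothing later.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    mask = []
    for q in range(seq_len):
        q_block = q // block_size
        row = [1 if (k // block_size) <= q_block else 0 for k in range(seq_len)]
        mask.append(row)
    return mask


if __name__ == "__main__":
    for row in block_causal_mask(seq_len=4, block_size=2):
        print(row)
    # [1, 1, 0, 0]
    # [1, 1, 0, 0]
    # [1, 1, 1, 1]
    # [1, 1, 1, 1]
```

Because no query ever attends to a later block, the keys and values of a finished block never change, which is the property that makes them safe to cache.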

The implementation matters as much as the design. D2F uses an asymmetric distillation process from pre-trained dLLMs, so existing dLLMs can be converted into the AR-diffusion hybrid paradigm without training from scratch. A pipelined parallel decoding algorithm provides a configurable trade-off between efficiency and efficacy, so each deployment can choose its own operating point.
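The efficiency side of that trade-off can be illustrated with a toy scheduler. This is a sketch under loud assumptions, not the D2F algorithm: it assumes every active block finalizes a fixed number of tokens per step, and that a new block is opened once the newest block reaches a completion fraction `add_threshold`. All names and parameters here are hypothetical, and output quality is not modeled at all.

```python
def simulate_pipeline(num_blocks: int, block_size: int,
                      tokens_per_step: int, add_threshold: float) -> int:
    """Count decoding steps for a toy pipelined block decoder.

    add_threshold = 1.0 opens a block only after the previous one is complete
    (purely sequential blocks); lower values overlap blocks in a pipeline.
    """
    if num_blocks <= 0 or block_size <= 0 or tokens_per_step <= 0:
        raise ValueError("num_blocks, block_size, tokens_per_step must be positive")
    decoded = [0]  # tokens finalized so far in each opened block
    steps = 0
    # Earlier blocks always finish before the newest one, so checking the
    # last block is enough once all blocks have been opened.
    while len(decoded) < num_blocks or decoded[-1] < block_size:
        steps += 1
        # Every opened block advances in parallel within the same step.
        decoded = [min(block_size, d + tokens_per_step) for d in decoded]
        # Open the next block once the newest one is complete enough.
        if len(decoded) < num_blocks and decoded[-1] / block_size >= add_threshold:
            decoded.append(0)
    return steps


if __name__ == "__main__":
    print(simulate_pipeline(4, 8, 2, add_threshold=1.0))  # 16
    print(simulate_pipeline(4, 8, 2, add_threshold=0.5))  # 10
```

In this toy, lowering the threshold cuts the step count because later blocks start earlier. The presumed cost in a real model is that those blocks are conditioned on less complete context, which is the efficacy side of the dial.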

The deeper lesson is that the AR-vs-diffusion framing has been a false dichotomy at inference time. The two paradigms decompose generation along different axes (AR along sequence position, diffusion along refinement step), and a hybrid that runs AR along blocks while running diffusion within and across blocks captures both kinds of parallelism. Architectural purity costs throughput; pragmatic hybrids win. This converges with the pattern in the related note "How should we balance parallel versus sequential compute at test time?": mixed paradigms outperform pure ones.

Original note title

diffusion language models can achieve faster-than-autoregressive inference by hybridizing block-wise AR with inter-block parallelism