SYNTHESIS NOTE

Can diffusion models commit to answers before full decoding?

Do diffusion language models settle on correct answers early in their refinement process, and if so, can we detect and exploit this convergence to speed up inference without losing quality?

Synthesis note · 2026-05-03 · sourced from Diffusion LLM

Diffusion LMs are slower than AR models at inference, primarily because of the cost of bidirectional attention and the large number of refinement steps required for high-quality outputs. The standard assumption is that more refinement yields better answers, so cutting the refinement budget should cost accuracy. This paper documents a counterintuitive empirical property: early answer convergence. In many cases, the model has already identified the correct answer internally by the halfway point of refinement, well before the final decoding step: up to 97% of instances on GSM8K and up to 99% on MMLU. The pattern holds under both semi-autoregressive and random remasking schedules.
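One way to make "early answer convergence" concrete is to record the answer the model would give at each refinement step and find the first step from which it stays correct. The sketch below is illustrative only: the function names, the per-step answer traces, and the "stays correct through the final step" criterion are assumptions made for this example, not the paper's measurement protocol.

```python
from typing import List, Optional, Sequence


def earliest_stable_step(step_answers: Sequence[str], reference: str) -> Optional[int]:
    """Return the first 0-based step from which every later answer equals `reference`.

    Returns None when the final answer is wrong (or the trace is empty),
    since in that case the model never settled on the reference.
    """
    if not step_answers or step_answers[-1] != reference:
        return None
    step = len(step_answers) - 1
    # Walk backwards while the preceding step already held the reference answer.
    while step > 0 and step_answers[step - 1] == reference:
        step -= 1
    return step


def fraction_converged_by(
    traces: Sequence[Sequence[str]],
    references: Sequence[str],
    budget_fraction: float = 0.5,
) -> float:
    """Fraction of instances whose answer is stable and correct within the given
    fraction of the refinement budget (hypothetical metric for illustration)."""
    hits = 0
    for answers, ref in zip(traces, references):
        first = earliest_stable_step(answers, ref)
        cutoff = int(len(answers) * budget_fraction)
        if first is not None and first < cutoff:
            hits += 1
    return hits / len(traces)


# Toy traces (made-up data): one answer string per refinement step.
toy_traces: List[List[str]] = [
    ["3", "7", "7", "7"],  # settles on "7" at step 1
    ["1", "2", "5", "5"],  # settles on "5" only at step 2
    ["9", "9", "9", "8"],  # final answer is wrong
]
toy_refs = ["7", "5", "9"]
print(fraction_converged_by(toy_traces, toy_refs))
```

With these toy traces only the first instance counts as converged within half the budget; the second settles too late and the third ends on a wrong answer.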

This reveals a fundamental redundancy in conventional full-length decoding. Most of the latter half of decoding does not improve the answer; it merely maintains an answer the model has already settled on. The right framing is that DLM decoding is a stopping problem: when is it safe to commit and emit the answer rather than continue refining?

Prophet operationalizes this insight as a training-free fast decoding paradigm that monitors the confidence gap between the top-2 prediction candidates and dynamically decides whether to continue refinement or "go all-in" — decode all remaining tokens in one step. The confidence gap serves as a reliable signal for when the model has internally committed; once it has, additional refinement is wasted compute. The mechanism integrates seamlessly into existing DLM implementations with negligible overhead and requires no additional training.
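The commit decision described above can be sketched in a few lines. This is a minimal illustration, not Prophet's actual implementation: the helper names, the use of a mean over the still-masked positions, and the single fixed threshold are all assumptions made for the example, and the note does not specify how the gap is aggregated or thresholded.

```python
from typing import List, Sequence


def top2_gap(probs: Sequence[float]) -> float:
    """Gap between the largest and second-largest probability at one position.

    Assumes at least two candidates in `probs`.
    """
    first, second = sorted(probs, reverse=True)[:2]
    return first - second


def should_commit(masked_probs: Sequence[Sequence[float]], threshold: float) -> bool:
    """Decide whether to stop refining.

    `masked_probs` holds one probability distribution per still-masked position.
    Assumption: commit when the mean top-2 gap reaches `threshold`.
    """
    if not masked_probs:
        return True  # nothing left to decode
    gaps = [top2_gap(p) for p in masked_probs]
    return sum(gaps) / len(gaps) >= threshold


def commit_all(masked_probs: Sequence[Sequence[float]]) -> List[int]:
    """'Go all-in': pick the argmax token id for every remaining position at once."""
    return [max(range(len(p)), key=p.__getitem__) for p in masked_probs]


# Toy distributions over a 3-token vocabulary (made-up numbers).
confident = [[0.90, 0.05, 0.05], [0.10, 0.85, 0.05]]
uncertain = [[0.40, 0.35, 0.25], [0.50, 0.45, 0.05]]

for name, dists in [("confident", confident), ("uncertain", uncertain)]:
    if should_commit(dists, threshold=0.6):
        print(name, "-> commit", commit_all(dists))
    else:
        print(name, "-> keep refining")
```

In a real decoding loop a check like this would run once per refinement step on the model's output distributions, so its cost is small relative to a forward pass.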

Empirically, on LLaDA-8B and Dream-7B across multiple tasks, Prophet reduces decoding steps by up to 3.4× while preserving generation quality. The structural lesson generalizes beyond DLMs: any iterative-refinement model with monitorable internal confidence faces a stopping problem rather than a fixed budget, and treating the number of refinement steps as a hyperparameter rather than a runtime decision leaves substantial compute on the table. The note "Does reflection in reasoning models actually correct errors?" reaches the same diagnosis for AR reasoning.

Original note title

diffusion language models know the answer well before decoding completes — up to 99 percent of MMLU instances are correctly resolvable at half the refinement budget