SYNTHESIS NOTE

Can reasoning and answers be generated separately in language models?

Explores whether diffusion LLMs can embed reasoning prompts directly within generation sequences rather than as prefixes, and whether answers and reasoning can be decoupled as independent refinement axes.

Synthesis note · 2026-05-03 · sourced from Diffusion LLM

Autoregressive (AR) models can only attach prompts as a prefix because generation proceeds left-to-right: reasoning must be generated sequentially before any answer token exists, so a CoT prompt has to live entirely in the prefix and the answer becomes available only at the end of decoding. Diffusion LLMs (dLLMs) combine bidirectional attention with iterative refinement, which structurally permits a different prompting strategy: in-place prompts, embedded directly among the masked token positions and refined alongside the rest of the sequence.
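The visibility difference can be made concrete with attention masks. The following is a minimal illustration, not code from any dLLM implementation: the function names and the toy layout (four reasoning positions followed by two answer positions) are assumptions made for the example.

```python
def causal_mask(n):
    """Left-to-right visibility: position i sees positions 0..i only."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Full visibility: every position sees every other position."""
    return [[True] * n for _ in range(n)]


def reasoning_sees_answer(mask, reasoning_positions, answer_positions):
    """True if any reasoning position can attend to any answer position."""
    return any(mask[r][a] for r in reasoning_positions for a in answer_positions)


# Hypothetical toy layout: positions 0-3 hold reasoning, 4-5 hold the answer.
N = 6
REASONING = range(0, 4)
ANSWER = range(4, 6)

ar_visible = reasoning_sees_answer(causal_mask(N), REASONING, ANSWER)
dllm_visible = reasoning_sees_answer(bidirectional_mask(N), REASONING, ANSWER)
print(ar_visible, dllm_visible)
```

Under the causal mask no reasoning position can look ahead at the answer region, while under the bidirectional mask every reasoning position can. That is the structural property that makes in-place prompts possible.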

ICE (In-Place Chain-of-Thought Prompting with Early Exit) operationalizes this by structuring the generation sequence into two semantically distinct sections — a thinking section and an answer section — with explicit step-by-step reasoning templates embedded in the thinking section as in-place prompts. Both sections are refined simultaneously through the diffusion denoising process, so the model can refine reasoning steps while maintaining awareness of answer regions throughout generation. This is impossible in AR models, where answer content is inaccessible until reasoning completes.
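One way to picture the two-section layout is as a single token list in which the template tokens are fixed and everything else starts masked. This is a sketch under stated assumptions: the `[MASK]` string, the "Step 1:" wording, the whitespace tokenization, and the `build_in_place_sequence` helper are all hypothetical and are not taken from ICE itself.

```python
MASK = "[MASK]"  # placeholder for a position the model still has to fill


def build_in_place_sequence(question, step_templates, slots_per_step, answer_len):
    """Lay out question + thinking section + answer section as one token list.

    The step templates are fixed tokens placed inside the thinking section;
    every other generated position starts out masked.
    """
    tokens = question.split()
    thinking_start = len(tokens)
    for template in step_templates:
        tokens.extend(template.split())         # in-place prompt tokens (fixed)
        tokens.extend([MASK] * slots_per_step)  # reasoning content to refine
    answer_start = len(tokens)
    tokens.extend(["Answer:"] + [MASK] * answer_len)
    return tokens, (thinking_start, answer_start), (answer_start, len(tokens))


tokens, thinking_span, answer_span = build_in_place_sequence(
    question="What is 12 * 7 ?",
    step_templates=["Step 1:", "Step 2:"],  # assumed template wording
    slots_per_step=3,
    answer_len=2,
)
print(" ".join(tokens))
print("thinking:", thinking_span, "answer:", answer_span)
```

Because both spans sit in the same sequence from the first denoising step, a refinement pass over the masked positions touches reasoning and answer slots together rather than one after the other.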

The second contribution exploits a previously unnamed property of dLLM refinement dynamics: confidence in answer tokens converges rapidly to high levels and stays stable, while the reasoning section continues to be refined long after. Models therefore often determine the correct answer well before the explicit reasoning trace stabilizes: a kind of intuitive answer commitment followed by post-hoc reasoning, mirroring the structure of human dual-process cognition (and aligning, from the AR side, with the note "Does chain-of-thought reasoning reflect genuine thinking or performance?"). ICE uses a confidence-aware early-exit mechanism to cut compute: once answer-token confidence has converged, it decodes the answer tokens in parallel, even while reasoning is still being refined.
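A confidence-based exit check could look roughly like the sketch below. The exit rule is an assumption chosen for illustration (every answer position at or above a threshold for a few consecutive steps), and the confidence traces are made-up numbers, not measurements; ICE's actual criterion may differ.

```python
def first_exit_step(answer_conf_per_step, threshold=0.9, patience=2):
    """Return the first step index at which the answer could be committed.

    answer_conf_per_step[t] holds one confidence per answer position at
    refinement step t. The exit rule here (every position at or above
    `threshold` for `patience` consecutive steps) is an assumed criterion.
    Returns None if the rule never fires.
    """
    streak = 0
    for step, confidences in enumerate(answer_conf_per_step):
        streak = streak + 1 if min(confidences) >= threshold else 0
        if streak >= patience:
            return step
    return None


# Hypothetical traces: answer confidence saturates early, reasoning later.
answer_trace = [
    [0.40, 0.35], [0.92, 0.91], [0.95, 0.93],
    [0.97, 0.96], [0.98, 0.97], [0.98, 0.98],
]
reasoning_trace = [
    [0.20, 0.25], [0.40, 0.30], [0.55, 0.50],
    [0.70, 0.65], [0.85, 0.80], [0.93, 0.91],
]

exit_step = first_exit_step(answer_trace)
reasoning_done = first_exit_step(reasoning_trace, patience=1)
print(exit_step, reasoning_done)
```

On these invented traces the answer rule fires at step 2, while the reasoning section first clears the threshold at step 5. The steps between the two are the compute an early exit would skip.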

The structural implication is that in dLLMs, reasoning and answering are decouplable axes of generation rather than a temporally ordered sequence. The reasoning trace can serve roles other than producing the answer — for example, post-hoc justification or interpretability — and the answer can be produced from internal state earlier than the visible reasoning suggests. This breaks the AR-era assumption that visible CoT length is an upper bound on compute spent before answering.

Original note title

in-place prompting in diffusion LLMs eliminates the prefix-only constraint of autoregressive prompting — reasoning embeds within masked positions during refinement