INQUIRING LINE

Can diffusion models condition on right context natively without special training for infilling?

This explores whether diffusion language models can fill in missing text using both left and right context as a built-in property of how they generate — rather than needing a dedicated infilling objective the way autoregressive (left-to-right) models do.


The corpus suggests that diffusion language models largely do get infilling "for free", and the reason traces back to a single architectural difference: diffusion models refine a whole sequence in parallel rather than predicting one token at a time from left to right. Because every position can attend to every other position (bidirectional attention), the surrounding text on *both* sides is already part of what the model conditions on at each denoising step. Infilling isn't a bolted-on mode; it's just what happens when you leave some positions masked and let the model fill them while reading the context around them.
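A toy sketch may make the "leave some positions masked" idea concrete. It assumes a made-up `toy_denoiser` that guesses each masked value by interpolating between the nearest known tokens on both sides; a real diffusion language model would use a trained network here, and the names `MASK`, `toy_denoiser`, and `infill` are all hypothetical.

```python
MASK = None  # sentinel marking a position that still has to be filled


def toy_denoiser(seq):
    """Hypothetical stand-in for a trained denoising network.

    For each masked position, guess a value by integer interpolation between
    the nearest known token on the left and the nearest known token on the
    right, and attach a confidence that shrinks with distance from context.
    """
    known = [i for i, tok in enumerate(seq) if tok is not MASK]
    preds = {}
    for i, tok in enumerate(seq):
        if tok is not MASK:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is not None and right is not None:
            span = seq[right] - seq[left]
            guess = seq[left] + span * (i - left) // (right - left)
            dist = min(i - left, right - i)
        elif left is not None:      # only left context is available
            guess, dist = seq[left], i - left
        elif right is not None:     # only right context is available
            guess, dist = seq[right], right - i
        else:                       # nothing known at all
            guess, dist = 0, len(seq)
        preds[i] = (guess, 1.0 / (1 + dist))
    return preds


def infill(seq, tokens_per_step=1):
    """Fill masked positions a few at a time, most confident first."""
    seq = list(seq)
    while any(tok is MASK for tok in seq):
        preds = toy_denoiser(seq)
        ranked = sorted(preds, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:tokens_per_step]:
            seq[i] = preds[i][0]  # known context is never overwritten
    return seq


print(infill([10, MASK, MASK, MASK, 50]))  # [10, 20, 30, 40, 50]
print(infill([10, MASK, MASK, MASK, 90]))  # [10, 30, 50, 70, 90]
```

Changing only the right-hand token changes every filled value, which is the behavior the paragraph describes: the right context is part of the conditioning at every step, with no separate infilling mode.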

The clearest evidence is Diffusion-LM, which succeeds on six fine-grained control tasks, including infilling, where plug-and-play methods on autoregressive models fail (see "Can diffusion models enable control that autoregressive models cannot reach?"). Its continuous latent variables let gradients flow across the entire sequence at once, replacing the discrete left-to-right bottleneck. Infilling there is one instance of a more general capability: conditioning on global structure that an autoregressive model, committed to its prefix, can't easily reach back into.
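The "gradients flow across the entire sequence at once" point can be illustrated with a deliberately small sketch. It is not Diffusion-LM's procedure: it leaves out the denoising network entirely and assumes a hypothetical control objective (push the mean of the latents toward a target), only to show that one gradient step on continuous latents moves every position simultaneously.

```python
def control_loss(latents, target_mean):
    """Hypothetical control objective: squared error on the sequence mean."""
    mean = sum(latents) / len(latents)
    return (mean - target_mean) ** 2


def control_grad(latents, target_mean):
    """Gradient of control_loss with respect to every latent position."""
    n = len(latents)
    mean = sum(latents) / n
    # d/dz_i (mean - target)^2 = 2 * (mean - target) / n, identical for all i
    return [2.0 * (mean - target_mean) / n] * n


def guided_update(latents, target_mean, step_size=1.0, steps=50):
    """Plain gradient descent on the whole sequence of latents at once."""
    z = list(latents)
    for _ in range(steps):
        grad = control_grad(z, target_mean)
        z = [zi - step_size * gi for zi, gi in zip(z, grad)]
    return z


start = [0.0, 1.0, -1.0, 2.0]
steered = guided_update(start, target_mean=3.0)
print(control_loss(start, 3.0), control_loss(steered, 3.0))
```

A discrete left-to-right sampler has no analogous handle: tokens already emitted are fixed, so a constraint on the whole sequence can only influence positions that have not been generated yet.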

The sharpest framing of the contrast comes from in-place prompting, which explicitly names what it removes: the "prefix-only constraint" of autoregressive models (see "Can reasoning and answers be generated separately in language models?"). Because attention is bidirectional, you can embed instructions or reasoning *directly into masked positions inside the sequence* and have them refined alongside the answer. A left-to-right model structurally cannot do this, since it only ever sees what came before. That's the deeper point: "infilling" and "conditioning on right context" are the same thing, and the model does both natively.
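The prefix-only constraint is easiest to see in the attention mask itself. The sketch below builds a causal mask and a bidirectional mask as plain nested lists and reports which positions a slot in the middle of the sequence is allowed to read; the helper names are illustrative, not taken from any library.

```python
def causal_mask(n):
    """mask[i][j] is True when query position i may attend to key j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every query position may attend to every key position."""
    return [[True] * n for _ in range(n)]


def visible_positions(mask, query):
    """Indices of the positions that `query` is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[query]) if allowed]


n = 6
slot = 2  # a hypothetical masked position in the middle of the sequence

print(visible_positions(causal_mask(n), slot))         # [0, 1, 2]
print(visible_positions(bidirectional_mask(n), slot))  # [0, 1, 2, 3, 4, 5]
```

Under the causal mask, positions 3 to 5 are invisible to the slot, so nothing written to its right can influence it. Under the bidirectional mask the same slot reads both sides.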

The same parallel, non-sequential generation that makes this possible also has costs worth knowing about. It breaks the clean log-likelihood factorization that left-to-right models rely on, which is why adapting reinforcement learning to diffusion LLMs is genuinely hard: likelihoods become intractable and need workarounds (see "Why can't we easily adapt reinforcement learning to diffusion language models?"). And there's an upside lurking in the same mechanism: because the model refines the whole sequence at once, it often "knows" the answer well before decoding finishes. Up to 99% of MMLU instances and 97% of GSM8K instances reach the correct answer by the midpoint of refinement, enabling early-exit speedups (see "Can diffusion models commit to answers before full decoding?").
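One way to picture an early exit is to watch the gap between the top two candidate tokens at each answer position and stop once that gap is large everywhere. The sketch below assumes a precomputed list of per-step probability distributions and an arbitrary threshold of 0.5; both are illustrative, and this is not a reproduction of any published stopping rule.

```python
def confidence_gap(probs):
    """Gap between the two largest probabilities in one distribution."""
    top1, top2 = sorted(probs, reverse=True)[:2]
    return top1 - top2


def steps_until_exit(trajectory, threshold=0.5):
    """Return the 1-based refinement step at which decoding could stop.

    `trajectory` holds one entry per refinement step; each entry is a list of
    probability distributions, one per answer position.
    """
    for step, dists in enumerate(trajectory, start=1):
        if min(confidence_gap(p) for p in dists) >= threshold:
            return step
    return len(trajectory)  # never confident enough: run every step


# Made-up numbers: two answer positions, three candidate tokens, four steps.
trajectory = [
    [[0.40, 0.35, 0.25], [0.50, 0.30, 0.20]],
    [[0.80, 0.10, 0.10], [0.75, 0.15, 0.10]],
    [[0.90, 0.05, 0.05], [0.85, 0.10, 0.05]],
    [[0.95, 0.03, 0.02], [0.90, 0.06, 0.04]],
]
print(steps_until_exit(trajectory))  # 2
```

In this toy trajectory the answer positions are already well separated at step 2 of 4, so the remaining steps could be skipped.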

The takeaway: infilling for diffusion models isn't a feature someone trained in; it's the default consequence of seeing both directions at once. The trade is that this same property, which makes right-context conditioning native, is exactly what makes likelihood-based tooling (like standard RL) awkward to port over. The capability and the difficulty come from the same place.
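The likelihood difficulty can be written down in a few lines. The notation here is an assumption made for illustration: $x_{<i}$ is the prefix before token $i$, $x_0$ is the clean sequence, $x_{1:T}$ is a denoising trajectory, and $q$ is the forward (noising) process.

```latex
% Autoregressive: the chain rule gives an exact, per-token factorization.
\log p_\theta(x) = \sum_{i=1}^{n} \log p_\theta(x_i \mid x_{<i})

% Diffusion: the likelihood of x_0 marginalizes over all denoising trajectories.
\log p_\theta(x_0) = \log \sum_{x_{1:T}} p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)

% In practice only a lower bound (the ELBO) is tractable.
\log p_\theta(x_0) \ge \mathbb{E}_{q(x_{1:T} \mid x_0)}
  \left[ \log \frac{p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)}{q(x_{1:T} \mid x_0)} \right]
```

Methods that need an exact sequence log-probability, such as a policy-gradient ratio, can read it directly off the first line; for the second line they have to fall back on a bound or an estimate.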


Sources (4 notes)

Can diffusion models enable control that autoregressive models cannot reach?

Diffusion-LM succeeds on six fine-grained control tasks (syntax, semantics, infilling, length) where plug-and-play methods fail. Its continuous latent variables allow gradients to flow across the entire sequence simultaneously, replacing the discrete-token bottleneck and enabling parallel denoising.

Can reasoning and answers be generated separately in language models?

ICE shows that bidirectional attention in diffusion LLMs enables in-place prompting—embedding reasoning directly in masked positions refined alongside answers. Answer confidence converges early while reasoning continues refining, allowing early-exit mechanisms to cut compute by 50% while maintaining accuracy.

Why can't we easily adapt reinforcement learning to diffusion language models?

Diffusion language models cannot directly use AR-developed RL methods like GRPO and DPO because iterative non-sequential token generation requires marginalizing over denoising trajectories, making likelihood intractable. Workarounds exist—outcome-based rewards, policy learning for unmasking order, and adapted preference optimization—enabling models like DCoLT to gain 9–19% on benchmarks.

Can diffusion models commit to answers before full decoding?

Up to 99% of MMLU instances and 97% of GSM8K instances reach correct answers by the midpoint of refinement. Prophet exploits this by monitoring confidence gaps to stop early, achieving 3.4× speedup with no quality loss.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a diffusion-LLM researcher evaluating whether native infilling via bidirectional attention remains a durable capability advantage, or whether recent autoregressive or hybrid methods have narrowed or closed the gap.

What a curated library found, and when (findings span 2022–2025; treat them as dated claims, not current truth):
• Diffusion-LMs achieve infilling natively without special training because bidirectional attention lets every position condition on both left and right context during parallel denoising (Diffusion-LM, 2022).
• In-place prompting explicitly removes the "prefix-only constraint" of autoregressive models by embedding instructions inside masked positions, which only bidirectional models can refine in parallel (2025).
• Early-exit results show diffusion models reach the correct answer on up to 99% of MMLU instances and 97% of GSM8K instances by the midpoint of decoding, implying the model "knows" answers far earlier than sequential generation would (2025).
• Applying RL to diffusion LLMs is hard because parallel generation breaks log-likelihood factorization, forcing intractable likelihood calculations and workarounds (2024–2025).
• Recent surveys on diffusion LMs and consistency training (2025) document the architectural gap but don't rule out that post-hoc alignment or hybrid architectures could close it.

Anchor papers (verify; mind their dates):
• arXiv:2205.14217 — Diffusion-LM Improves Controllable Text Generation (2022)
• arXiv:2508.10736 — In-Place Prompting in Diffusion LLMs (2025)
• arXiv:2508.19982 — Diffusion Language Models Know the Answer Before Decoding (2025)
• arXiv:2508.10875 — A Survey on Diffusion Language Models (2025)

Your task:
(1) RE-TEST THE NATIVE INFILLING CLAIM. Has any recent autoregressive, non-diffusion, or hybrid architecture (e.g., mamba variants, mixture-of-experts with selective masking, or chain-of-thought retraining) matched or exceeded diffusion models' infilling generalization without bidirectional attention? Separate the durable insight (parallel attention enables unsupervised right-context conditioning) from any perishable limitation (only diffusion models can do it efficiently).
(2) Surface the strongest work from the last 6 months that either contradicts the "native infilling" framing (e.g., claims diffusion still needs task-specific tuning) or shows autoregressive models achieving comparable infilling via prompting, caching, or inference-time retrieval.
(3) Propose 2 questions that assume the regime may have shifted: (a) Does post-training with consistency or sycophancy-reduction (arXiv:2510.27062) inadvertently harm the native infilling property? (b) Can in-context learning of infilling patterns in autoregressive models (few-shot, no retraining) now match diffusion's zero-shot performance?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
