INQUIRING LINE

Can outcome-based rewards fully replace per-step likelihood in diffusion RL training?

This explores whether, in diffusion language models, reward-on-the-final-answer can stand in entirely for the per-token likelihood signal that autoregressive RL relies on — and the corpus suggests outcome rewards make diffusion RL *possible* but leave real gaps that step-level signals fill.


The short version from the corpus: outcome-based rewards are a workaround that unlocks training, not a clean replacement. The core problem is structural. Diffusion language models generate tokens in parallel rather than left-to-right, which breaks the log-likelihood factorization that autoregressive RL methods like GRPO and DPO depend on; computing a clean per-step likelihood would require marginalizing over all the denoising trajectories, which is intractable (Why can't we easily adapt reinforcement learning to diffusion language models?). Outcome-based rewards are the escape hatch precisely because they sidestep that intractable per-step term: you score the final answer and skip the likelihood bookkeeping. And it works, with models like DCoLT picking up 9–19% on benchmarks this way. So 'can it replace?' is partly answered by 'it already does, out of necessity.'
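To make "score the final answer" concrete, here is a minimal sketch of the reward side of a GRPO-style setup, assuming a group of sampled answers to one prompt and a binary correctness score. The function name and the example groups are hypothetical, and the sketch does not show how a particular diffusion RL method turns these advantages into a parameter update.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize final-answer rewards within one group of rollouts for the same prompt.

    No per-step likelihood enters this computation: each rollout contributes
    only the scalar score of its final answer.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical groups: 1.0 means the final answer was judged correct.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)       # positive for correct rollouts, negative for incorrect ones
print(all_failed)  # every advantage is zero
```

Note what happens in the second group: when every rollout gets the same score, every advantage is zero, so that group carries no learning signal.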

But the broader RL literature keeps showing that *dense, step-level* signal does things sparse outcome reward can't. Supervised RL rewards a model by how closely each step matches expert actions, and this gives a learning signal even when every rollout fails, which is exactly the dead zone where outcome-only reward gives you nothing to learn from. The authors frame it as bridging rigid token-by-token imitation and sparse outcome-only RLVR, and note it works best as a *curriculum foundation before* outcome-based refinement, not as a substitute for it (Can step-wise expert rewards help small models learn hard reasoning?). That ordering matters: it implies step signal and outcome signal are complementary phases, not interchangeable.
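A minimal sketch of the contrast, assuming step similarity is measured with Python's `difflib.SequenceMatcher`. The similarity metric, the function name, and the toy steps are illustrative assumptions; the note only says each step is scored by how closely it matches the expert's action.

```python
from difflib import SequenceMatcher

def step_rewards(model_steps, expert_steps):
    """Score each model step by string similarity to the expert step at the same position."""
    rewards = []
    for model_step, expert_step in zip(model_steps, expert_steps):
        rewards.append(SequenceMatcher(None, model_step, expert_step).ratio())
    return rewards

# Hypothetical expert trajectory and a rollout that ends in a wrong answer.
expert = ["factor the quadratic", "set each factor to zero", "x = 2 or x = 3"]
failed_rollout = ["factor the quadratic", "set each factor to one", "x = 1"]

dense = step_rewards(failed_rollout, expert)
outcome = 1.0 if failed_rollout[-1] == expert[-1] else 0.0

print(dense)    # per-step scores; the first step matches the expert exactly
print(outcome)  # 0.0, because the final answer is wrong
```

The rollout fails on the outcome reward, yet the step scores still separate the step that matched the expert from the ones that drifted.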

The same complementarity shows up from other angles. Process-level rewards that score *how* an agent reasons (planning, exploration, reflection) cut repetitive actions by 31% versus outcome-only training while generalizing better than supervised fine-tuning alone (Can RL agents learn to reason better, not just succeed?). And RL training itself appears to move through two phases, where early learning is driven by execution correctness (which outcome reward captures well) but the later bottleneck shifts to strategic planning, where you benefit from concentrating optimization on planning tokens (Does RL training follow a predictable two-phase learning sequence?). Outcome reward is well-matched to the first phase and underpowered for the second, which is another reason 'fully replace' overreaches.

There's also a quieter warning about what outcome reward optimizes *toward*. Binary correctness rewards provably degrade calibration, because rewarding only the final right/wrong answer incentivizes confident guessing and never penalizes confident wrong answers (Does binary reward training hurt model calibration?). The fix there was adding the Brier score as a second reward term, which is itself evidence that a single outcome signal leaves something important unmodeled. Relatedly, work on negative-only reinforcement shows that *what kind* of trajectory-level signal you use changes diversity dramatically: suppressing wrong answers preserves Pass@k while positive-only reward collapses it at higher k (Does negative reinforcement alone outperform full reinforcement learning?). The signal's shape, not just its presence at the outcome, is doing the work.
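The calibration point can be sketched in a few lines. This assumes the model reports a confidence in [0, 1] alongside its answer and that the Brier penalty is simply subtracted from the correctness reward with weight 1; the function names and that exact combination are assumptions for illustration, not the paper's stated formula.

```python
def binary_reward(correct):
    """Outcome-only reward: ignores how confident the model was."""
    return 1.0 if correct else 0.0

def calibrated_reward(correct, confidence):
    """Correctness reward minus a Brier penalty (confidence - outcome)^2."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2

# A wrong answer earns the same binary reward whether it was confident or hedged.
print(binary_reward(False))  # 0.0 regardless of confidence

# With the Brier term, the confident wrong answer is penalized far more.
print(calibrated_reward(False, 0.9))  # about -0.81
print(calibrated_reward(False, 0.1))  # about -0.01
print(calibrated_reward(True, 0.9))   # about 0.99
```

Under the binary reward nothing discourages stating high confidence on a guess; with the second term, overconfidence on wrong answers costs reward directly.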

The twist worth taking away: even where outcome reward seems sufficient, it may be doing less than it looks. RLVR mostly sharpens sampling toward solutions already in the base model rather than expanding capability, and a single example, or even a spurious reward, can trigger nearly the full effect (Does RLVR actually expand what models can reason about?; What does reward learning actually do to model reasoning?). So in diffusion RL, outcome rewards can carry the load *because the heavy lifting was done in pretraining*: they activate existing competence cheaply. Where you actually need to teach new procedure, build calibration, or navigate the planning-heavy second phase, the corpus consistently points back to step-level signal. 'Replace fully' isn't the right frame; 'replace where outcome reward is enough, and reintroduce step signal where it isn't' is closer to what the evidence supports.
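The "sharpening" claim rests on Pass@k comparisons, so a small sketch of the metric may help. It uses the standard unbiased Pass@k estimator, 1 - C(n-c, k)/C(n, k), for n samples of which c are correct. The per-problem counts below are invented toy numbers chosen only to show the crossover pattern; they are not results from any of the cited notes.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of the chance that at least one of k samples is correct,
    given c correct samples out of n drawn."""
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(n, correct_counts, k):
    """Average Pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

n = 20
# Hypothetical counts of correct samples per problem (10 problems each).
base_counts = [2] * 10              # rarely right, but right sometimes on every problem
tuned_counts = [18] * 6 + [0] * 4   # reliably right on 6 problems, never on 4

for k in (1, 16):
    print(k, mean_pass_at_k(n, base_counts, k), mean_pass_at_k(n, tuned_counts, k))
```

In this toy setup the tuned model wins at k=1 and the base model wins at k=16, which is the shape of result that gets read as narrowing toward existing solutions rather than expanding what is solvable.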


Sources (8 notes)

Why can't we easily adapt reinforcement learning to diffusion language models?

Diffusion language models cannot directly use AR-developed RL methods like GRPO and DPO because iterative non-sequential token generation requires marginalizing over denoising trajectories, making likelihood intractable. Workarounds exist—outcome-based rewards, policy learning for unmasking order, and adapted preference optimization—enabling models like DCoLT to gain 9–19% on benchmarks.

Can step-wise expert rewards help small models learn hard reasoning?

Supervised Reinforcement Learning rewards models by measuring alignment with expert actions at each step, providing dense learning signals even when all rollouts fail. This approach bridges the gap between rigid token-by-token imitation (SFT) and sparse outcome-only rewards (RLVR), and works best as a curriculum foundation before outcome-based refinement.

Can RL agents learn to reason better, not just succeed?

RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an RL researcher stress-testing a claim about diffusion language models. The question: can outcome-based rewards fully replace per-step likelihood in diffusion RL training?

What a curated library found — and when (findings from 2024–10 span the path; treat as dated claims, not current truth):
• Outcome rewards sidestep intractable per-step likelihood marginalization in parallel-denoising diffusion models, unlocking training but not cleanly replacing it (2025-08, diffusion survey).
• Dense step-level rewards (from supervised RL) enable learning even on failed rollouts; outcome-only signal gives zero gradient in dead zones — they're complementary phases, not substitutes (2025-10).
• Process rewards (reasoning, planning steps) cut repetitive actions 31% vs. outcome-only and generalize better; RL exhibits two phases where early learning is execution-driven but later bottleneck shifts to strategic planning (2025-07).
• Binary correctness rewards provably degrade calibration and incentivize confident guessing; negative reinforcement alone preserves diversity while positive-only collapses it — signal *shape* matters (2025-06).
• RLVR mostly sharpens existing base-model competence rather than expanding capability; a single training example, or even a spurious reward, can trigger near-full effect (2025-04).

Anchor papers (verify; mind their dates):
• 2025-10 arXiv:2510.25992 Supervised Reinforcement Learning
• 2025-07 arXiv:2507.22844 RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards
• 2025-06 arXiv:2506.01347 The Surprising Effectiveness of Negative Reinforcement
• 2025-04 arXiv:2504.13837 Does Reinforcement Learning Really Incentivize Reasoning Capacity

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (GPT-4o, Claude-4, DeepSeek-R1), scaling, multi-stage training pipelines, or advances in continuous reward modeling have since relaxed or overturned it. Separate the durable question (per-step vs. outcome signal trade-off in *any* generation regime) from perishable claims (specifics of diffusion model tractability, RLVR dynamics). Cite what resolved each constraint; flag what still holds.
(2) Surface the strongest *contradicting* or *superseding* work from the last ~3 months—especially any showing outcome reward alone *does* scale to complex reasoning without step-level signal, or showing diffusion models with tractable per-step likelihoods.
(3) Propose two research questions that assume the regime may have shifted: one on whether multi-token, tree-search-aware outcome rewards can close the planning gap; one on whether process supervision (step rewards) can be *learned* rather than hand-engineered, making the distinction moot.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
