Can outcome-based rewards fully replace per-step likelihood in diffusion RL training?
This explores whether, in diffusion language models, reward-on-the-final-answer can stand in entirely for the per-token likelihood signal that autoregressive RL relies on — and the corpus suggests outcome rewards make diffusion RL *possible* but leave real gaps that step-level signals fill.
The short version from the corpus: outcome-based rewards are a workaround that unlocks training, not a clean replacement. The core problem is structural. Diffusion language models generate tokens in parallel rather than left-to-right, which breaks the log-likelihood factorization that autoregressive RL methods like GRPO and DPO depend on; computing a clean per-step likelihood would require marginalizing over all denoising trajectories, which is intractable (Why can't we easily adapt reinforcement learning to diffusion language models?). Outcome-based rewards are the escape hatch precisely because they sidestep that intractable per-step term: you score the final answer and skip the likelihood bookkeeping. And it works: models like DCoLT gain 9–19% on benchmarks this way. So 'can it replace?' is partly answered by 'it already does, out of necessity.'
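To make the structural point concrete, the contrast can be written out as below. This is a sketch in assumed notation (prompt $x$, response $y$ of length $T$, and $K$ denoising steps with intermediate states $y^{(1)}, \dots, y^{(K)}$), not the formulation of any specific paper in the corpus.

```latex
% Autoregressive: the sequence log-likelihood factorizes token by token.
\log \pi_\theta(y \mid x) = \sum_{t=1}^{T} \log \pi_\theta\bigl(y_t \mid x, y_{<t}\bigr)

% Diffusion: the final answer y^{(0)} is reached through K denoising steps,
% so its likelihood is a sum over every intermediate trajectory.
\pi_\theta\bigl(y^{(0)} \mid x\bigr)
  = \sum_{y^{(1)}, \dots, y^{(K)}} p\bigl(y^{(K)}\bigr)
    \prod_{k=1}^{K} \pi_\theta\bigl(y^{(k-1)} \mid y^{(k)}, x\bigr)
```

The first expression is a sum of $T$ terms that one teacher-forced forward pass provides. The second sums over a set of trajectories that grows exponentially with sequence length, which is why a reward that never evaluates it is attractive.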
But the broader RL literature keeps showing that *dense, step-level* signal does things sparse outcome reward can't. Supervised Reinforcement Learning rewards a model by how closely each step matches expert actions, and this gives a learning signal even when every rollout fails, which is exactly the dead zone where outcome-only reward gives you nothing to learn from. The authors frame it as bridging rigid token-by-token imitation (SFT) and sparse outcome-only RLVR, and note it works best as a *curriculum foundation before* outcome-based refinement, not as a substitute for it (Can step-wise expert rewards help small models learn hard reasoning?). That ordering matters: it implies step signal and outcome signal are complementary phases, not interchangeable.
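The dead-zone argument can be illustrated with a small sketch. It assumes a GRPO-style group-normalized advantage and uses `difflib.SequenceMatcher` as a stand-in for step similarity. The function names, the example steps, and the similarity measure are illustrative assumptions, not the authors' implementation.

```python
from difflib import SequenceMatcher
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def step_reward(rollout_steps, expert_steps):
    """Mean string similarity between each step and the matching expert step.

    Assumption: string similarity is only a placeholder for whatever
    step-alignment measure a real system would use.
    """
    scores = [
        SequenceMatcher(None, step, expert).ratio()
        for step, expert in zip(rollout_steps, expert_steps)
    ]
    return mean(scores)


# Hypothetical expert trace and two rollouts that both end in a wrong answer.
expert = ["factor the quadratic", "solve each linear factor"]
rollout_a = ["factor the quadratic", "pick x = 7"]  # correct first step
rollout_b = ["pick x = 7", "pick x = 7"]            # wrong throughout

outcome_rewards = [0.0, 0.0]  # every rollout fails the final check
dense_rewards = [step_reward(r, expert) for r in (rollout_a, rollout_b)]

outcome_adv = group_advantages(outcome_rewards)  # all zeros: nothing to learn from
dense_adv = group_advantages(dense_rewards)      # rollout_a is ranked above rollout_b
```

With outcome-only reward the advantages are identically zero, so the policy gradient vanishes. The step-level reward still separates the rollout that started correctly from the one that did not.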
The same complementarity shows up from other angles. Process-level rewards that score *how* an agent reasons (planning, exploration, reflection, monitoring) cut repetitive actions by 31% versus outcome-only training while generalizing better than supervised fine-tuning alone (Can RL agents learn to reason better, not just succeed?). And RL training itself appears to move through two phases: early learning is driven by execution correctness, which outcome reward captures well, but the later bottleneck shifts to strategic planning, where concentrating optimization on planning tokens yields significant gains (Does RL training follow a predictable two-phase learning sequence?). Outcome reward is well-matched to the first phase and underpowered for the second, which is another reason 'fully replace' overreaches.
There's also a quieter warning about what outcome reward optimizes *toward*. Binary correctness rewards degrade calibration, because rewarding only the final right/wrong answer incentivizes confident guessing and never penalizes confident wrong answers (Does binary reward training hurt model calibration?). The fix there was adding the Brier score as a second reward term, which is itself evidence that a single outcome signal leaves something important unmodeled. Relatedly, work on negative-only reinforcement shows that *what kind* of trajectory-level signal you use changes diversity dramatically: suppressing wrong answers preserves Pass@k across k, while positive-only reward degrades it at higher k by concentrating probability mass (Does negative reinforcement alone outperform full reinforcement learning?). The signal's shape, not just its presence at the outcome, is doing the work.
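The calibration point can be sketched numerically. The snippet below assumes the second term takes the form of a subtracted Brier penalty, `y - (confidence - y) ** 2`, where `y` is 1 for a correct answer and 0 otherwise. That exact form, and all function names, are assumptions for illustration rather than the paper's stated objective.

```python
def binary_reward(correct):
    """Outcome-only reward: ignores how confident the model claimed to be."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct, confidence):
    """Correctness minus the Brier penalty on the stated confidence (assumed form)."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


def expected_reward(p_correct, confidence):
    """Expected augmented reward when the model is right with probability p_correct."""
    return (
        p_correct * brier_augmented_reward(True, confidence)
        + (1.0 - p_correct) * brier_augmented_reward(False, confidence)
    )


# A confident wrong answer costs more than a hesitant wrong answer,
# whereas binary_reward returns 0.0 for both.
confident_wrong = brier_augmented_reward(False, 0.95)
hesitant_wrong = brier_augmented_reward(False, 0.50)

# For a model that is right 60% of the time, search for the stated
# confidence that maximizes expected reward.
grid = [i / 100 for i in range(101)]
best_confidence = max(grid, key=lambda q: expected_reward(0.6, q))
```

Under this assumed form the expected reward simplifies to `2 * p * q - q ** 2`, which peaks when the stated confidence `q` equals the true accuracy `p`. The binary reward is flat in `q`, so it gives the model no reason to report anything other than full confidence.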
The twist worth taking away: even where outcome reward seems sufficient, it may be doing less than it looks. RLVR mostly sharpens sampling toward solutions already in the base model rather than expanding capability, and for models with appropriate pretraining a single training example, or even a spurious reward, can trigger nearly the full effect (Does RLVR actually expand what models can reason about?; What does reward learning actually do to model reasoning?). So in diffusion RL, outcome rewards can plausibly carry the load *because the heavy lifting was done in pretraining*: they activate existing competence cheaply. Where you actually need to teach new procedure, build calibration, or navigate the planning-heavy second phase, the corpus consistently points back to step-level signal. 'Replace fully' isn't the right frame; 'replace where outcome reward is enough, and reintroduce step signal where it isn't' is closer to what the evidence supports.
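The sharpening claim rests on Pass@k comparisons, which can be sketched with the commonly used combinatorial estimator: the chance that at least one of `k` samples drawn from `n` is correct when `c` of the `n` are correct. The per-problem counts below are hypothetical numbers chosen to show the pattern, not results from the cited work.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that k samples drawn without replacement from n samples,
    c of which are correct, include at least one correct sample."""
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


n = 100
# Hypothetical correct-sample counts on two problems.
base_counts = [10, 2]    # base model: rarely right, but covers both problems
tuned_counts = [90, 0]   # tuned model: sharpened on one problem, lost the other

base_at_1 = mean_pass_at_k(base_counts, n, 1)
tuned_at_1 = mean_pass_at_k(tuned_counts, n, 1)
base_at_64 = mean_pass_at_k(base_counts, n, 64)
tuned_at_64 = mean_pass_at_k(tuned_counts, n, 64)
```

In this toy setup the tuned model wins at `k = 1` and the base model wins at `k = 64`, which is the signature the Pass@k analysis describes: better sampling efficiency without a larger set of solvable problems.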
Sources (8 notes)
Diffusion language models cannot directly use AR-developed RL methods like GRPO and DPO because iterative non-sequential token generation requires marginalizing over denoising trajectories, making likelihood intractable. Workarounds exist—outcome-based rewards, policy learning for unmasking order, and adapted preference optimization—enabling models like DCoLT to gain 9–19% on benchmarks.
Supervised Reinforcement Learning rewards models by measuring alignment with expert actions at each step, providing dense learning signals even when all rollouts fail. This approach bridges the gap between rigid token-by-token imitation (SFT) and sparse outcome-only rewards (RLVR), and works best as a curriculum foundation before outcome-based refinement.
RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.