How does process-based reward differ from outcome-only reward in training?
This explores the difference between rewarding a model only for getting the final answer right (outcome-only) versus rewarding the quality of each intermediate reasoning step (process-based) — and what each does to how the model actually learns.
The distinction matters more than it first appears, and the corpus has a lot to say about why. The cleanest statement of the core trade-off is that outcome-based reward models (ORMs) are systematically *pessimistic* about intermediate steps: because they only ever see whether the final answer was right, they punish good intermediate moves that happened to sit inside a trajectory that later went wrong, producing high false-negative rates (see: Why do outcome-based reward models fail at intermediate step evaluation?). Process reward models (PRMs) fix this by scoring each step directly, but they traditionally need expensive, skilled human annotation of every step. That's the central tension: outcome rewards are cheap but blunt, process rewards are sharp but costly.
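The false-negative problem is easy to see on a toy trajectory. The sketch below is illustrative only: the function names, the 0/1 reward values, and the hand-written step labels are assumptions made for this example, not taken from any of the cited notes.

```python
from typing import List


def outcome_only_step_rewards(num_steps: int, final_correct: bool) -> List[float]:
    """Every step inherits the single final-answer reward (hypothetical 0/1 scale)."""
    reward = 1.0 if final_correct else 0.0
    return [reward] * num_steps


def process_step_rewards(step_is_sound: List[bool]) -> List[float]:
    """Each step is scored on its own merits (labels stand in for step annotation)."""
    return [1.0 if sound else 0.0 for sound in step_is_sound]


def false_negative_rate(step_is_sound: List[bool], rewards: List[float]) -> float:
    """Fraction of sound steps that nevertheless received zero reward."""
    sound_rewards = [r for sound, r in zip(step_is_sound, rewards) if sound]
    if not sound_rewards:
        return 0.0
    return sum(1 for r in sound_rewards if r == 0.0) / len(sound_rewards)


# Toy trajectory: three sound steps, then a slip in step 4 that ruins the answer.
step_is_sound = [True, True, True, False]

orm_rewards = outcome_only_step_rewards(len(step_is_sound), final_correct=False)
prm_rewards = process_step_rewards(step_is_sound)

print(false_negative_rate(step_is_sound, orm_rewards))  # 1.0: every sound step is punished
print(false_negative_rate(step_is_sound, prm_rewards))  # 0.0: sound steps keep their credit
```

The price of the second function is the `step_is_sound` list itself: someone, or something, has to produce those per-step labels.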
Much of the recent corpus is really about *escaping* that trade-off: getting process-like supervision without paying for step annotation. The most elegant trick is using the structure of the reasoning itself. Tree-search rollouts branch a problem into siblings, then compare subtrees so that a single final-answer reward is automatically converted into step-level preference signals, with no separate PRM required (see: Can tree structure alone convert outcome rewards into process supervision?). This isn't a one-off. Several methods (Tree-GRPO, Supervised RL, ToolPO) each exploit a different structural feature of a trajectory (tree topology, expert-aligned actions, tool-call positions) to turn sparse outcome rewards into dense step signals (see: Can trajectory structure replace hand-annotated process rewards?). So the line between 'outcome' and 'process' is softer than it sounds: you can manufacture process supervision out of outcome rewards if the trajectory has enough structure to mine.
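To make the sibling-comparison idea concrete, here is a minimal sketch. It is not the Tree-GRPO algorithm itself: the `Node` class, the use of mean leaf reward as a subtree's value, and the toy problem are all assumptions chosen for illustration. The only point is that leaf-level outcome rewards are enough to rank the steps at each branch point.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    step: str                              # text of the reasoning step taken here
    children: List["Node"] = field(default_factory=list)
    outcome: Optional[float] = None        # final-answer reward, set on leaves only


def leaf_outcomes(node: Node) -> List[float]:
    """Collect the final-answer rewards of every leaf under this node."""
    if not node.children:
        return [0.0 if node.outcome is None else node.outcome]
    rewards: List[float] = []
    for child in node.children:
        rewards.extend(leaf_outcomes(child))
    return rewards


def subtree_value(node: Node) -> float:
    """Assumed scoring rule: mean outcome reward over the subtree's leaves."""
    rewards = leaf_outcomes(node)
    return sum(rewards) / len(rewards)


def sibling_preferences(node: Node) -> List[Tuple[str, str]]:
    """Return (preferred_step, rejected_step) pairs for siblings at every branch."""
    pairs: List[Tuple[str, str]] = []
    kids = node.children
    for i in range(len(kids)):
        for j in range(i + 1, len(kids)):
            vi, vj = subtree_value(kids[i]), subtree_value(kids[j])
            if vi > vj:
                pairs.append((kids[i].step, kids[j].step))
            elif vj > vi:
                pairs.append((kids[j].step, kids[i].step))
            # Ties carry no preference signal, so they are skipped.
    for kid in kids:
        pairs.extend(sibling_preferences(kid))
    return pairs


tree = Node("root", children=[
    Node("factor the quadratic", children=[
        Node("read off roots", outcome=1.0),
        Node("verify by substitution", outcome=1.0),
    ]),
    Node("guess and check", children=[
        Node("try x = 2", outcome=0.0),
        Node("try x = 3", outcome=1.0),
    ]),
])

preferences = sibling_preferences(tree)
print(preferences)
# [('factor the quadratic', 'guess and check'), ('try x = 3', 'try x = 2')]
```

No step was ever labeled by hand here. Every preference pair falls out of four final-answer rewards plus the branching structure.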
A second thread is what process reward buys you that outcomes can't. Outcome-only training optimizes for *being right*, which quietly means it optimizes for *guessing confidently*: binary correctness rewards degrade calibration because a confident wrong answer isn't penalized any more than a hesitant one (see: Does binary reward training hurt model calibration?). Process-style rewards can target qualities outcomes are blind to. Rewarding metacognitive moves like planning, exploration, and reflection cuts repetitive actions by 31% relative to outcome-only methods, while generalizing better than supervised fine-tuning alone (see: Can RL agents learn to reason better, not just succeed?). There's even evidence that *how* you judge steps matters as much as whether you judge them: judges trained to reason about reasoning beat classifier-style reward models that just stamp steps good/bad (see: Can judges that reason about reasoning outperform classifier rewards?).
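The calibration point can be checked with a few lines of arithmetic. The source note on calibration mentions adding the Brier score as a second reward term; the exact way the two terms are combined below (correctness minus squared error of the stated confidence) is an assumption for illustration, as are the function names.

```python
def binary_reward(correct: bool) -> float:
    """Outcome-only reward: ignores how confident the model claimed to be."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Assumed combination: correctness minus the Brier penalty (confidence - y)^2."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


# Two wrong answers: one stated with 90% confidence, one with 50%.
print(binary_reward(False), binary_reward(False))  # identical: confidence is invisible
print(brier_augmented_reward(False, 0.9))          # heavily penalized
print(brier_augmented_reward(False, 0.5))          # penalized less

# Two right answers: confidence is now rewarded rather than ignored.
print(brier_augmented_reward(True, 0.9))
print(brier_augmented_reward(True, 0.5))
```

Under the binary reward, bluffing is free. With the added term, a confident wrong answer costs more than a hesitant one, and a confident right answer earns more than a hesitant one.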
The deeper, slightly unsettling finding is that reward, whether process or outcome, may do less 'teaching' than the framing implies. Reinforcement learning with verifiable rewards (RLVR) appears to *activate* reasoning strategies the model already learned in pretraining rather than installing new ones, to the point where, for models with appropriate pretraining, spurious rewards work nearly as well as correct ones (see: What does reward learning actually do to model reasoning?). And the value of step-level signal seems to shift over training: RL moves through a two-phase dynamic where early learning is driven by getting execution correct, and only later does strategic planning become the bottleneck (see: Does RL training follow a predictable two-phase learning sequence?). That suggests process vs. outcome isn't a fixed choice but a moving target: the kind of feedback that helps most depends on which phase the model is in.
Worth a sideways glance: the corpus complicates the whole 'reward' frame. Scalar rewards (whether per-step or per-outcome) throw away information. Natural feedback splits into *evaluative* ('how good was that') and *directive* ('here's how to change it') components, and a single number can only carry the first (see: Can scalar rewards capture all the information in agent feedback?). Other work finds you can often match full RL using only the *negative* signal, suppressing wrong trajectories while preserving diversity, whereas positive-only reinforcement tends to collapse that diversity (see: Does negative reinforcement alone outperform full reinforcement learning?). If you want to go deeper on keeping dense rewards honest, the cleanest result is that rubrics work better as *gates* that accept or reject whole rollouts than as scores converted into dense reward, which invites hacking (see: Can rubrics and dense rewards work together without hacking?).
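The gate-versus-score distinction is mostly a question of where the rubric sits in the pipeline. The sketch below is a loose illustration, not the DRO method: the `gate_groups` helper, the `min_pass_fraction` threshold, and the toy `cites_source` rubric are all hypothetical. It shows a rubric deciding which rollout groups are allowed into training at all, rather than contributing a number to the reward.

```python
from typing import Callable, List


def gate_groups(
    groups: List[List[str]],
    rubric_ok: Callable[[str], bool],
    min_pass_fraction: float = 1.0,  # hypothetical threshold, not from the source
) -> List[List[str]]:
    """Keep only rollout groups in which enough rollouts pass the rubric."""
    kept: List[List[str]] = []
    for group in groups:
        if not group:
            continue  # an empty group has nothing to train on
        passed = sum(1 for rollout in group if rubric_ok(rollout))
        if passed / len(group) >= min_pass_fraction:
            kept.append(group)
    return kept


def cites_source(rollout: str) -> bool:
    """Toy rubric (an assumption for this example): the answer must cite a source."""
    return "[source]" in rollout


groups = [
    ["answer A [source]", "answer B [source]"],  # every rollout passes
    ["answer C", "answer D [source]"],           # one rollout fails
]

accepted = gate_groups(groups, cites_source)
print(accepted)  # [['answer A [source]', 'answer B [source]']]
```

Because the rubric only ever says yes or no to a group, there is no rubric score for the policy to climb. Any finer-grained reward is then free to operate inside the groups that survived the gate.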
Sources (11 notes)
ORMs systematically underestimate intermediate steps due to training only on final outcomes, producing high false-negative rates. PRMs solve this with step-level feedback but demand costly skilled annotation, revealing a core trade-off in reward model design.
Tree-GRPO uses branching structure to transform trajectory-level outcome rewards into step-level preference signals through sibling subtree comparison, eliminating the need for separate process reward models or step-level annotation while scaling with computational budget.
Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.
StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.
DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.