How do outcome-based and process-based reward models differ in supervision cost?
This explores the central trade-off in reward model design — outcome-based models are cheap to train but blind to intermediate steps, while process-based models see every step but historically demanded expensive human annotation — and surveys the tricks the corpus has found to get process-level supervision without paying process-level cost.
Process supervision costs more than outcome supervision, and recent work is trying to collapse that gap. The cleanest statement of the trade-off: outcome-based reward models (ORMs) learn only from whether the final answer was right. That makes them cheap, since you just need the end result, but also systematically pessimistic about good intermediate steps, because a correct step that happens to sit inside a failed trajectory gets punished by association. Process reward models (PRMs) fix that by scoring each step directly, but the classic price is skilled human annotation of every step in a reasoning chain (Why do outcome-based reward models fail at intermediate step evaluation?). And the payoff is real: when you actually supervise the intermediate steps (say, each retrieval in an agentic RAG pipeline), performance significantly beats final-answer-only rewards (Does supervising retrieval steps outperform final answer rewards?). So the field isn't debating whether process supervision is better. It's debating how to afford it.
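To make the "punished by association" problem concrete, here is a minimal sketch of how the two labeling schemes treat the same failed trajectory. The `Step` class, the `is_correct` field (standing in for what a human annotator would supply), and the toy trajectory are all assumptions made up for illustration; nothing here comes from a specific paper's code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    text: str
    is_correct: bool  # hypothetical ground truth; in practice this is the costly human label


def orm_step_labels(steps: List[Step], final_answer_correct: bool) -> List[bool]:
    """Outcome supervision: every step inherits the single final-answer label."""
    return [final_answer_correct for _ in steps]


def prm_step_labels(steps: List[Step]) -> List[bool]:
    """Process supervision: each step carries its own annotated label."""
    return [s.is_correct for s in steps]


def false_negatives(steps: List[Step], labels: List[bool]) -> int:
    """Count correct steps that the labeling scheme marks as bad."""
    return sum(1 for s, label in zip(steps, labels) if s.is_correct and not label)


# A toy trajectory: two sound steps, then an arithmetic slip that ruins the answer.
trajectory = [
    Step("Let x be the unknown quantity.", True),
    Step("Set up the equation 3x + 2 = 11.", True),
    Step("Therefore 3x = 8.", False),
]

orm_labels = orm_step_labels(trajectory, final_answer_correct=False)
prm_labels = prm_step_labels(trajectory)

print("ORM labels:", orm_labels, "| labels needed: 1")
print("PRM labels:", prm_labels, "| labels needed:", len(trajectory))
print("ORM false negatives:", false_negatives(trajectory, orm_labels))
print("PRM false negatives:", false_negatives(trajectory, prm_labels))
```

The outcome scheme needs one label per trajectory but mislabels both sound steps; the process scheme gets every step right but needs one annotation per step. That ratio is the cost gap the rest of this note is about.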
The most interesting thread in the corpus is a cluster of methods that manufacture step-level signal out of cheap outcome signal, essentially getting PRM granularity at ORM prices. Tree-search rollouts do it through branching: by comparing sibling subtrees that share a prefix, a single trajectory-level reward gets decomposed into step-wise preferences automatically, with no separate PRM and no step annotation (Can tree structure alone convert outcome rewards into process supervision?). This turns out to be one instance of a broader pattern: trajectory *structure* itself can stand in for annotated process rewards, whether you exploit tree topology, expert-aligned actions, or the positions of tool calls (Can trajectory structure replace hand-annotated process rewards?). Reverse-curriculum learning gets there a different way: it slides the reasoning start point backward from near-completion, so failures surface step-by-step using nothing but outcome feedback (Can curriculum learning approximate expensive process supervision?).
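The sibling-comparison idea can be sketched in a few lines. This is a simplified illustration, not Tree-GRPO's actual algorithm: the `Node` class, the use of mean leaf reward as a subtree's value, and the toy tree are all assumptions chosen to show how outcome rewards at the leaves turn into step-level preferences at every branch point.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Optional, Tuple


@dataclass
class Node:
    step: str
    children: List["Node"] = field(default_factory=list)
    outcome: Optional[float] = None  # outcome reward; set on leaves only


def subtree_value(node: Node) -> float:
    """Assumed scoring rule: average outcome reward over the leaves below this node."""
    if not node.children:
        return node.outcome
    return mean(subtree_value(child) for child in node.children)


def sibling_preferences(node: Node, prefix: Tuple[str, ...] = ()) -> list:
    """Return (shared_prefix, preferred_step, rejected_step) for every sibling
    pair whose subtree values differ. Ties produce no preference."""
    path = prefix + (node.step,)
    prefs = []
    kids = node.children
    for i in range(len(kids)):
        for j in range(i + 1, len(kids)):
            vi, vj = subtree_value(kids[i]), subtree_value(kids[j])
            if vi > vj:
                prefs.append((path, kids[i].step, kids[j].step))
            elif vj > vi:
                prefs.append((path, kids[j].step, kids[i].step))
    for child in kids:
        prefs.extend(sibling_preferences(child, path))
    return prefs


# Toy rollout tree: only the leaves carry a reward (1.0 = correct final answer).
step_a = Node("step A", [Node("A1", outcome=1.0), Node("A2", outcome=0.0)])
step_b = Node("step B", [Node("B1", outcome=0.0), Node("B2", outcome=0.0)])
root = Node("question", [step_a, step_b])

prefs = sibling_preferences(root)
for shared_prefix, better, worse in prefs:
    print(f"after {shared_prefix}: prefer {better!r} over {worse!r}")
```

No step was ever labeled here. The preference for "step A" over "step B" falls out of the branching alone, which is why this family of methods scales with compute budget (more branches) rather than annotation budget.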
The other route to cheaper process supervision attacks the annotation bottleneck head-on. Self-supervised PRMs replace human step labels with dynamically weighted pseudo-labels and still reach o3-mini-level results, though whether this holds in fuzzy-outcome domains is unproven (Can self-supervised process rewards replace human annotation?). A second, striking efficiency finding is that PRMs trained as *generative* judges that reason about each step, rather than classifiers that score it, achieve better accuracy with orders of magnitude less training data (Can judges that reason about reasoning outperform classifier rewards?). That reframes "cost": it's not just annotation, it's also how much labeled data the reward model needs to become competent.
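One plausible reading of "dynamically weighted pseudo-labels" is sketched below. The weighting rule is an assumption for illustration only and should not be taken as the loss from the cited work: each step is pseudo-labeled with the final outcome, and a step contributes to the loss only when the reward model's current score already agrees with that pseudo-label. The function names and the 0.5 threshold are likewise hypothetical.

```python
import math
from typing import List


def dynamic_weights(step_scores: List[float], final_correct: bool,
                    threshold: float = 0.5) -> List[float]:
    """Assumed rule: weight 1.0 where the model's current step score agrees
    with the outcome pseudo-label, 0.0 where it disagrees."""
    return [1.0 if (score > threshold) == final_correct else 0.0
            for score in step_scores]


def weighted_bce(step_scores: List[float], final_correct: bool,
                 eps: float = 1e-7) -> float:
    """Binary cross-entropy against the outcome pseudo-label, averaged over
    the steps that received non-zero weight."""
    label = 1.0 if final_correct else 0.0
    weights = dynamic_weights(step_scores, final_correct)
    total = 0.0
    for score, weight in zip(step_scores, weights):
        score = min(max(score, eps), 1.0 - eps)  # clamp to keep log() finite
        total += -weight * (label * math.log(score)
                            + (1.0 - label) * math.log(1.0 - score))
    kept = sum(weights)
    return total / kept if kept else 0.0


# A failed trajectory where the model already rates the first two steps as good.
scores = [0.9, 0.8, 0.2]
print(dynamic_weights(scores, final_correct=False))
print(weighted_bce(scores, final_correct=False))
```

Under this assumed rule, the two steps the model believes are sound get masked out instead of being dragged down by the failed outcome, so the noisy outcome label is only applied where it is likely to be right. No human step label is involved at any point.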
Worth knowing if you're chasing this further: the supervision-cost question quietly bleeds into a representation question. Scalar rewards, whether outcome or process, throw away information that natural feedback carries, namely the *directive* part (how an action should change) as opposed to the *evaluative* part (how well it did) (Can scalar rewards capture all the information in agent feedback?). And there's an even cheaper frontier than either reward type: agents that treat the consequences of their own actions as supervision, learning from future states with no external reward model at all (Can agents learn from their own actions without external rewards?). The arc across the corpus is clear: the expensive thing was never "process" itself, it was *human-annotated* process, and most of the recent ingenuity is about extracting step-level signal from structure, self-supervision, or the agent's own rollouts.
Sources (9 notes)
ORMs systematically underestimate intermediate steps due to training only on final outcomes, producing high false-negative rates. PRMs solve this with step-level feedback but demand costly skilled annotation, revealing a core trade-off in reward model design.
Fine-grained feedback on intermediate retrieval steps significantly boosts agentic RAG performance compared to final-answer-only rewards. DPO trained with both positive and negative step feedback outperforms PPO and single-direction training by directly contrasting good and bad retrieval chains.
Tree-GRPO uses branching structure to transform trajectory-level outcome rewards into step-level preference signals through sibling subtree comparison, eliminating the need for separate process reward models or step-level annotation while scaling with computational budget.
Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.
R3 progressively slides the reasoning start state backward from near-completion, creating a curriculum that reveals step-level failure modes using only outcome feedback. This achieves process supervision granularity without expensive human step annotations.
MetaStone-S1's SPRM achieves o3-mini-level results using dynamic weighting of pseudo-labels instead of human-annotated steps. This eliminates the annotation bottleneck for process supervision, though generalization to fuzzy-outcome domains remains unproven.
StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Research across eight environments shows that agents can use future states from their own actions as supervision without external rewards, matching expert-dependent baselines with half the data and providing superior warm-starts for subsequent RL training.