INQUIRING LINE


Grading AI only on final answers is cheap but misleading — is it worth paying more to grade every step instead?

How do outcome-based and process-based reward models differ in supervision cost?

This explores the central trade-off in reward model design — outcome-based models are cheap to train but blind to intermediate steps, while process-based models see every step but historically demanded expensive human annotation — and surveys the tricks the corpus has found to get process-level supervision without paying process-level cost.

The cleanest statement of the trade-off: outcome-based reward models (ORMs) learn only from whether the final answer was right. That makes them cheap, since you need only the end result, but also systematically pessimistic about good intermediate steps, because a correct step that happens to sit inside a failed trajectory gets punished by association. Process reward models (PRMs) fix that by scoring each step directly, but the classic price is skilled human annotation of every step in a reasoning chain [Why do outcome-based reward models fail at intermediate step evaluation?]. The payoff is real: when you actually supervise the intermediate steps (say, each retrieval in an agentic RAG pipeline), performance improves significantly over final-answer-only rewards [Does supervising retrieval steps outperform final answer rewards?]. So the field isn't debating whether process supervision is better. It's debating how to afford it.
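The "punished by association" effect can be made concrete with a toy example. The sketch below is illustrative only: the `Trajectory` record, its `step_correct` field, and the helper names are assumptions introduced here, and real ORM training would not have access to per-step ground truth. It simply contrasts the targets an outcome-only labeler would assign with the targets a step-level annotator would assign.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    steps: List[str]
    step_correct: List[bool]  # hypothetical ground truth, used only for comparison
    final_correct: bool


def outcome_labels(traj: Trajectory) -> List[float]:
    """ORM-style targets: every step inherits the final outcome."""
    label = 1.0 if traj.final_correct else 0.0
    return [label for _ in traj.steps]


def process_labels(traj: Trajectory) -> List[float]:
    """PRM-style targets: each step is labeled on its own merits."""
    return [1.0 if ok else 0.0 for ok in traj.step_correct]


def false_negative_steps(traj: Trajectory) -> List[int]:
    """Indices of correct steps that the outcome label marks as bad."""
    labels = outcome_labels(traj)
    return [
        i for i, ok in enumerate(traj.step_correct)
        if ok and labels[i] == 0.0
    ]


# Two sound steps followed by one slip that ruins the final answer.
traj = Trajectory(
    steps=["set up equation", "isolate x", "arithmetic slip"],
    step_correct=[True, True, False],
    final_correct=False,
)
```

Here `outcome_labels(traj)` marks all three steps as bad, while `process_labels(traj)` marks only the last one; `false_negative_steps(traj)` lists the two correct steps that the outcome-only view penalizes.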

The most interesting thread in the corpus is a cluster of methods that manufacture step-level signal out of cheap outcome signal, essentially getting PRM granularity at ORM prices. Tree-search rollouts do it through branching: by comparing sibling subtrees that share a prefix, a single trajectory-level reward gets decomposed into step-wise preferences automatically, with no separate PRM and no step annotation [Can tree structure alone convert outcome rewards into process supervision?]. This turns out to be one instance of a broader pattern: trajectory *structure* itself can stand in for annotated process rewards, whether you exploit tree topology, expert-aligned actions, or the positions of tool calls [Can trajectory structure replace hand-annotated process rewards?]. Reverse-curriculum learning gets there a different way: it slides the reasoning start point backward from near-completion, so failures surface step by step using nothing but outcome feedback [Can curriculum learning approximate expensive process supervision?].
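Two of these mechanisms are easy to sketch in a few lines. The code below is a simplified illustration, not the Tree-GRPO or R3 algorithms themselves: the `Node` type, the averaging rule in `subtree_value`, and the `rollout_succeeds` callback are all assumptions made here for the example. The first half turns leaf-level outcome rewards into sibling preferences; the second half generates reverse-curriculum start states and finds the first stage at which an outcome-only check fails.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    step: str
    reward: Optional[float] = None  # outcome reward, assumed set on every leaf
    children: List["Node"] = field(default_factory=list)


def subtree_value(node: Node) -> float:
    """A leaf's value is its outcome reward; an inner node averages its children."""
    if not node.children:
        return float(node.reward)
    values = [subtree_value(child) for child in node.children]
    return sum(values) / len(values)


def sibling_preferences(node: Node, prefix: Tuple[str, ...] = ()):
    """Return (shared_prefix, preferred_step, rejected_step) for unequal siblings."""
    here = prefix + (node.step,)
    prefs = []
    for a, b in combinations(node.children, 2):
        va, vb = subtree_value(a), subtree_value(b)
        if va > vb:
            prefs.append((here, a.step, b.step))
        elif vb > va:
            prefs.append((here, b.step, a.step))
    for child in node.children:
        prefs.extend(sibling_preferences(child, here))
    return prefs


def reverse_curriculum(demo_steps: List[str]):
    """Yield (stage, prefix): each stage hands the policy one fewer demo step."""
    n = len(demo_steps)
    for stage in range(1, n + 1):
        yield stage, demo_steps[: n - stage]


def first_failing_stage(demo_steps: List[str],
                        rollout_succeeds: Callable[[List[str]], bool]):
    """First stage whose rollout fails, judged by final outcome alone."""
    for stage, prefix in reverse_curriculum(demo_steps):
        if not rollout_succeeds(prefix):
            return stage, len(prefix)
    return None


# Toy tree: branch A sometimes reaches a correct answer, branch B never does.
root = Node("question", children=[
    Node("A", children=[Node("A1", reward=1.0), Node("A2", reward=0.0)]),
    Node("B", children=[Node("B1", reward=0.0), Node("B2", reward=0.0)]),
])
prefs = sibling_preferences(root)

# Toy policy: it only succeeds when handed at least the first two demo steps.
demo = ["parse", "plan", "compute", "answer"]
failure = first_failing_stage(demo, lambda prefix: len(prefix) >= 2)
```

In the toy tree, `prefs` prefers A over B at the root and A1 over A2 under A, even though only leaf outcomes were scored. In the curriculum example, `failure` is `(3, 1)`: the first failure appears once the policy must produce `demo[1]` itself, which localizes the weak step without any step labels.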

The other route to cheaper process supervision attacks the annotation bottleneck head-on. Self-supervised PRMs replace human step labels with dynamically weighted pseudo-labels and still reach o3-mini-level results, though whether this holds in fuzzy-outcome domains is unproven [Can self-supervised process rewards replace human annotation?]. And a striking efficiency finding: PRMs trained as *generative* judges that reason about each step, rather than classifiers that score it, achieve better accuracy with orders of magnitude less training data [Can judges that reason about reasoning outperform classifier rewards?]. That reframes "cost": it is not just annotation, it is also how much labeled data the reward model needs to become competent.

Worth knowing if you're chasing this further: the supervision-cost question quietly bleeds into a representation question. Scalar rewards, whether outcome or process, throw away information that natural feedback carries, namely the *directive* part (how an action should change) as opposed to the *evaluative* part (how well it did) [Can scalar rewards capture all the information in agent feedback?]. And there's an even cheaper frontier than either reward type: agents that treat the consequences of their own actions as supervision, learning from future states with no external reward model at all [Can agents learn from their own actions without external rewards?]. The arc across the corpus is clear: the expensive thing was never "process" itself, it was *human-annotated* process, and most of the recent ingenuity is about extracting step-level signal from structure, self-supervision, or the agent's own rollouts.

Sources 9 notes

Why do outcome-based reward models fail at intermediate step evaluation?

ORMs systematically underestimate intermediate steps due to training only on final outcomes, producing high false-negative rates. PRMs solve this with step-level feedback but demand costly skilled annotation, revealing a core trade-off in reward model design.

Does supervising retrieval steps outperform final answer rewards?

Fine-grained feedback on intermediate retrieval steps significantly boosts agentic RAG performance compared to final-answer-only rewards. DPO trained with both positive and negative step feedback outperforms PPO and single-direction training by directly contrasting good and bad retrieval chains.

Can tree structure alone convert outcome rewards into process supervision?

Tree-GRPO uses branching structure to transform trajectory-level outcome rewards into step-level preference signals through sibling subtree comparison, eliminating the need for separate process reward models or step-level annotation while scaling with computational budget.

Can trajectory structure replace hand-annotated process rewards?

Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.

Can curriculum learning approximate expensive process supervision?

R3 progressively slides the reasoning start state backward from near-completion, creating a curriculum that reveals step-level failure modes using only outcome feedback. This achieves process supervision granularity without expensive human step annotations.


Can self-supervised process rewards replace human annotation?

MetaStone-S1's SPRM achieves o3-mini-level results using dynamic weighting of pseudo-labels instead of human-annotated steps. This eliminates the annotation bottleneck for process supervision, though generalization to fuzzy-outcome domains remains unproven.

Can judges that reason about reasoning outperform classifier rewards?

StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can agents learn from their own actions without external rewards?

Research across eight environments shows that agents can use future states from their own actions as supervision without external rewards, matching expert-dependent baselines with half the data and providing superior warm-starts for subsequent RL training.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Reasoning Language Models: A Blueprint (5.08 match)
Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning (4.14 match)
Test-Time Scaling with Reflective Generative Model (3.39 match)
StepWiser: Stepwise Generative Judges for Wiser Reasoning (3.35 match)
OpenClaw-RL: Train Any Agent Simply by Talking (3.25 match)
Tree Search for LLM Agent Reinforcement Learning (2.57 match)
GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning (2.47 match)
Let’s Verify Step by Step (2.47 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reward modeling researcher evaluating whether the outcome vs. process supervision cost trade-off still holds. The question remains open: *Can we achieve process-level supervision granularity without proportional annotation burden?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat as perishable constraints:
• Outcome-based reward models (ORMs) are systematically pessimistic about intermediate steps in failed trajectories, harming intermediate step learning (~2024).
• Process reward models (PRMs) outperform outcome-only rewards substantially but require costly human step annotation (~2024–2025).
• Tree-search rollouts can decompose single trajectory rewards into step-wise preferences without separate PRM annotation, using only branching structure (~2025).
• Self-supervised PRMs using pseudo-labels achieve o3-mini-level results, though fuzzy-outcome generalization is unproven (~2025).
• Generative stepwise judges (meta-reasoning over steps) outperform scalar classifiers with orders-of-magnitude less training data (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2402.05808 (2024) — Reverse Curriculum RL
• arXiv:2509.21240 (2025) — Tree Search for LLM Agent RL
• arXiv:2508.19229 (2025) — StepWiser Generative Judges
• arXiv:2510.08558 (2025) — Early Experience as Supervision

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above, determine whether newer models, training recipes, or multi-agent orchestration have since RELAXED it. Separate the durable question (e.g., *can we avoid human step annotation entirely?*) from the perishable limitation (e.g., *self-supervised PRMs fail on fuzzy outcomes*). Cite what relaxed each constraint, or state plainly where it still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — especially work claiming process supervision is *not* worth the cost, or that outcome signals suffice with better architecture.
(3) Propose 2 research questions that ASSUME the cost regime may have shifted: e.g., what if generative judges + tree search make annotation costs negligible? What if long-horizon credit assignment dissolves the ORM pessimism problem?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
