INQUIRING LINE

Does grading an AI only on its final answer reward good thinking — or just lucky shortcuts?

How does process-based reward differ from outcome-only reward in training?

This explores the difference between rewarding a model only for its final answer (outcome-only) and scoring each step of its reasoning along the way (process-based), and the corpus has a lot to say about why that distinction matters more than it first appears. The cleanest statement of the core trade-off is that outcome-based reward models (ORMs) are systematically *pessimistic* about intermediate steps: because they only ever see whether the final answer was right, they punish good intermediate moves that happened to sit inside a trajectory that later went wrong, producing high false-negative rates (see: Why do outcome-based reward models fail at intermediate step evaluation?). Process reward models (PRMs) fix this by scoring each step directly, but they traditionally need expensive, skilled human annotation of every step. That is the central tension: outcome rewards are cheap but blunt, process rewards are sharp but costly.
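The pessimism described above can be illustrated with a toy example. This is a minimal sketch, not how any real ORM is trained: it assumes the simplest possible outcome-only credit scheme (every step inherits the final label), and the trajectory and the names `outcome_step_labels` and `false_negative_rate` are hypothetical.

```python
from typing import List


def outcome_step_labels(num_steps: int, final_correct: bool) -> List[float]:
    """Outcome-only credit (assumed scheme): every step inherits the final label."""
    return [1.0 if final_correct else 0.0] * num_steps


def false_negative_rate(step_is_good: List[bool], labels: List[float]) -> float:
    """Fraction of genuinely good steps that were labeled 0."""
    good_labels = [lab for ok, lab in zip(step_is_good, labels) if ok]
    if not good_labels:
        return 0.0
    return sum(1 for lab in good_labels if lab == 0.0) / len(good_labels)


# Hypothetical trajectory: three sound steps, then a slip that ruins the answer.
step_is_good = [True, True, True, False]

# Outcome-only view: the final answer was wrong, so every step is marked bad.
outcome_labels = outcome_step_labels(len(step_is_good), final_correct=False)

# Process view: what per-step annotation would assign (the costly part).
process_labels = [1.0 if ok else 0.0 for ok in step_is_good]

print(false_negative_rate(step_is_good, outcome_labels))
print(false_negative_rate(step_is_good, process_labels))
```

In this toy case the outcome-only labels mark all three sound steps as bad, while the per-step labels mark none of them as bad, which is the blunt-versus-sharp contrast in miniature.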

Much of the recent corpus is really about *escaping* that trade-off: getting process-like supervision without paying for step annotation. The most elegant trick is using the structure of the reasoning itself. Tree-search rollouts branch a problem into siblings, then compare subtrees so that a single final-answer reward is automatically converted into step-level preference signals, with no separate PRM required (see: Can tree structure alone convert outcome rewards into process supervision?). This isn't a one-off: several methods exploit different structural features of a trajectory (tree topology, expert-aligned actions, tool-call positions) to turn sparse outcome rewards into dense step signals (see: Can trajectory structure replace hand-annotated process rewards?). So the line between 'outcome' and 'process' is softer than it sounds: you can manufacture process supervision out of outcome rewards if the trajectory has enough structure to mine.
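The sibling-comparison idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the Tree-GRPO algorithm itself: the `Node` type, the averaging rule in `subtree_value`, and the pairwise preference output are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import List, Optional, Tuple


@dataclass
class Node:
    step: str
    reward: Optional[float] = None  # outcome reward, set only on leaves
    children: List["Node"] = field(default_factory=list)


def subtree_value(node: Node) -> float:
    """Leaf: its outcome reward. Internal node: mean of its children's values."""
    if not node.children:
        return float(node.reward or 0.0)
    return sum(subtree_value(c) for c in node.children) / len(node.children)


def sibling_preferences(node: Node) -> List[Tuple[str, str]]:
    """Return (preferred_step, rejected_step) pairs for siblings with unequal values."""
    prefs: List[Tuple[str, str]] = []
    for a, b in combinations(node.children, 2):
        va, vb = subtree_value(a), subtree_value(b)
        if va > vb:
            prefs.append((a.step, b.step))
        elif vb > va:
            prefs.append((b.step, a.step))
    for child in node.children:
        prefs.extend(sibling_preferences(child))
    return prefs


# Hypothetical rollout tree: only the leaves carry a final-answer reward.
root = Node("problem", children=[
    Node("A", children=[Node("A1", reward=1.0), Node("A2", reward=0.0)]),
    Node("B", children=[Node("B1", reward=0.0), Node("B2", reward=0.0)]),
])

prefs = sibling_preferences(root)
print(prefs)
```

No step was ever annotated here: the preference for step A over step B, and for A1 over A2, falls out of leaf-level outcome rewards plus the branching structure. Siblings with equal values (B1 and B2) yield no signal, which is one reason such schemes depend on enough branching.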

A second thread is what process reward buys you that outcomes can't. Outcome-only training optimizes for *being right*, which quietly means it optimizes for *guessing confidently*: binary correctness rewards degrade calibration because a confident wrong answer isn't penalized any more than a hesitant one (see: Does binary reward training hurt model calibration?). Process-style rewards can target qualities outcomes are blind to. Rewarding metacognitive moves such as planning, exploration, and reflection cuts repetitive actions by 31% relative to outcome-only methods, which care only about the endpoint, while generalizing better than supervised fine-tuning alone (see: Can RL agents learn to reason better, not just succeed?). There's even evidence that *how* you judge steps matters as much as whether you judge them: judges trained to reason about reasoning beat classifier-style reward models that just stamp steps good or bad (see: Can judges that reason about reasoning outperform classifier rewards?).
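The calibration point can be made concrete with the Brier-score term mentioned in the calibration note in the sources. The sketch below is an assumption about the form of that reward (correctness minus squared error between stated confidence and correctness); the function names and the confidence values are hypothetical.

```python
def binary_reward(correct: bool) -> float:
    """Outcome-only reward: ignores how confident the model claimed to be."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Assumed form: correctness minus the Brier penalty (confidence - y)^2."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


# Two wrong answers, one stated with high confidence and one hesitantly.
confident_wrong = brier_augmented_reward(correct=False, confidence=0.95)
hesitant_wrong = brier_augmented_reward(correct=False, confidence=0.55)

# Binary reward cannot tell these two cases apart.
print(binary_reward(False), binary_reward(False))
# The Brier-augmented reward penalizes the confident miss more heavily.
print(confident_wrong, hesitant_wrong)
```

Under the binary reward both wrong answers score 0, so overclaiming confidence is free. With the extra term the confident miss scores about -0.90 and the hesitant one about -0.30, so the model is no longer indifferent to how sure it claims to be.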

The deeper, slightly unsettling finding is that reward, whether process or outcome, may do less 'teaching' than the framing implies. RLVR appears to *activate* reasoning strategies the model already learned in pretraining rather than installing new ones, to the point where spurious rewards work nearly as well as correct ones for models with appropriate pretraining (see: What does reward learning actually do to model reasoning?). And the value of step-level signal seems to shift over training: RL moves through a two-phase dynamic where early learning is driven by getting execution correct, and only later does strategic planning become the bottleneck (see: Does RL training follow a predictable two-phase learning sequence?). That suggests process vs. outcome isn't a fixed choice but a moving target: the kind of feedback that helps most depends on which phase the model is in.

Worth a sideways glance: the corpus complicates the whole 'reward' frame. Scalar rewards (whether per-step or per-outcome) throw away information. Natural feedback splits into *evaluative* ('how good was that') and *directive* ('here's how to change it'), and a single number can only carry the first (see: Can scalar rewards capture all the information in agent feedback?). Other work finds you can often match full RL using only the *negative* signal: suppressing wrong trajectories preserves output diversity, whereas positive-only reinforcement tends to collapse it (see: Does negative reinforcement alone outperform full reinforcement learning?). If you want to go deeper on keeping dense rewards honest, the cleanest result is that rubrics work better as *gates* that accept or reject whole rollouts than as scores converted into dense reward, a conversion that invites hacking (see: Can rubrics and dense rewards work together without hacking?).
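The gate-versus-score distinction for rubrics can be sketched as follows. This is a loose illustration, not DRO's actual procedure: it assumes a per-rollout pass/fail rubric check and a simple mean-baseline advantage, and the `Rollout` type, `gated_advantages`, and all values are hypothetical.

```python
from statistics import mean
from typing import List, NamedTuple


class Rollout(NamedTuple):
    text: str
    rubric_pass: bool    # categorical rubric verdict (assumed to be given)
    dense_reward: float  # aggregate of token-level reward (assumed to be given)


def gated_advantages(group: List[Rollout]) -> List[float]:
    """Rubric as gate: rejected rollouts are dropped before any reward math."""
    kept = [r for r in group if r.rubric_pass]
    if len(kept) < 2:
        return []  # too few accepted rollouts to form a baseline
    baseline = mean(r.dense_reward for r in kept)
    return [r.dense_reward - baseline for r in kept]


# Hypothetical group: rollout "c" games the dense reward but fails the rubric.
group = [
    Rollout("a", rubric_pass=True, dense_reward=0.8),
    Rollout("b", rubric_pass=True, dense_reward=0.4),
    Rollout("c", rubric_pass=False, dense_reward=0.99),
]

advantages = gated_advantages(group)
print(advantages)
```

Because the rubric never becomes part of the number being maximized, the high-scoring but rubric-failing rollout contributes nothing to the update, and the dense reward only ranks answers that already passed the gate.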

Sources 11 notes

Why do outcome-based reward models fail at intermediate step evaluation?

ORMs systematically underestimate intermediate steps due to training only on final outcomes, producing high false-negative rates. PRMs solve this with step-level feedback but demand costly skilled annotation, revealing a core trade-off in reward model design.

Can tree structure alone convert outcome rewards into process supervision?

Tree-GRPO uses branching structure to transform trajectory-level outcome rewards into step-level preference signals through sibling subtree comparison, eliminating the need for separate process reward models or step-level annotation while scaling with computational budget.

Can trajectory structure replace hand-annotated process rewards?

Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can RL agents learn to reason better, not just succeed?

RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.

Can judges that reason about reasoning outperform classifier rewards?

StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Can rubrics and dense rewards work together without hacking?

DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Reasoning Language Models: A Blueprint (3.43 match)
Reinforcement Learning with Rubric Anchors (3.32 match)
Reward Reasoning Model (3.27 match)
The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning (2.54 match)
From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR (2.54 match)
Test-Time Scaling with Reflective Generative Model (2.53 match)
StepWiser: Stepwise Generative Judges for Wiser Reasoning (2.52 match)
OpenClaw-RL: Train Any Agent Simply by Talking (2.43 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM research analyst. The question: **Does process-based reward (scoring intermediate steps) fundamentally outperform outcome-only reward (final-answer scoring) in training agentic LLMs, or does the gap dissolve under different model scales, training regimes, or evaluation metrics?**

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–09 through 2026–02.
- Outcome-only rewards are **systematically pessimistic** about intermediate steps, producing high false-negative rates on good moves in failed trajectories (2024–09, 2025–05).
- Process supervision can be **manufactured from outcome rewards** via tree-search rollouts and structural trajectory mining without separate PRM annotation (2025–09).
- Binary outcome rewards **degrade calibration** by incentivizing confident guessing, while rewarding **metacognitive behaviors** (planning, reflection) cuts repetitive actions by ~31% relative to outcome-only training (2025–05, 2025–08).
- **RLVR appears to *activate* pre-learned strategies rather than install new ones; spurious rewards work nearly as well as correct ones** (2025–07).
- RL training exhibits a **two-phase dynamic**: the early phase is driven by execution correctness; strategic planning becomes the bottleneck only later (2025–06, 2026–02).

Anchor papers (verify; mind their dates):
- arXiv:2505.14674 *Reward Reasoning Model* (2025–05)
- arXiv:2509.21240 *Tree Search for LLM Agent Reinforcement Learning* (2025–09)
- arXiv:2507.14843 *The Invisible Leash: Why RLVR May Not Escape Its Origin* (2025–07)
- arXiv:2508.19229 *StepWiser: Stepwise Generative Judges for Wiser Reasoning* (2025–08)

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For each finding above, assess whether newer models, scaling laws, multi-agent orchestration (memory/caching), or evaluation tooling have since **relaxed or overturned** the limitation. Specifically: Does the "pessimism" of outcome rewards persist at model scales >100B parameters? Has the cost of step-level annotation dropped via crowdsourcing or synthetic data since 2025–06? Does the two-phase dynamic hold across different reasoning tasks (math, code, planning), or is it task-specific? Separate the **durable question** (whether process fundamentally teaches better cognition) from the **perishable limitation** (cost/annotation burden).

(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last ~6 months — any paper showing process and outcome converge, or outcome alone matching process under certain conditions.

(3) **Propose 2 research questions that ASSUME the regime may have moved:** (a) If outcome rewards now yield process-like feedback via architectural choices (e.g., chain-of-thought masking, intermediate checkpoints), does the *label cost* disappear while the *reasoning quality* gap closes? (b) If RLVR merely activates pre-learned reasoning, does scaling pretraining (e.g., more diverse reasoning tasks in SFT) obviate the need for expensive RL reward design altogether?

Cite arXiv IDs; flag anything you cannot ground in a real paper.