INQUIRING LINE

Does process supervision recover reasoning accuracy better than outcome rewards in latent space?

This explores whether step-by-step (process) feedback restores reasoning accuracy more effectively than answer-only (outcome) rewards — and reads the 'latent space' angle as a question about where reasoning actually lives in the model.


The corpus reframes the question in a way you might not expect: the gap between grading the reasoning steps (process supervision) and grading only the final answer (outcome rewards) is smaller than it looks, because much of the reasoning is already latent in the model and the reward's real job is to surface it. A striking thread is that outcome-only signals leave critical information on the table: numerical rewards tell a model *that* it failed but not *why*, and models stuck on plateaus break through when given chain-of-thought critiques instead of scalars (Can natural language feedback overcome numerical reward plateaus?). That's the core case for process supervision: denser, step-level information recovers accuracy that scalar outcome rewards can't.

But process supervision is expensive (it needs human step annotations), and the corpus shows several ways to get its granularity cheaply from outcome signals. Reverse-curriculum RL slides the starting point of a problem backward from near-completion, exposing step-level failure modes using only final-answer feedback: process-supervision resolution without the labels (Can curriculum learning approximate expensive process supervision?). And when you do build process judges, the design matters more than the process/outcome split: generative judges that *reason about* each reasoning step beat classifier-style step scorers, and do it with orders of magnitude less data (Can judges that reason about reasoning outperform classifier rewards?). Process reward can also be mined structurally, from what search agents read but don't cite, which captures intermediate quality while blocking the reward-hacking that step-level signals are prone to (Can search agent behavior yield reliable process rewards for reasoning?).
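The reverse-curriculum idea can be sketched in a few lines. This is a minimal illustration, not the R3 implementation: `reverse_curriculum`, `first_failing_stage`, and `toy_rollout` are hypothetical names, and the toy policy stands in for a real model rollout. It shows how a final-answer check alone can point at the step where the policy breaks down.

```python
from typing import Callable, List, Sequence


def reverse_curriculum(demo_steps: Sequence[str]) -> List[List[str]]:
    """Start states from near-complete down to the bare problem.

    Stage 0 hands the policy all but the last step of a correct demonstration;
    each later stage hands it one step fewer.
    """
    return [list(demo_steps[:k]) for k in range(len(demo_steps) - 1, -1, -1)]


def first_failing_stage(
    stages: Sequence[List[str]],
    rollout: Callable[[List[str]], str],
    gold_answer: str,
) -> int:
    """Return the first stage whose rollout misses the gold answer, or -1.

    Only the final answer is checked (an outcome reward), yet the stage index
    points at the step the policy could not produce on its own.
    """
    for stage, prefix in enumerate(stages):
        if rollout(prefix) != gold_answer:
            return stage
    return -1


# Toy stand-in for a policy (assumption): it reaches the answer only if
# step "b" is already given in the prefix.
def toy_rollout(prefix: List[str]) -> str:
    return "42" if "b" in prefix else "?"


stages = reverse_curriculum(["a", "b", "c"])
failing = first_failing_stage(stages, toy_rollout, gold_answer="42")
```

Here the first failure appears at the stage where step "b" is no longer supplied, which is the step-level information a process label would otherwise have to provide.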

Now the 'latent space' twist, which complicates the premise. A line of work argues reward learning doesn't *teach* reasoning at all; it *activates* strategies already present in the base model. RLVR improves how efficiently a model samples within its existing capability boundary without expanding it, and spurious rewards work nearly as well as correct ones for models with the right pretraining (What does reward learning actually do to model reasoning?). Five independent mechanisms (RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR) all elicit reasoning sitting latent in base-model activations, suggesting post-training *selects* rather than creates (Do base models already contain hidden reasoning ability?). If that's true, the real question isn't 'process vs. outcome' but 'which signal best elicits what's already there', and reasoning verbosity itself turns out to be a single steerable linear direction in activation space (Can we steer reasoning toward brevity without retraining?).
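To make "a single steerable linear direction" concrete, here is a rough sketch using a difference of mean activations, a common way to build a steering vector. Whether the cited method computes its vector exactly this way is an assumption; the function names and toy activations are hypothetical, and a real setup would read hidden states from a model layer rather than from hand-written lists.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def steering_vector(verbose_acts: Sequence[Vector],
                    concise_acts: Sequence[Vector]) -> Vector:
    """Direction pointing from verbose activations toward concise ones."""
    mu_verbose = mean_vector(verbose_acts)
    mu_concise = mean_vector(concise_acts)
    return [c - v for c, v in zip(mu_concise, mu_verbose)]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift a hidden state along the direction; alpha sets the strength."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy paired activations (assumption): two verbose and two concise examples.
verbose = [[1.0, 0.0], [3.0, 0.0]]
concise = [[0.0, 1.0], [0.0, 3.0]]
direction = steering_vector(verbose, concise)
steered = steer([1.0, 1.0], direction, alpha=0.5)
```

The design point is that no weights change: the same vector is added at inference time, which is what makes this kind of intervention training-free.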

Seen that way, the most interesting alternatives sidestep both human process labels and external verifiers entirely. Model confidence can serve as an intrinsic reward that ranks reasoning traces, strengthening step-by-step reasoning while *reversing* the calibration damage RLHF causes, with no human labels needed (Can model confidence work as a reward signal for reasoning?). RL training can also redirect a model's extended thinking from counterproductive self-doubt into productive gap analysis, showing the signal shapes reasoning *quality*, not just quantity (Does extended thinking help or hurt model reasoning?).
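Ranking traces by confidence to build synthetic preferences might look roughly like the sketch below. The confidence measure (geometric mean of answer-token probabilities), the `margin` threshold, and all names are assumptions chosen for illustration, not details confirmed by the RLSF note.

```python
import math
from itertools import combinations
from typing import List, NamedTuple, Sequence, Tuple


class Trace(NamedTuple):
    text: str
    answer_token_probs: List[float]  # probabilities of the answer-span tokens


def answer_confidence(trace: Trace) -> float:
    """Geometric mean of the answer-span token probabilities."""
    if not trace.answer_token_probs:
        raise ValueError("trace has no answer tokens")
    # Clamp to avoid log(0) on degenerate probabilities.
    logs = [math.log(max(p, 1e-12)) for p in trace.answer_token_probs]
    return math.exp(sum(logs) / len(logs))


def preference_pairs(traces: Sequence[Trace],
                     margin: float = 0.1) -> List[Tuple[Trace, Trace]]:
    """(chosen, rejected) pairs whose confidence gap exceeds the margin."""
    ranked = sorted(traces, key=answer_confidence, reverse=True)
    return [
        (chosen, rejected)
        for chosen, rejected in combinations(ranked, 2)
        if answer_confidence(chosen) - answer_confidence(rejected) > margin
    ]


# Toy traces (assumption): probabilities are made up for illustration.
sure = Trace("trace A", [0.9, 0.9])
close = Trace("trace C", [0.88, 0.88])
unsure = Trace("trace B", [0.5, 0.5])
pairs = preference_pairs([unsure, sure, close])
```

The margin drops near-ties such as A versus C, so only clearly separated traces become training preferences; the resulting pairs could then feed any standard preference-optimization step.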

The takeaway you didn't know you wanted: 'process beats outcome' is the wrong frame to win on. The corpus suggests that outcome rewards underperform mainly because scalars are information-poor, that process-level granularity is recoverable from outcome signals via clever curricula, and that, because so much reasoning is latent, the deepest lever may be the *interface* carrying the feedback rather than where you place the reward. One synthesis goes further: the reusable unit of reasoning improvement isn't a dataset but a feedback interface entangled with its verifier, base model, lineage, optimizer, scaffold, and budget. Change any one and the same data behaves differently (What is the actual reusable unit of reasoning data?).


Sources 10 notes

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Can curriculum learning approximate expensive process supervision?

R3 progressively slides the reasoning start state backward from near-completion, creating a curriculum that reveals step-level failure modes using only outcome feedback. This achieves process supervision granularity without expensive human step annotations.

Can judges that reason about reasoning outperform classifier rewards?

StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.

Can search agent behavior yield reliable process rewards for reasoning?

LongTraceRL mines entity-level reasoning signals from what search agents read but don't cite—the hardest distractors—and applies rubric rewards only to correct answers, structurally blocking reward fabrication while capturing intermediate reasoning quality.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

What is the actual reusable unit of reasoning data?

The reusable unit in post-training is a feedback interface entangled with six factors: verifier, base model, lineage, optimizer, scaffold, and budget. Changing any one alters the same data's effect, making attribution tractable only when these are jointly released.

Next inquiring lines