INQUIRING LINE

What information do numerical rewards fail to provide for reasoning tasks?

This explores what a single number — the reward signal in reinforcement learning — leaves out when a model is learning to reason, and what kinds of feedback can fill that gap.


The corpus converges on a clear diagnosis: a scalar reward tells a model *how well* it did but never *why* it failed or *how* to change. One note frames this precisely: agent feedback carries two orthogonal kinds of information, evaluative (how good was this?) and directive (what should be different?), and a scalar collapses both into one axis, throwing away the directive half (Can scalar rewards capture all the information in agent feedback?). That missing directive content is exactly what stalls learning: models that plateau under numerical rewards start solving problems again when handed chain-of-thought critiques that explain the failure, suggesting the plateau was never a capability ceiling but information starvation (Can natural language feedback overcome numerical reward plateaus?).
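To make the evaluative/directive split concrete, here is a minimal sketch. The `Feedback` class and `to_scalar` function are hypothetical names invented for illustration, not taken from any of the cited notes. It shows two failures that need different fixes but become indistinguishable once reduced to a number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    score: float    # evaluative: how good was the attempt?
    directive: str  # directive: what should change?


def to_scalar(feedback: Feedback) -> float:
    """Collapse feedback to a scalar reward; the directive is dropped."""
    return feedback.score


# Two hypothetical failed attempts that call for different corrections.
a = Feedback(0.0, "Sign error when moving 3 across the equals sign.")
b = Feedback(0.0, "Wrong formula: used circumference instead of area.")

# Both collapse to the same reward, so the learner cannot tell them apart.
same_reward = to_scalar(a) == to_scalar(b)
print(same_reward)
```

The point of the sketch is only that the mapping is many-to-one: whatever distinguishes the two critiques has to reach the learner through some channel other than the reward.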

The sparsity problem is the other half of the story. An outcome-only reward says nothing until the very end, and says nothing at all when every attempt fails, so a model that can't yet solve a hard problem gets a flat zero with no gradient to climb. Several notes attack this by manufacturing the per-step signal the scalar omits: step-wise expert-similarity rewards give dense feedback even when all rollouts are wrong (Can step-wise expert rewards help small models learn hard reasoning?), and information-theoretic methods compute each step's contribution to the final answer without any human annotation (Can we reward reasoning steps without human annotation?). Relatedly, judges that *reason about* the reasoning, producing a critique chain rather than a classification score, outperform discriminative reward models, because the act of reasoning recovers the explanatory information a bare score discards (Can judges that reason about reasoning outperform classifier rewards?; Can reward models benefit from reasoning before scoring?).
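The "flat zero" case can be sketched numerically. The example below assumes a GRPO-style group-relative advantage (reward minus group mean, divided by group standard deviation), which is one common way to turn rewards into a learning signal; the notes above do not specify this formula. The `step_similarity` values are made-up stand-ins for a step-wise expert-similarity score.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Outcome-only reward: all four rollouts fail a hard problem.
outcome_rewards = [0.0, 0.0, 0.0, 0.0]
outcome_adv = group_advantages(outcome_rewards)

# Hypothetical dense signal: 1.0 where a step matches the expert trace.
step_similarity = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
]
dense_rewards = [mean(steps) for steps in step_similarity]
dense_adv = group_advantages(dense_rewards)

print(outcome_adv)  # every entry is zero: nothing to learn from
print(dense_adv)    # non-zero: failed rollouts are still ranked
```

Under this assumption, a group of uniformly failed rollouts contributes no update at all, while the dense variant still separates "got two steps right" from "got none right" even though every final answer is wrong.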

Here's the genuinely surprising turn, though. A second cluster of notes suggests the reward may not be teaching reasoning at all, which reframes what it's even *supposed* to provide. RLVR appears to activate strategies already latent in pretraining rather than installing new ones: a single training example can lift math accuracy from 36% to 73.6% (Can a single training example unlock mathematical reasoning?), and even spurious or random rewards work nearly as well as correct ones for models with the right pretraining (What does reward learning actually do to model reasoning?). If reward is mostly an activation switch, then asking it to also *carry* fine-grained instructional content is asking the wrong tool to do two jobs.

That tension resolves into a design principle running through the corpus: keep the scalar for what it's good at, and source the missing information elsewhere. Rubrics work better as gates that accept or reject whole rollouts than as scores converted into dense reward: using them as a reward invites hacking, while using them as a filter preserves their categorical judgment and lets token-level signals optimize inside the valid region (Can rubrics and dense rewards work together without hacking?). And the information a reward provides isn't even uniform across tasks: knowledge lives in lower network layers, reasoning in higher ones, so the same reasoning-reward that sharpens math can corrode knowledge-heavy domains like medicine (Why does reasoning training help math but hurt medical tasks?).
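The gate-versus-reward distinction can be sketched as follows. Everything here is an illustrative assumption: `Rollout`, `passes_rubric`, and `gated_scores` are hypothetical names, the rubric is a toy string check, and the gate is applied per rollout for simplicity rather than per rollout group.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    text: str
    token_rewards: list[float]  # hypothetical dense, token-level signal


def passes_rubric(rollout: Rollout) -> bool:
    """Toy categorical rubric: the rollout must state a final answer."""
    return "answer:" in rollout.text.lower()


def gated_scores(rollouts):
    """Use the rubric as a filter, then score survivors by dense reward only."""
    accepted = [r for r in rollouts if passes_rubric(r) and r.token_rewards]
    return {r.text: sum(r.token_rewards) / len(r.token_rewards) for r in accepted}


rollouts = [
    Rollout("Answer: 42", [0.25, 0.75]),
    Rollout("padding padding padding", [1.0, 1.0]),  # high dense reward, no answer
    Rollout("Work shown. Answer: 41", [0.125, 0.125]),
]
gated = gated_scores(rollouts)
print(gated)
```

In this sketch the second rollout has the highest token-level reward but never enters the comparison, because the rubric decides eligibility and is never added into the score. That is the property the design principle relies on: there is no rubric score to inflate.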

The thing you might not have known you wanted to know: the reliability of the reward signal itself is suspect. On contaminated benchmarks, RLVR's apparent reasoning gains turn out to be memorization: Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on the post-release LiveMathBench (Does RLVR success on math benchmarks reflect genuine reasoning improvement?). On clean benchmarks, only correct rewards improve performance, while random and inverse rewards fail or degrade it, a result that qualifies the spurious-reward finding above. So numerical rewards don't just fail to provide the *why* and the *how*; sometimes they fail to honestly report the *whether*.
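A partial-prompt reconstruction probe of the kind described can be sketched roughly as below. This is an assumption-laden illustration, not the cited study's protocol: the 60% prefix fraction, the exact-match criterion, and the `reconstruction_rate` and `complete` names are all hypothetical, and the "model" is a stub that has memorized one problem.

```python
def reconstruction_rate(problems, complete, prefix_fraction=0.6):
    """Share of problems whose held-back suffix the model reproduces exactly."""
    hits = 0
    for text in problems:
        cut = int(len(text) * prefix_fraction)
        prefix, suffix = text[:cut], text[cut:]
        if complete(prefix).strip() == suffix.strip():
            hits += 1
    return hits / len(problems)


problems = [
    "Find x if 2x + 3 = 11.",
    "Compute the area of a circle of radius 3.",
]
memorized = [problems[0]]  # stand-in for benchmark text seen in pretraining


def complete(prefix: str) -> str:
    """Stub model: continues a prompt only if it matches memorized text."""
    for text in memorized:
        if text.startswith(prefix):
            return text[len(prefix):]
    return ""


rate = reconstruction_rate(problems, complete)
print(rate)
```

A high rate on an older benchmark paired with a near-zero rate on a post-release one is the pattern that points to memorization rather than reasoning, since a model that truly reasons has no way to predict the unseen half of a problem statement.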


Sources (11 notes)

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Can step-wise expert rewards help small models learn hard reasoning?

Supervised Reinforcement Learning rewards models by measuring alignment with expert actions at each step, providing dense learning signals even when all rollouts fail. This approach bridges the gap between rigid token-by-token imitation (SFT) and sparse outcome-only rewards (RLVR), and works best as a curriculum foundation before outcome-based refinement.

Can we reward reasoning steps without human annotation?

L2T uses PAC-Bayes bounds and Fisher information to compute per-episode rewards measuring each step's contribution to correctness. This annotation-free approach matches dense feedback quality while eliminating the cost of outcome-only methods that produce 2x excess tokens.

Can judges that reason about reasoning outperform classifier rewards?

StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.

Can a single training example unlock mathematical reasoning?

A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Can rubrics and dense rewards work together without hacking?

DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.

Why does reasoning training help math but hurt medical tasks?

Two-phase inference model shows knowledge retrieval operates in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI researcher auditing claims about reward signals in reasoning tasks. The question remains open: *what information do numerical rewards structurally fail to provide when training models on reasoning?*

What a curated library found — and when (findings span Jan–Oct 2025; dated claims, not current truth):
• Scalar rewards collapse two orthogonal signal types — evaluative (how good?) and directive (what changes?) — into one axis, discarding instructional content (2025-06).
• Natural language critiques break performance plateaus where numerical rewards stall, suggesting information starvation, not capability ceiling (2025-06).
• Dense step-wise rewards (expert-similarity, information-theoretic) recover gradient where outcome-only rewards give flat zeros (2025-06, 2025-10).
• Generative judges that reason about reasoning steps outperform discriminative reward models by reconstructing explanatory structure a bare score erases (2025-08, 2025-05).
• RLVR may activate latent pretraining strategies rather than install new ones; spurious rewards work nearly as well as correct ones (2025-04, 2025-07).
• On contaminated benchmarks, RLVR's reasoning gains are primarily memorization, not genuine learning (2025-07).

Anchor papers (verify; mind their dates):
• arXiv:2504.20571 (2025-04) — RLVR activation dynamics.
• arXiv:2506.13351 (2025-06) — Token-level reasoning + rubric gates.
• arXiv:2507.10532 (2025-07) — Data contamination and reliability.
• arXiv:2508.19229 (2025-08) — Generative stepwise judges.

Your task:
(1) RE-TEST EACH CONSTRAINT. For the claimed directive-information gap, evaluative-only collapse, and sparsity problem: have newer models (o3, Claude reasoning tokens, or 2025-Q4 equivalents), dense reward innovations (process supervision, outcome-conditional step rewards), or better rubric designs since Oct 2025 **dissolved** these limits? Separate durable questions (likely still open) from perishable limitations. Where constraints still hold, ground that claim.
(2) Surface the strongest **contradicting or superseding work** from the last ~6 months that argues numerical rewards are *sufficient*, or that directive information is recoverable *within* the scalar, or that the memorization problem is overstated.
(3) Propose 2 research questions that assume the regime may have moved: one probing whether reward structure itself (vector vs. scalar vs. graph) is the bottleneck, the other testing whether hybrid feedback (rubric gate + dense token reward + reasoning-model critique) can overcome both sparsity and directiveness gaps simultaneously.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
