INQUIRING LINE

If the base model already knows the answer, does it matter what reward signal you use to unlock it?

Is elaborate reward shaping necessary if the pretrained prior already contains good solutions?

This explores a tension in the corpus: when a model's pretraining already contains good solutions, does reward engineering actually matter — or does almost any signal work because RL is just surfacing what's already there?

Does elaborate reward shaping earn its keep when the pretrained prior already holds good solutions? The corpus leans surprisingly hard toward "no, not for the reasons you'd think." The strongest version of this argument is that reinforcement learning with verifiable rewards (RLVR) doesn't teach new reasoning at all; it sharpens sampling toward solutions the base model could already produce (see "Does RLVR actually expand what models can reason about?"). If RL is mostly an activation mechanism, then the precise shape of the reward should matter less than whether it points roughly the right way. That prediction holds up almost shockingly well: a single training example can trigger the gains, and *spurious* rewards work nearly as well as correct ones, provided the pretraining was good enough to have the strategy waiting (see "What does reward learning actually do to model reasoning?"). In that light, elaborate shaping is often dressing on a process that is really just unlocking a latent prior.
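The "sharpening, not expanding" claim is usually tested with pass@k, the probability that at least one of k sampled answers is correct. The sketch below uses the standard unbiased estimator, 1 - C(n-c, k)/C(n, k), for n samples of which c are correct. Whether the cited work uses exactly this estimator is an assumption, and the per-problem counts are invented purely for illustration; they are not taken from any paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are wrong,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts of correct answers out of N samples on two problems.
N = 100
BASE_MODEL = [30, 2]  # broad but thin: rarely solves the second problem
RL_TUNED = [95, 0]    # sharp but narrow: never solves the second problem

if __name__ == "__main__":
    for k in (1, 8, 64):
        base = mean_pass_at_k(BASE_MODEL, N, k)
        tuned = mean_pass_at_k(RL_TUNED, N, k)
        print(f"k={k:>2}  base={base:.3f}  rl_tuned={tuned:.3f}")
```

With these made-up counts the tuned model wins at k=1 while the base model wins at k=64, which is the crossover pattern the pass@k argument relies on.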

A second cluster pushes in the same direction from different angles, suggesting the reward can be radically minimal. Negative reinforcement alone (suppressing wrong trajectories, never rewarding right ones) often matches full PPO/GRPO, and it preserves the answer diversity that positive-only training collapses (see "Does negative reinforcement alone outperform full reinforcement learning?"). You can also let the model supply its own dense signal: an agent's belief shift toward the solution serves as intrinsic per-turn credit, with no critic network or process-reward model bolted on (see "Can an agent's own beliefs guide credit assignment without critics?"). And on hard problems, sophisticated domain reasoning emerges from plain accuracy signals with no chain-of-thought distillation at all (see "Can simple rewards alone teach complex domain reasoning?"). The common thread: when the prior is strong, the reward's job is to *select*, not to *instruct*, and selection is cheap.
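The belief-shift idea can be sketched in a few lines. The source note describes per-turn credit as log-ratios of sequential probability estimates; the function below is a minimal reading of that description, assuming the agent reports a probability for the true solution after every turn. The function name and the example beliefs are hypothetical, and the actual ΔBelief-RL method may differ in detail.

```python
import math


def belief_shift_credit(beliefs: list[float]) -> list[float]:
    """Per-turn credit as the log-ratio of consecutive beliefs.

    beliefs[0] is the probability assigned to the true solution before
    the first turn; beliefs[t] is the probability after turn t.
    """
    if len(beliefs) < 2:
        raise ValueError("need a starting belief and at least one turn")
    if any(not 0.0 < b <= 1.0 for b in beliefs):
        raise ValueError("beliefs must lie in (0, 1]")
    return [math.log(curr / prev) for prev, curr in zip(beliefs, beliefs[1:])]


if __name__ == "__main__":
    # Hypothetical 20-Questions-style episode: turn 2 is a wasted question
    # that moves the agent away from the answer.
    beliefs = [0.01, 0.05, 0.04, 0.40, 0.90]
    for turn, credit in enumerate(belief_shift_credit(beliefs), start=1):
        print(f"turn {turn}: credit = {credit:+.3f}")
```

Because the log-ratios telescope, the per-turn credits sum to log(final belief / initial belief), so the dense signal stays consistent with the overall progress of the episode without a separate critic.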

But here is the turn that makes the question interesting: the corpus also marks exactly where the minimal-reward story breaks, and it is precisely where the prior *doesn't* contain what you need. Binary correctness rewards quietly degrade calibration, because nothing penalizes a confident wrong answer; you have to add a proper scoring rule (the Brier score) to fix it, a case where shaping is mathematically necessary, not decorative (see "Does binary reward training hurt model calibration?"). When models hit performance plateaus, scalar rewards stall because they carry no information about *why* a solution failed; natural-language critique breaks through where more numerical reward cannot (see "Can natural language feedback overcome numerical reward plateaus?"). That is because feedback genuinely decomposes into two channels, evaluative (how good) and directive (how to change), and a scalar can only ever carry the first (see "Can scalar rewards capture all the information in agent feedback?").
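A small numeric check shows why the binary reward is blind to confidence and the scoring rule is not. The sketch assumes the combined reward is correctness minus the squared error between stated confidence and correctness (a Brier-style term); the exact form used in the cited work is an assumption here, and the function names are illustrative.

```python
def binary_reward(correct: bool, confidence: float) -> float:
    """Plain correctness reward: the stated confidence is ignored."""
    return 1.0 if correct else 0.0


def brier_shaped_reward(correct: bool, confidence: float) -> float:
    """Correctness minus a Brier penalty on the stated confidence."""
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2


def expected_reward(reward_fn, p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is right with probability p_correct."""
    return (p_correct * reward_fn(True, confidence)
            + (1.0 - p_correct) * reward_fn(False, confidence))


def best_confidence(reward_fn, p_correct: float, steps: int = 100) -> float:
    """Grid-search the stated confidence that maximizes expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_reward(reward_fn, p_correct, q))


if __name__ == "__main__":
    p = 0.3  # hypothetical true chance that the model's answer is correct
    for q in (0.3, 0.99):
        print(q,
              expected_reward(binary_reward, p, q),
              expected_reward(brier_shaped_reward, p, q))
    print("best stated confidence:", best_confidence(brier_shaped_reward, p))
```

Under the binary reward the expected value is the same for any stated confidence, so nothing discourages overconfidence. With the Brier term the expected reward works out to p^2 - (q - p)^2, which peaks when the stated confidence q equals the true success probability p.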

So the real answer reframes the question. Elaborate shaping isn't necessary for *activating* good solutions already in the prior; there, simple, sparse, even noisy rewards suffice, and complexity is mostly wasted. Shaping becomes necessary the moment you ask for something the prior lacks: well-calibrated confidence, escape from a capability plateau, or genuinely new reasoning patterns (which the corpus suggests come from distillation, not RL; see "Does RLVR actually expand what models can reason about?"). Even the reward model itself can be made smarter rather than more elaborate: letting it reason before it scores raises its ceiling more than hand-tuning the signal does (see "Can reward models benefit from reasoning before scoring?").

The thing you didn't know you wanted to know: the question "is shaping necessary?" has no universal answer because reward shaping and the pretrained prior are substitutes. A rich prior lets you get away with a crude reward; a crude reward only fails you in the gaps the prior didn't already fill. The engineering question isn't "how elaborate should my reward be?" — it's "what, specifically, is missing from the prior, and is that gap evaluative or directive?"

Sources (9 notes)

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Can an agent's own beliefs guide credit assignment without critics?

ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.

Can simple rewards alone teach complex domain reasoning?

Medical AI systems and o3 demonstrate that sophisticated domain reasoning emerges naturally from RL training on difficult problems with only basic accuracy signals, without requiring explicit chain-of-thought distillation from teacher models.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Reward Reasoning Model (4.23 match; arXiv)
The Invisible Leash: Why RLVR May Not Escape Its Origin (2.63 match; arXiv)
RM-R1: Reward Modeling as Reasoning (2.60 match; arXiv)
Understanding and Mitigating Premature Confidence for Better LLM Reasoning (2.57 match; arXiv)
The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning (2.54 match; arXiv)
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? (1.78 match; arXiv)
Spurious Rewards: Rethinking Training Signals in RLVR (1.76 match; arXiv)
Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains (1.75 match; arXiv)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher auditing whether elaborate reward shaping remains necessary when pretrained models already encode good solutions. The question is still open: the regime may have shifted.

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat them as perishable.

• RLVR doesn't expand reasoning beyond the base model's capability boundary; RL acts as an activation/selection mechanism, not instruction (2025).
• Single training examples and even spurious rewards trigger performance gains if the prior is strong enough; elaborate shaping often adds no marginal benefit (2025).
• Negative reinforcement alone (suppressing wrong trajectories, never rewarding right ones) often matches full PPO/GRPO while preserving diversity (2025).
• Binary correctness rewards provably degrade calibration; proper scoring rules (Brier) are mathematically necessary when calibration matters; natural-language critique breaks plateaus scalar rewards cannot (2025).
• Reward reasoning models (letting the evaluator reason before scoring) extend test-time compute scaling to evaluation and raise the ceiling more than hand-tuned signal shape (2025).

Anchor papers (verify; mind their dates):
• arXiv:2504.13837 (Apr 2025) — Does RL really expand reasoning capacity?
• arXiv:2506.01347 (Jun 2025) — Negative reinforcement effectiveness.
• arXiv:2506.03106 (Jun 2025) — Natural language feedback breaks plateaus.
• arXiv:2505.14674 (May 2025) — Reward reasoning models.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, judge whether newer models (o1-family, reasoning variants), training method advances (continued scaling of verifiable RL, process-reward architectures), or evals (calibration benchmarks, reasoning-specific suites) have since relaxed or overturned it. Separate the durable question ("When does RL activate vs. instruct?") from the perishable limitation ("elaborate shaping is useless"). Cite what resolved it; flag where constraints still hold.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Does any recent paper argue that shaping *is* necessary, or that the prior-activation story breaks under realistic conditions?
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., "If o1-class reasoning comes partly from supervised reasoning distillation, not RL, does that change the role of reward shape?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
