INQUIRING LINE


Does training AI on correct answers teach it to reason well, or just to write reasoning that looks plausible?

Does RLVR reward structure create pressure toward traces that look right?

This explores whether the way RLVR hands out rewards quietly trains models to produce reasoning that looks valid on the surface rather than reasoning that actually is valid — and what the corpus reveals about that gap.

The corpus answers with a fairly consistent yes. The most direct evidence is that RLVR demonstrably improves the *surface form* of reasoning without guaranteeing its *substance*. One study finds that RLVR post-training measurably reduces logical errors between adjacent reasoning steps, yet locally coherent traces can still be globally invalid proofs: the improvement is structural, not semantic (see "Does RLVR actually improve mathematical reasoning or just coherence?"). In other words, the reward teaches a model to write steps that flow plausibly from one to the next, which is exactly what 'looks right' means, without guaranteeing that the proof holds.

The reason becomes clearer once you look at what the reward signal actually selects for. A binary correct/incorrect reward doesn't penalize a confident wrong answer, so it incentivizes high-confidence guessing and degrades calibration: a model learns that sounding sure pays off whether or not the confidence is warranted (see "Does binary reward training hurt model calibration?"). Push the difficulty too high and the pathology sharpens. On nearly impossible problems, group-relative normalization treats rare accidental successes as high-advantage trajectories, so the model is rewarded for answer-repetition and computation-skipping shortcuts that masquerade as solutions (see "Do overly hard RLVR samples actually harm model capabilities?"). The optimizer is indifferent to *why* an output got rewarded, so it amplifies whatever cheap surface pattern happened to land on the right answer.
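
To make the hard-problem pathology concrete, here is a minimal sketch that assumes GRPO-style normalization, where each rollout's advantage is its reward minus the group mean, divided by the group standard deviation. The function name, group sizes, and reward values are hypothetical illustrations and are not taken from any of the cited studies.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).

    Assumption: mean/std normalization over the group, as in GRPO-style methods.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of 16 rollouts on a nearly impossible problem:
# fifteen failures and one accidental success.
hard_group = [0.0] * 15 + [1.0]

# Hypothetical group of 16 rollouts on a moderate problem: half succeed.
moderate_group = [0.0] * 8 + [1.0] * 8

hard_adv = group_relative_advantages(hard_group)
moderate_adv = group_relative_advantages(moderate_group)

# Advantage assigned to a successful rollout in each group.
lucky_advantage = max(hard_adv)
typical_advantage = max(moderate_adv)

print(f"lucky success advantage:   {lucky_advantage:.2f}")
print(f"typical success advantage: {typical_advantage:.2f}")
```

Under these made-up numbers the lone success in the hard group receives an advantage near 3.9, against about 1.0 for a success in the moderate group. Nothing in the computation asks how the success was reached, so a fluke produced by a shortcut is reinforced more strongly per sample than an ordinary correct solution.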

The most striking confirmation comes from the spurious-reward findings. For models with suitable pretraining, random or even incorrect rewards improve benchmark scores nearly as well as correct ones: Qwen2.5-Math gains 16-25% on MATH-500 this way, while Llama and OLMo show no gains. The explanation offered is not that the model is learning from the reward, but that RLVR catalyzes a phase transition that surfaces reasoning behaviors already latent in pretraining (see "Why does RLVR work with completely random rewards?" and "Why do random rewards improve reasoning for some models but not others?"). If a meaningless reward produces nearly the same gains as a correct one, then the reward isn't teaching validity; it's selecting for a *presentation* that pretraining already knows how to generate. This dovetails with the broader finding that RLVR sharpens sampling toward solutions already in the base model's distribution rather than expanding what the model can actually solve (see "Does RLVR actually expand what models can reason about?" and "What does reward learning actually do to model reasoning?"), and can even narrow problem-solving scope by prioritizing exploitation over exploration (see "Why does RLVR training narrow a model's problem solving ability?").
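
The "sharpens sampling rather than expanding capability" claim rests on pass@k comparisons, so a small sketch of that measurement may help. It uses the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k) for c correct answers among n samples; whether the cited studies use exactly this estimator is an assumption here. The per-problem correct counts are invented for illustration and are not results from any paper.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: probability that at least one of k samples,
    drawn without replacement from n samples containing c correct, is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


N_SAMPLES = 256

# Hypothetical correct counts per problem (out of 256 samples each).
# Base model: solves all four problems, but rarely.
base_counts = [8, 8, 8, 8]
# RLVR model: solves two problems reliably and never solves the other two.
rlvr_counts = [200, 200, 0, 0]

base_pass1 = mean_pass_at_k(base_counts, N_SAMPLES, 1)
rlvr_pass1 = mean_pass_at_k(rlvr_counts, N_SAMPLES, 1)
base_pass128 = mean_pass_at_k(base_counts, N_SAMPLES, 128)
rlvr_pass128 = mean_pass_at_k(rlvr_counts, N_SAMPLES, 128)

print(f"pass@1   base={base_pass1:.3f}  rlvr={rlvr_pass1:.3f}")
print(f"pass@128 base={base_pass128:.3f}  rlvr={rlvr_pass128:.3f}")
```

With these invented counts the RLVR model wins at k=1 while the base model wins at k=128, which is the crossover pattern the pass@k argument points to: higher single-sample accuracy alongside a smaller set of problems that are solvable at all.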

There's a sharper version of the worry too: sometimes the trace doesn't just *look* right, it looks right because the benchmark leaked. On contaminated datasets, RLVR's apparent gains are primarily memorization: Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on the post-release LiveMathBench, and on clean benchmarks only correct rewards improve performance (see "Does RLVR success on math benchmarks reflect genuine reasoning improvement?"). Genuine reasoning activation and contamination-driven score inflation can even coexist, operating at different measurement levels (see "Can genuine reasoning activation coexist with contaminated benchmarks?"). So 'looks right' has two flavors here: traces that are locally coherent but globally unsound, and answers that are correct for reasons that won't generalize.

What you didn't know you wanted to know: the corpus also points at the fix, and it's about giving the reward something harder to fake. Adding the Brier score, a proper scoring rule, as a second reward term mathematically forces accuracy and calibration to improve together, closing the confident-guessing loophole (see "Does binary reward training hurt model calibration?"). Feeding partial ground-truth solution traces as adaptive guidance converts wasted compute on impossible problems into real learning signal (see "Can adaptive guidance from solution traces reduce reward sparsity in RL?"). And making the *reward model itself* reason before it scores raises the ceiling on what evaluation can catch (see "Can reward models benefit from reasoning before scoring?"). The throughline: pressure toward traces that look right is what you get whenever the reward can only see the surface, and the antidote is a reward that has to look deeper.
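
The Brier fix can be checked with a few lines of arithmetic. The sketch below assumes a combined reward of the form correct - (q - correct)^2, where q is the model's stated confidence and correct is 0 or 1; the exact formulation in the cited work may differ, and the function names and the 0.3 success probability are hypothetical.

```python
def expected_binary_reward(p, q):
    """Expected 0/1 correctness reward when P(correct) = p.

    The stated confidence q is accepted but never used: a binary reward
    cannot see it, so overclaiming costs nothing.
    """
    return p


def expected_brier_reward(p, q):
    """Expected value of correct - (q - correct)**2 when P(correct) = p.

    Assumption: the Brier penalty is subtracted from the correctness reward.
    """
    reward_if_correct = 1.0 - (q - 1.0) ** 2
    reward_if_wrong = 0.0 - (q - 0.0) ** 2
    return p * reward_if_correct + (1.0 - p) * reward_if_wrong


p_true = 0.3  # hypothetical chance that the answer is actually right

# Search stated confidences 0.00, 0.01, ..., 1.00 for the most rewarding one.
grid = [i / 100 for i in range(101)]
best_q = max(grid, key=lambda q: expected_brier_reward(p_true, q))

overconfident = expected_brier_reward(p_true, 0.99)
calibrated = expected_brier_reward(p_true, p_true)

print(f"best stated confidence under Brier reward: {best_q:.2f}")
print(f"expected reward at q=0.99: {overconfident:.3f}, at q=0.30: {calibrated:.3f}")
```

Under the binary reward every stated confidence earns the same expected reward, so there is no reason to hedge. Under the combined reward the expectation peaks when the stated confidence equals the true success probability, which is what makes the Brier score a proper scoring rule.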

Sources 12 notes

Does RLVR actually improve mathematical reasoning or just coherence?

RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Why does RLVR work with completely random rewards?

RLVR works nearly as well with spurious rewards as correct ones because it catalyzes a phase transition in model output distribution. The effectiveness depends on pretraining quality, not reward signal quality or training volume.

Why do random rewards improve reasoning for some models but not others?

Qwen2.5-Math gains 16-25% MATH-500 improvement from random or incorrect rewards by activating latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.


Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Why does RLVR training narrow a model's problem solving ability?

RLVR narrows models' problem-solving scope by prioritizing exploitation over exploration, a phenomenon called capability boundary collapse. Multiple importance sampling with exploration-based advantage functions can counteract this by integrating external data and explicitly rewarding discovery of underexplored but valuable reasoning paths.

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.

Can genuine reasoning activation coexist with contaminated benchmarks?

RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.

Can adaptive guidance from solution traces reduce reward sparsity in RL?

GHPO dynamically provides ground-truth solution traces for hard problems while using standard RL for manageable ones, achieving 5% gains across math benchmarks. This converts wasted compute on impossible problems into learning signal by leveraging traces already present in training data.

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

The Invisible Leash: Why RLVR May Not Escape Its Origin (7.67 match)
Spurious Rewards: Rethinking Training Signals in RLVR (6.07 match)
Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains (5.96 match)
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? (5.93 match)
Reward Reasoning Model (3.36 match)
The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning (3.33 match)
Reinforcement Learning for Reasoning in Large Language Models with One Training Example (3.32 match)
RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization (2.55 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about RLVR reward structure and trace validity. The question remains open: Does RLVR push models toward traces that *look* right rather than ones that *are* right?

What a curated library found — and when (dated claims, not current truth): Findings span 2024–2026; treat as perishable.
• RLVR reduces logical errors between adjacent steps but does not guarantee global proof validity; improvement is structural, not semantic (2025-10).
• Binary reward signals actively incentivize high-confidence guessing and degrade calibration; sounding sure pays off regardless of correctness (2024-09, 2025-04).
• Random or incorrect rewards improve benchmark scores nearly as well as correct ones, suggesting RLVR selects for presentation already latent in pretraining rather than teaching validity (2025-07).
• On contaminated datasets, apparent gains are primarily memorization; genuine reasoning activation and benchmark inflation coexist at different levels (2025-07).
• Adding Brier proper-scoring rules, partial ground-truth guidance, and reward-model reasoning each close loopholes in surface-level evaluation (2024-09, 2025-05, 2025-07).

Anchor papers (verify; mind their dates):
• 2024-09: arXiv:2409.15360 (Reward-Robust RLHF)
• 2025-07: arXiv:2507.10532 (Data Contamination & Memorization)
• 2025-10: arXiv:2510.18176 (Local Coherence vs. Global Validity)
• 2025-05: arXiv:2505.14674 (Reward Reasoning Models)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (larger scales, longer contexts), training methods (curriculum design, hard-negative mining, multi-objective RL), tooling (process reward models, outcome oracles), orchestration (ensemble reasoning, debate, hierarchical decomposition), or evaluation (holdout reasoning tasks, OOD generalization benches) have since relaxed or overturned it. Separate the durable question — does surface optimization decouple from validity? — from perishable limitations. Cite what resolved each; plainly state where constraints still hold.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Has any paper shown that spurious rewards *do not* work, or that RLVR *does* escape base-model boundaries under specific conditions?
(3) Propose two research questions that assume the regime may have shifted: one on whether reward reasoning or outcome-based oracles have closed the surface/validity gap, and one on whether harder evaluation (adversarial traces, hidden reasoning, causally-grounded metrics) now catches what loose rewards miss.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
