INQUIRING LINE

Why do spurious reward signals improve reasoning for some pretrained models?

This explores why random or incorrect reward signals can still boost reasoning in certain pretrained models — and why the same trick does nothing for others.


The short answer the corpus keeps circling back to: the reward isn't teaching anything new. It's flipping a switch on behavior the model already learned during pretraining. When researchers gave Qwen2.5-Math rewards with zero correlation to correct answers, including purely random ones, it gained 16-25% on MATH-500, while Llama and OLMo gained nothing (see "Why do random rewards improve reasoning for some models but not others?"). The difference wasn't the reward; it was that Qwen had absorbed latent code-reasoning patterns during pretraining that the optimization pressure could surface. The pretraining format determines what's there to be activated.

This reframes what reinforcement learning is actually doing. The dominant story is that RLVR (reinforcement learning from verifiable rewards) elicits rather than creates: it improves how efficiently a model samples from strategies it already has, without pushing past its capability boundary, and a single training example can be enough to trigger the activation (see "What does reward learning actually do to model reasoning?"). That fits a broader finding from five independent lines of work: RL steering, critique fine-tuning, decoding tweaks, feature steering, and RLVR all unlock reasoning already sitting in base-model activations (see "Do base models already contain hidden reasoning ability?"). Post-training selects; it doesn't build. So a spurious reward works for the same reason a correct one does: both are just nudges that bias sampling toward latent good behavior, and if that behavior exists, even a noisy nudge finds it.
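To make "better sampling efficiency without a higher ceiling" concrete, here is a minimal sketch using the standard unbiased pass@k estimator. The sample counts are hypothetical, chosen only for illustration; they are not taken from any of the cited work.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the chance that at least one of k samples,
    drawn without replacement from n attempts of which c are correct, is correct."""
    if n - c < k:
        return 1.0  # fewer than k wrong attempts, so any k samples include a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts for a single problem (assumptions, not reported results).
n = 200
base_correct = 4     # base model: the right strategy exists but is rarely sampled
tuned_correct = 60   # after RLVR: the same strategy is sampled far more often

for k in (1, 16, 128):
    base = pass_at_k(n, base_correct, k)
    tuned = pass_at_k(n, tuned_correct, k)
    print(f"k={k:3d}  base={base:.3f}  tuned={tuned:.3f}")
```

With these made-up counts the two models are far apart at k=1 and close together at k=128. That is the pattern you would expect if training mostly changes how often an existing strategy gets sampled. The sketch only shows that the two observations are compatible; it does not demonstrate that the ceiling is fixed.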

That also explains the asymmetry you'd otherwise find baffling. If the model has no latent reasoning strategy to surface, there's nothing for the reward — correct or spurious — to amplify, which is why Llama and OLMo flatline. The capability ceiling is set in pretraining; reward signal quality mostly governs whether you reach it, not how high it is.

There's a subtler mechanism worth knowing here. Part of why even uninformative rewards help may be that what's doing the work isn't the positive signal at all. Training on only negative samples, which suppresses wrong trajectories, often matches or beats full RL, because it preserves answer diversity while positive-only reinforcement collapses probability mass onto a few paths (see "Does negative reinforcement alone outperform full reinforcement learning?"). If a chunk of the benefit comes from pruning bad paths rather than rewarding good ones, you'd expect rewards loosely tied (or untied) to correctness to still do something useful.
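A single-update toy gives some intuition for the diversity point. It is a sketch under loud assumptions: a four-answer softmax policy with made-up logits, a plain REINFORCE-style update, and hypothetical names throughout. It is not the training setup from the negative-reinforcement work.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, action, reward, lr=1.0):
    """One REINFORCE-style update on the logits for a single sampled answer.

    Uses d log p(action) / d logit_j = 1[j == action] - p_j.
    """
    p = softmax(logits)
    return [
        x + lr * reward * ((1.0 if j == action else 0.0) - p[j])
        for j, x in enumerate(logits)
    ]

# Hypothetical policy: answers 0 and 1 are both correct (two valid solution
# paths), answers 2 and 3 are wrong. The logits are illustrative only.
logits = [1.0, 0.0, 1.0, 0.5]
before = softmax(logits)

# Positive-only: reward a sampled correct answer (index 0).
after_positive = softmax(reinforce_step(logits, action=0, reward=+1.0))

# Negative-only: penalize a sampled wrong answer (index 2).
after_negative = softmax(reinforce_step(logits, action=2, reward=-1.0))

for name, probs in [("before", before),
                    ("positive", after_positive),
                    ("negative", after_negative)]:
    print(name, [round(x, 3) for x in probs])
```

In this toy, rewarding answer 0 pulls probability away from every other answer, including the alternative correct path at index 1. Penalizing answer 2 pushes mass back onto the remaining answers, so both correct paths gain (as does the other wrong answer, which would need its own penalty when sampled).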

The honest caveat: this is a property of the model, not a free lunch. The same literature warns that reward quality matters enormously once you care about more than benchmark accuracy. Binary correctness rewards quietly wreck calibration by encouraging confident guessing (see "Does binary reward training hurt model calibration?"), and standard training can't tell causal quality signals from spuriously correlated ones unless you force the distinction (see "Can counterfactual invariance eliminate reward hacking biases?"). So spurious rewards 'working' is really a diagnostic: it tells you the reasoning was pretrained in and the reward is just an activation key, which is a very different thing from the reward being good.
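The calibration point is easy to check with arithmetic. The sketch below assumes one plausible form of the fix mentioned in the calibration source note, a correctness reward minus a Brier penalty on the stated confidence. The exact reward form and all function names here are assumptions for illustration.

```python
def binary_reward(correct: bool) -> float:
    """Plain correctness reward: ignores the stated confidence entirely."""
    return 1.0 if correct else 0.0

def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness minus a Brier penalty (assumed form, for illustration)."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2

def expected_calibrated(p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is right with probability p_correct."""
    return (p_correct * calibrated_reward(True, confidence)
            + (1.0 - p_correct) * calibrated_reward(False, confidence))

def best_confidence(p_correct: float, steps: int = 100) -> float:
    """Grid-search the stated confidence that maximizes expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_calibrated(p_correct, q))

# Under the binary reward a confident wrong answer costs nothing extra;
# under the Brier-augmented reward it is the worst possible outcome.
print(binary_reward(False), calibrated_reward(False, 1.0))
print(best_confidence(0.3))
```

Under the binary reward, a wrong answer scores the same however confident it was, so nothing discourages confident guessing. With the Brier term, the expected reward peaks when the stated confidence equals the true chance of being right.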


Sources (6 notes)

Why do random rewards improve reasoning for some models but not others?

Qwen2.5-Math gains 16-25% MATH-500 improvement from random or incorrect rewards by activating latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can counterfactual invariance eliminate reward hacking biases?

Causal reward modeling using counterfactual invariance constrains reward predictions to remain consistent when irrelevant variables change, eliminating length bias, sycophancy bias, concept bias, and discrimination. Standard training cannot distinguish causal from spurious features; counterfactual invariance forces isolation of actual quality signals.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning capability analyst. The question remains open: Why do spurious reward signals improve reasoning for some pretrained models but not others?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026 and include:
• Qwen2.5-Math gained 16–25% on MATH-500 from zero-correlation (random) rewards; Llama and OLMo gained nothing, suggesting pretraining format, not reward quality, determines whether latent reasoning exists to surface (2025–2026).
• RLVR (reinforcement learning from verifiable rewards) elicits rather than creates — it improves sampling efficiency from strategies already learned during pretraining without expanding capability boundaries (2025).
• Negative reinforcement alone (suppressing wrong trajectories) matches or exceeds full RL, implying much benefit comes from pruning bad paths, not rewarding good ones (2025–2026).
• Binary correctness rewards degrade calibration by encouraging confident guessing; causal reward modeling via counterfactual invariance is needed to distinguish causal from spuriously correlated signals (2025).

Anchor papers (verify; mind their dates):
• arXiv:2506.01347 (2025-06): The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning
• arXiv:2501.09620 (2025-01): Beyond Reward Hacking — Causal Rewards for LLM Alignment
• arXiv:2510.18176 (2025-10): Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
• arXiv:2507.14843 (2025-07): The Invisible Leash — Why RLVR May Not Escape Its Origin

Your task:
(1) RE-TEST THE LATENT-REASONING THESIS. For each finding above, judge whether newer models (e.g., o1-style chain-of-thought scaling, frontier models post-2025), training methods (e.g., synthetic trajectory generation, multi-turn RL), or evaluation harnesses have since RELAXED the pretraining-format constraint or overturned the model-specificity asymmetry (Qwen vs. Llama). Separately: is reasoning truly latent in base models, or does post-training sometimes BUILD capability? Cite what resolved or confirmed this.
(2) Surface the strongest CONTRADICTING work from the last ~6 months: papers showing spurious rewards fail universally, or that reward quality DOES determine ceiling height, not just sampling efficiency.
(3) Propose 2 research questions that ASSUME the regime may have moved, e.g., do foundation-model scaling laws now make the dependence on pretrained latent reasoning moot? Can you force reasoning creation via adversarial reward design?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
