INQUIRING LINE


Wrong rewards boosted one AI's math by 25% while doing nothing to others — the real difference was baked in during training.

Why do spurious rewards work for some models but not others?


This explores why feeding a model random or even incorrect rewards still sharpens its reasoning in some cases — and why the same trick does nothing for other models. The short version from the corpus: the reward isn't teaching the model anything new. It's pulling a lever that already exists. Whether the lever exists depends entirely on how the model was pretrained.

The sharpest evidence comes from a study where Qwen2.5-Math jumped 16-25% on MATH-500 after training on random or even wrong rewards, while Llama and OLMo got nothing from the same treatment (see: Why do random rewards improve reasoning for some models but not others?). The explanation is that Qwen's pretraining left it with a latent habit, reasoning through code-like steps, that was sitting unused. Almost any optimization pressure, even noise, nudges the model toward surfacing that habit. Llama and OLMo simply don't have the habit to surface, so there's nothing for the noise to activate. The reward is a wake-up call, not a lesson.
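To make "random or even wrong rewards" concrete, here is a minimal sketch of what such reward functions could look like next to an ordinary verifiable reward. The function names, the exact-match check, and the idea of building an "incorrect" reward by scoring against a deliberately wrong label are all assumptions for illustration; this line does not say how the study constructed its rewards.

```python
import random


def match_reward(predicted: str, label: str) -> float:
    """Verifiable-style reward: 1.0 when the answer matches the given label.

    Passing the true answer as `label` gives a standard reward. Passing a
    deliberately wrong label gives one (assumed) form of "incorrect" reward.
    """
    return 1.0 if predicted.strip() == label.strip() else 0.0


def random_reward(rng: random.Random, p: float = 0.5) -> float:
    """Spurious reward: a biased coin flip that never looks at the answer."""
    return 1.0 if rng.random() < p else 0.0


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only so repeated runs agree
    true_label, wrong_label = "42", "41"  # hypothetical labels
    for predicted in ["42", "41", "7"]:
        print(
            predicted,
            match_reward(predicted, true_label),   # ground-truth signal
            match_reward(predicted, wrong_label),  # "incorrect" signal
            random_reward(rng),                    # pure noise
        )
```

The point of the sketch is only that the second and third signals carry no information about correctness, which is what makes any gain from them surprising.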

This fits a broader finding about what reinforcement learning actually does to reasoning. One line of work argues that RLVR (reinforcement learning from verifiable rewards) improves how efficiently a model samples from abilities it already has, rather than expanding what it can do: a single training example can be enough to trigger the shift, and spurious rewards work nearly as well as correct ones for models with the right pretraining (see: What does reward learning actually do to model reasoning?). So the question "why do spurious rewards work?" is really the question "what was already latent in this model?", and the answer was written during pretraining, long before any reward showed up.

There's a useful contrast lurking here. If a reward signal can be pure noise and still help, that tells you standard reward training is often optimizing for something other than genuine quality. Other notes in the corpus show reward models latching onto response-level surface features while barely noticing what question was even asked (see: Why do reward models ignore what question was asked?), and learning spurious correlations like length or sycophancy that have to be deliberately stripped out with causal methods (see: Can counterfactual invariance eliminate reward hacking biases?). Spurious rewards "working" and reward models being fooled by spurious features are two sides of the same coin: in both, the actual content of the signal matters far less than we'd assume.
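The prompt-swap observation can be turned into a simple diagnostic: score each response with its own prompt, score it again with someone else's prompt, and see how much the number moves. The sketch below is one way to do that under stated assumptions. `prompt_swap_sensitivity` and both toy scorers are hypothetical stand-ins, not any real reward model's API.

```python
from statistics import mean
from typing import Callable, Sequence

# A scorer takes (prompt, response) and returns a scalar reward.
Scorer = Callable[[str, str], float]


def prompt_swap_sensitivity(
    score: Scorer, prompts: Sequence[str], responses: Sequence[str]
) -> float:
    """Mean absolute score change when each response is re-paired with the
    next prompt in the list (wrapping around). Near zero means the scorer
    is effectively ignoring the prompt."""
    n = len(prompts)
    if n != len(responses) or n < 2:
        raise ValueError("need at least two aligned prompt/response pairs")
    deltas = []
    for i in range(n):
        original = score(prompts[i], responses[i])
        swapped = score(prompts[(i + 1) % n], responses[i])
        deltas.append(abs(original - swapped))
    return mean(deltas)


def length_only_scorer(prompt: str, response: str) -> float:
    """Toy scorer with a pure length bias: it never reads the prompt."""
    return float(len(response.split()))


def overlap_scorer(prompt: str, response: str) -> float:
    """Toy scorer that does read the prompt: share of response words in it."""
    p, r = set(prompt.lower().split()), set(response.lower().split())
    return len(p & r) / max(len(r), 1)


if __name__ == "__main__":
    prompts = ["what is the capital of france", "how do plants make food"]
    responses = [
        "the capital of france is paris",
        "plants make food through photosynthesis",
    ]
    print("length-only:", prompt_swap_sensitivity(length_only_scorer, prompts, responses))
    print("overlap:", prompt_swap_sensitivity(overlap_scorer, prompts, responses))
```

A scorer that only tracks surface features of the response scores exactly zero on this check, while one that conditions on the prompt does not.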

A less obvious consequence: the dramatic gains you see from clever reward schemes may be partly an illusion of attribution. The credit belongs to pretraining. If you want to see what's genuinely being added versus merely activated, the more revealing experiments isolate the reward's role. One example shows that negative-only reinforcement (suppressing wrong answers) can match full RL while preserving the diversity that positive reinforcement collapses (see: Does negative reinforcement alone outperform full reinforcement learning?). The lesson across all of it: before asking whether a reward works, ask what the model already knew how to do.
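One way to picture "negative-only" versus "positive-only" reinforcement is to split a REINFORCE-style loss by the sign of the reward. The sketch below assumes rewards of +1 for correct samples and -1 for wrong ones and a plain mean over samples. The function name and mode labels are hypothetical, and the objective used in the cited work may differ in details such as normalization or clipping.

```python
from typing import Sequence

# (reward for a correct sample, reward for a wrong sample) per mode.
_REWARDS = {
    "full": (1.0, -1.0),
    "positive_only": (1.0, 0.0),   # only reinforce correct samples
    "negative_only": (0.0, -1.0),  # only suppress wrong samples
}


def reinforce_loss(
    log_probs: Sequence[float], correct: Sequence[bool], mode: str = "full"
) -> float:
    """Surrogate loss -mean(reward_i * log_prob_i) over sampled responses.

    Minimizing it raises the log-probability of positively rewarded samples
    and lowers that of negatively rewarded ones.
    """
    if mode not in _REWARDS:
        raise ValueError(f"unknown mode: {mode}")
    if len(log_probs) != len(correct) or not log_probs:
        raise ValueError("need aligned, non-empty inputs")
    r_correct, r_wrong = _REWARDS[mode]
    total = sum(
        (r_correct if ok else r_wrong) * lp for lp, ok in zip(log_probs, correct)
    )
    return -total / len(log_probs)


if __name__ == "__main__":
    # Hypothetical sequence log-probabilities for four sampled answers.
    log_probs = [-1.0, -2.0, -0.5, -3.0]
    correct = [True, False, True, False]
    for mode in _REWARDS:
        print(mode, reinforce_loss(log_probs, correct, mode))
```

Under these assumptions the full loss is exactly the sum of the positive-only and negative-only parts, which is what lets an experiment train on one part at a time and ask which half is doing the work.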

Sources 5 notes

Why do random rewards improve reasoning for some models but not others?

Qwen2.5-Math gains 16-25% MATH-500 improvement from random or incorrect rewards by activating latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Why do reward models ignore what question was asked?

When prompts are swapped while keeping responses identical, reward model scores barely change. This reveals that standard RLHF optimizes against phantom quality signals rather than prompt-response alignment, enabling four distinct biases.

Can counterfactual invariance eliminate reward hacking biases?

Causal reward modeling using counterfactual invariance constrains reward predictions to remain consistent when irrelevant variables change, eliminating length bias, sycophancy bias, concept bias, and discrimination. Standard training cannot distinguish causal from spurious features; counterfactual invariance forces isolation of actual quality signals.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning (3.33 match)
Spurious Rewards: Rethinking Training Signals in RLVR (2.61 match)
The Invisible Leash: Why RLVR May Not Escape Its Origin (1.72 match)
Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains (1.70 match)
Eliciting Reasoning in Language Models with Cognitive Tools (1.70 match)
Information-Theoretic Reward Decomposition for Generalizable RLHF (1.68 match)
Reward Reasoning Model (1.68 match)
RM-R1: Reward Modeling as Reasoning (1.66 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI researcher re-evaluating a claim about spurious rewards and latent model capabilities. The question remains open: why do spurious rewards sharpen reasoning in some models but not others?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat them as time-stamped, not current consensus.
- Qwen2.5-Math gained 16–25% on math benchmarks from random or incorrect rewards, while Llama and OLMo showed no gain—attributed to pretraining leaving Qwen with a latent code-like reasoning habit (2024–2025).
- RLVR (reinforcement learning from verifiable rewards) improves sampling efficiency from existing abilities rather than expanding capacity; single training examples and spurious rewards work nearly equally well for models with the right pretraining (2024–2025).
- Reward models often latch onto surface features (length, sycophancy) while ignoring question content; causal methods can strip spurious correlations (2025).
- Negative-only reinforcement (suppressing wrong answers) can match full RL while preserving diversity that positive reinforcement collapses (2025–2026).
- Recent work frames reward modeling as reasoning and investigates token-level reasoning reflectivity, questioning whether RLVR truly escapes its origin constraints (2025–2026).

Anchor papers (verify; mind their dates):
- arXiv:2409.15360 (2024) – Reward-Robust RLHF
- arXiv:2506.01347 (2025) – Negative Reinforcement in LLM Reasoning
- arXiv:2507.14843 (2025) – The Invisible Leash: RLVR Origin Constraints
- arXiv:2603.29025 (2026) – Surface Heuristics vs. Implicit Constraints

Your task:
(1) RE-TEST THE ACTIVATION VS. EXPANSION THESIS. For each model class (Qwen, Llama, OLMo, GPT-4o, Claude-4), determine whether newer pretraining runs, instruction tuning, or chain-of-thought scaling have ALTERED which models respond to spurious rewards. Has any post-2026 work shown that the latent-habit frame breaks down for frontier models? Does the distinction between pretraining-signature and reward-signal still hold, or have training pipelines converged? Cite what moved the constraint.
(2) Surface the strongest work from the last 6 months that CONTRADICTS the claim that spurious rewards are merely activation levers. Look for papers arguing rewards genuinely teach new reasoning patterns, or evidence that model-reward misalignment is deeper than surface-feature confusion.
(3) Propose 2 research questions that assume the regime may have shifted: (a) If modern pretraining now includes adversarial reasoning objectives, do spurious rewards still need pretraining-baked habits, or can they scaffold new ones? (b) Does scaling reward model reasoning (RM-as-reasoning) change whether spurious signals can activate latent structure, or does it add a new layer that isolates the true reward from noise?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
