Why do random rewards improve reasoning for some models but not others?
When RLVR training uses meaningless reward signals, some models gain reasoning improvements while others don't. What determines which models can benefit from optimization pressure without meaningful feedback?
RLVR improves MATH-500 performance for Qwen2.5-Math-7B by 21.4 absolute percentage points with random rewards, 16.4 with format-only rewards, 24.6 with incorrect labels, and 24.4 with 1-shot RL, nearly matching the 28.8 points gained with ground-truth rewards. For this model, the reward signal appears almost irrelevant to the outcome.
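To make these reward conditions concrete, here is a minimal Python sketch. The function names, the \boxed{...} answer convention, and the 0.5 probability for the random reward are assumptions made for illustration; the note itself only names the reward types.

```python
import random
import re
from typing import Optional

# Assumed answer convention: the final answer appears inside \boxed{...}.
_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_boxed(response: str) -> Optional[str]:
    # Return the contents of the last boxed expression, or None if absent.
    matches = _BOXED.findall(response)
    return matches[-1].strip() if matches else None


def ground_truth_reward(response: str, label: str) -> float:
    # 1.0 only when the boxed answer equals the reference label.
    return 1.0 if extract_boxed(response) == label.strip() else 0.0


def incorrect_label_reward(response: str, wrong_label: str) -> float:
    # Same check, but against a label that is known to be wrong.
    return ground_truth_reward(response, wrong_label)


def format_only_reward(response: str) -> float:
    # Rewards any non-empty boxed expression, regardless of correctness.
    return 1.0 if extract_boxed(response) else 0.0


def random_reward(response: str, rng: random.Random, p: float = 0.5) -> float:
    # Ignores the response entirely: a coin flip with probability p.
    return 1.0 if rng.random() < p else 0.0


response = "So the answer is \\boxed{42}."
rng = random.Random(0)
coin_flips = {random_reward(response, rng) for _ in range(50)}
```

Only `ground_truth_reward` depends on the correct answer; `random_reward` never inspects the response at all.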
But these spurious rewards largely fail for the Llama3 and OLMo2 model families. The critical variable is not the reward but the pretraining strategy. Qwen2.5-Math develops a distinctive "code reasoning" behavior (thinking in code without executing it) that rises from 66.7% to over 90% frequency after RLVR, even with spurious rewards. Other model families lack this particular latent strategy.
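One way to track a frequency like this is a simple marker-based classifier over sampled responses. The sketch below is a crude heuristic, not the measurement used in the study: the marker list and the function names are assumptions.

```python
import re
from typing import Iterable

# Built at runtime so this example does not embed a literal code fence.
FENCE = "`" * 3

# Assumed markers of Python-style reasoning: a fenced python block, or a line
# that starts with a definition, an import, or a print call.
_CODE_MARKERS = re.compile(
    re.escape(FENCE) + r"python|^\s*(?:def \w+\(|import \w+|from \w+ import |print\()",
    re.MULTILINE,
)


def uses_code_reasoning(response: str) -> bool:
    # True when the response contains at least one assumed code marker.
    return bool(_CODE_MARKERS.search(response))


def code_reasoning_frequency(responses: Iterable[str]) -> float:
    # Fraction of responses flagged as code reasoning (0.0 for an empty sample).
    responses = list(responses)
    if not responses:
        return 0.0
    return sum(uses_code_reasoning(r) for r in responses) / len(responses)


samples = [
    "Let me compute.\nimport math\nprint(math.sqrt(16))",
    "By the Pythagorean theorem the answer is 5.",
    "def f(x):\n    return x + 1\nSo f(2) = 3.",
]
frequency = code_reasoning_frequency(samples)  # 2 of the 3 samples are flagged
```

Comparing this frequency before and after training, per model family, is the kind of comparison the 66.7% to 90%+ figure describes.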
This is perhaps the strongest evidence bearing on "Does RL teach reasoning or just when to use it?". If random rewards work nearly as well as correct rewards for specific models, then RLVR's function is not to provide direction but to provide pressure. The optimization signal, any optimization signal, activates preexisting reasoning strategies encoded during pretraining. The reward is a catalyst, not a teacher.
Following "Does training data format shape reasoning strategy more than domain?", the Qwen code-reasoning strategy is a pretraining format artifact. RLVR surfaces it; the specific reward signal is incidental to the surfacing. Models without that pretraining format cannot benefit from the same activation pressure.
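How close the spurious rewards come to the correct ones can be read off the MATH-500 gains quoted above. A small sketch of the arithmetic, using only those figures (the variable names are illustrative):

```python
# MATH-500 gains for Qwen2.5-Math-7B, as quoted in this note.
gains = {
    "random": 21.4,
    "format_only": 16.4,
    "incorrect_labels": 24.6,
    "one_shot_rl": 24.4,
}
ground_truth_gain = 28.8

# Share of the ground-truth gain that each alternative signal recovers, in percent.
recovered = {
    name: round(100 * gain / ground_truth_gain, 1) for name, gain in gains.items()
}

for name, share in recovered.items():
    print(f"{name}: {share}% of the ground-truth gain")
# random: 74.3%, format_only: 56.9%, incorrect_labels: 85.4%, one_shot_rl: 84.7%
```

On these numbers the spurious signals recover roughly 57% to 85% of the ground-truth gain: a large share, though not the whole of it.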
The practical implication is sobering: RLVR effectiveness may be almost entirely determined before RLVR training begins. The investment in careful reward engineering may be less important than the investment in pretraining data composition.
Critical challenge: data contamination. The RandomCalculation paper directly challenges the "any reward works" interpretation. Given only the first 60% of each problem, Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 problems; on the post-release LiveMathBench this drops to 0.0%. On a fully clean benchmark of synthetic arithmetic (guaranteed to post-date model release), random rewards produce unstable training with no reliable improvement, while correct rewards deliver consistent gains surpassing the model's ceiling. This means the benchmark gains that motivated the "reward doesn't matter" narrative may be substantially inflated by memorization. The code-reasoning behavior change (66.7% → 90%+) is real and not explained by contamination alone, but the headline finding requires significant qualification. See "Does RLVR success on math benchmarks reflect genuine reasoning improvement?" for the full contamination argument and ops/tensions/rlvr-spurious-rewards-work-vs-rlvr-gains-are-data-contamination-artifacts.md for the tension analysis.
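The partial-prompt probe can be sketched as follows. This is an illustration under stated assumptions, not the paper's protocol: the character-level 60% split, the prefix-match criterion, and the `complete` callable (a stand-in for whatever wraps the model) are all hypothetical.

```python
from typing import Callable, Iterable, Tuple


def split_prompt(problem: str, visible_fraction: float = 0.6) -> Tuple[str, str]:
    # Split a problem into the visible prefix and the hidden remainder.
    # Assumption: the split is by characters; a token-level split is equally plausible.
    cut = int(len(problem) * visible_fraction)
    return problem[:cut], problem[cut:]


def _normalize(text: str) -> str:
    # Collapse whitespace and lowercase so trivial differences do not count.
    return " ".join(text.split()).lower()


def reconstruction_rate(
    problems: Iterable[str],
    complete: Callable[[str], str],
    visible_fraction: float = 0.6,
) -> float:
    # Fraction of problems whose hidden remainder the model reproduces verbatim.
    problems = list(problems)
    if not problems:
        return 0.0
    hits = 0
    for problem in problems:
        prefix, hidden = split_prompt(problem, visible_fraction)
        if _normalize(complete(prefix)).startswith(_normalize(hidden)):
            hits += 1
    return hits / len(problems)


# Toy stand-in for a model that has memorized exactly one problem.
memorized = "What is the sum of the first ten positive integers?"


def toy_complete(prefix: str) -> str:
    if memorized.startswith(prefix):
        return memorized[len(prefix):]
    return "I do not know."


problems = [memorized, "Compute the remainder when 17 times 23 is divided by 5."]
rate = reconstruction_rate(problems, toy_complete)  # 0.5: one of two reconstructed
```

A high rate on a pre-release benchmark alongside a near-zero rate on a post-release one is the pattern the contamination argument rests on.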
Reweave 2026-05-18: the catalyst framing applies when the reward is misaligned, but structured rewards can still teach. The original "reward is a catalyst, not a teacher" framing remains correct for the specific case the spurious-rewards study examines: when the pretrained prior already contains the target capability (Qwen's code reasoning), almost any optimization pressure surfaces it. But late-2025 evidence sharpens the scope of this claim. "Can reward models learn by comparing policies instead of judging them?" shows that structured rewards, such as POLAR's similarity-to-target-policy, provide a genuinely directional signal that carries information beyond catalysis. The distinction is:
- Reward as catalyst (this note's framing): applies when the reward signal is misaligned or random and the prior provides the structure. The reward provides pressure without direction; the prior provides direction. Spurious rewards work in this regime.
- Reward as relational signal (POLAR's framing): applies when the reward is structurally aligned with what should be learned. Similarity-to-target IS direction. The signal carries information.
These coexist because they describe different regimes. The "any reward works" finding tells you what happens when the prior dominates; POLAR tells you what happens when the reward form is itself structured to carry the lesson. The general framing: rewards that lack structure rely on the prior; rewards with structure carry independent information.
This connects to the broader convergence described in "Can language models replace reward models with internal signals?": the five verifier-free patterns each provide structured (not random) signal, and their substitutability is consistent with the prior dominating within the structured-signal regime. The spurious-rewards finding is a different observation, about what happens when signal is absent.
Inquiring lines that use this note as a source (25)
This note is a source for these synthesized inquiries.
- Do spurious rewards activate reasoning without teaching new skills?
- What behavioral changes occur during reward learning training?
- How much RLVR improvement comes from benchmark data memorization?
- Can clean benchmarks reveal true RLVR reasoning gains?
- Why do spurious reward signals improve reasoning for some pretrained models?
- Can random rewards improve reasoning models if pretraining is suitable?
- Does negative reinforcement alone achieve what full RL training accomplishes?
- Does RLVR reward structure create pressure toward traces that look right?
- Why do spurious rewards work nearly as well as correct ones?
- What role do high-entropy minority tokens play in RLVR?
- Why do different models respond differently to spurious rewards?
- Why do spurious rewards work for some models but not others?
- When does outcome reward signal become informative during model training?
- How do reward signals in RLVR interact with pretraining biases?
- How does 93% reward reliability compare to other RL noise sources?
- What happens when variance in reward signals comes from a noisy model?
- Why does medium difficulty outperform both easy and hard RLVR training samples?
- Can the same variance signal work as both reward and query filter?
- Are different reward signal sources substitutable in verifier-free RL?
- Why do six different RLVR algorithms converge on similar performance levels?
- How does prolonged RL training differ from standard RLVR approaches?
- Why do certain tokens at certain difficulties drive most of RLVR's learning signal?
- Does RLVR teach new reasoning or activate existing pretraining capabilities?
- Does careful reward engineering matter if pretraining determines RLVR effectiveness?
- What makes binary rewards more effective than richer reward signals?
Related concepts in this collection (4)
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  spurious rewards are the strongest confirmation that RL teaches timing, not capability
- Does training data format shape reasoning strategy more than domain?
  What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.
  code reasoning as pretraining format artifact explains model-specificity
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  any reward pressure unlocks latent strategies
- Do reasoning traces need to be semantically correct?
  Can models learn to solve problems from deliberately corrupted or irrelevant reasoning traces? This challenges assumptions about what makes intermediate tokens useful for learning.
  parallel: corrupted inputs can still yield gains
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Spurious Rewards: Rethinking Training Signals in RLVR
- Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
- The Invisible Leash: Why RLVR May Not Escape Its Origin
- The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning
- Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination
- Eliciting Reasoning in Language Models with Cognitive Tools
- Reward Reasoning Model
- RM-R1: Reward Modeling as Reasoning
Original note title
spurious rewards with no correlation to correct answers still improve rlvr reasoning — but only for models with specific pretraining strategies