SYNTHESIS NOTE

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Explores whether RLVR's apparent effectiveness with spurious rewards on contaminated benchmarks like MATH-500 represents actual reasoning gains or merely the retrieval of memorized data.

Synthesis note · 2026-02-23 · sourced from Flaws

The apparent success of RLVR with random, incorrect, or spurious reward signals on Qwen models may be an artifact of data contamination rather than evidence of genuine reasoning improvement.

The contamination evidence: prompting Qwen2.5-Math-7B with the first 60% of each MATH-500 question yields 54.6% exact-match reconstruction of the remaining 40% and 53.6% correct answers to these incomplete problems. On LiveMathBench, a benchmark released after Qwen2.5, the completion rate drops to 0.0%, consistent with Llama3.1-8B (3.8%/0.0% respectively). This pattern indicates that the model has memorized MATH-500.
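The partial-prompt check described above can be sketched roughly as follows. This is a minimal illustration, not the original evaluation code: `split_question`, `completion_exact_match_rate`, and the `generate` callable are hypothetical names, and splitting on whitespace-separated words is an assumption, since the note does not say whether characters, words, or tokens were used.

```python
from typing import Callable, Iterable, Tuple


def split_question(question: str, prefix_ratio: float = 0.6) -> Tuple[str, str]:
    """Split a question into a visible prefix and a held-out suffix.

    Assumption: the split is made on whitespace-separated words.
    """
    words = question.split()
    cut = int(len(words) * prefix_ratio)
    return " ".join(words[:cut]), " ".join(words[cut:])


def completion_exact_match_rate(
    questions: Iterable[str],
    generate: Callable[[str], str],
    prefix_ratio: float = 0.6,
) -> float:
    """Fraction of questions whose held-out suffix is reproduced exactly.

    `generate` is a hypothetical stand-in for a model call: it takes the
    prefix and returns the model's continuation as a string.
    """
    hits = 0
    total = 0
    for question in questions:
        prefix, suffix = split_question(question, prefix_ratio)
        if not prefix or not suffix:
            continue  # too short to split meaningfully
        total += 1
        # Normalize whitespace so formatting differences are not counted as misses.
        completion = " ".join(generate(prefix).split())
        if completion == suffix:
            hits += 1
    return hits / total if total else 0.0
```

A high rate on a benchmark that predates the model, next to a rate near zero on a benchmark released afterwards, is the signature the note treats as evidence of memorization.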

On a fully clean benchmark (RandomCalculation — synthetic arithmetic expressions generated after Qwen's release): correct rewards deliver consistent gains surpassing the model's performance ceiling; random rewards make training highly unstable with no reliable improvement; inverse rewards rapidly erode mathematical reasoning ability.
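The clean-benchmark setup above might be approximated with a sketch like the one below. It is illustrative only: the actual RandomCalculation generator is not described in this note, so the expression format, the operand range, and the function names (`make_expression`, `correct_reward`, `random_reward`, `inverse_reward`) are all assumptions. In particular, treating the inverse reward as the flipped correct reward and the random reward as a fair coin flip are guesses at the simplest reading.

```python
import random
from typing import Tuple

# Integer-only operators keep every ground-truth value exact.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def make_expression(
    rng: random.Random, n_steps: int = 3, max_operand: int = 20
) -> Tuple[str, int]:
    """Return (expression_text, exact_value) for a freshly generated problem.

    The expression is fully parenthesized, so its value is unambiguous.
    """
    value = rng.randint(1, max_operand)
    text = str(value)
    for _ in range(n_steps):
        op = rng.choice(sorted(OPS))
        operand = rng.randint(1, max_operand)
        text = f"({text} {op} {operand})"
        value = OPS[op](value, operand)
    return text, value


def correct_reward(answer: int, truth: int) -> float:
    """1.0 when the model's answer matches the ground truth, else 0.0."""
    return 1.0 if answer == truth else 0.0


def random_reward(rng: random.Random, p: float = 0.5) -> float:
    """Reward that ignores the answer entirely (assumed: Bernoulli with rate p)."""
    return 1.0 if rng.random() < p else 0.0


def inverse_reward(answer: int, truth: int) -> float:
    """Assumed definition: reward only wrong answers."""
    return 1.0 - correct_reward(answer, truth)
```

Because every problem is generated on demand, none of them can have appeared in pretraining data, which is what makes the comparison between the three reward types informative.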

This directly challenges the note "Why do random rewards improve reasoning for some models but not others?". The prior interpretation, that any optimization pressure activates pretraining strategies, may conflate two effects: genuine strategy activation (possible) and recall of memorized answers triggered by format-similar optimization (likely for contaminated benchmarks). On clean data, the "any reward works" finding evaporates for random and inverse signals.

The practical implication: RLVR research conclusions drawn from MATH-500 and similar benchmarks for Qwen models should be interpreted with caution. Reward engineering may matter more than the spurious-reward literature suggests: results on contaminated benchmarks may have measured memorization recovery, not reasoning improvement.

Inquiring lines that read this note 48

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How does memorization interact with learning and generalization?

Why do benchmark improvements fail to reflect actual reasoning quality?

Can language model hallucination be prevented or only managed?

How much does ROUGE metric choice inflate hallucination detection claims?

Can ensemble evaluation methods reduce bias more than single judges?

How can identical external performance mask different internal representations?

How can process reward models supervise complex reasoning traces?

What properties determine whether reward signals teach genuine reasoning?

What information do numerical rewards fail to provide for reasoning tasks?

What constrains reinforcement learning's ability to expand model reasoning?

How do training data properties shape reasoning capability development?

How do we evaluate AI systems when user perception misleads actual performance?

Can model confidence signals reliably improve reasoning quality and calibration?

What makes mathematically confident but incorrect answers resemble valid solution shapes?

Does reinforcement learning teach reasoning or just when to reason?

Can inference-time compute substitute for scaling up model parameters?

Can test-time scaling work through retrieval rather than reasoning?

Can single-axis benchmarks accurately predict agent deployment success?

Do base models contain latent reasoning that training can unlock?

What pretraining formats encode latent reasoning strategies that RLVR can surface?

How can AI systems learn from failures without cascading errors?

Can trustworthy scoring prevent persistent iteration from compounding errors?

Related concepts in this collection 4

Why do random rewards improve reasoning for some models but not others?
When RLVR training uses meaningless reward signals, some models gain reasoning improvements while others don't. What determines which models can benefit from optimization pressure without meaningful feedback?
directly challenged: the "code reasoning" activation story may be contamination-assisted memorization recall

Why does RLVR work with completely random rewards?
RLVR improves reasoning performance even with incorrect or random reward signals. This challenges the assumption that reward quality determines learning outcomes and raises questions about what RLVR is actually doing.
writing angle that needs qualification: the reward may matter after all, on clean benchmarks

Does RLVR actually expand what models can reason about?
Explores whether reinforcement learning from verifiable rewards teaches models genuinely new reasoning skills or simply makes existing capabilities more reliable. Pass@k analysis suggests the latter.
consistent: RLVR narrows rather than expands, and contamination inflates apparent gains

How much of LLM few-shot ability comes from training data?
Do large language models genuinely learn from a few examples, or are they mostly recognizing patterns from their training data? This matters for understanding what LLMs can actually do.
broader contamination phenomenon: RLVR contamination is benchmark-specific memorization; task contamination challenges the entire few-shot evaluation paradigm
