INQUIRING LINE

How much RLVR improvement comes from benchmark data memorization?

This explores whether RLVR's reported gains are real reasoning improvements or just the model regurgitating benchmark answers it already memorized — and how to tell the two apart.


The corpus is unusually direct on this: a large share of the headline RLVR improvement on popular math benchmarks is contamination, not reasoning. The sharpest evidence comes from a study in which Qwen2.5-Math-7B reconstructed 54.6% of MATH-500 from partial prompts, meaning the benchmark content was already baked into pretraining, yet scored 0.0% on LiveMathBench, a benchmark released *after* the model was trained (see: Does RLVR success on math benchmarks reflect genuine reasoning improvement?). On clean, post-release benchmarks the picture changes completely: only genuinely correct rewards help, while random or inverted rewards do nothing or actively degrade reasoning. That before/after gap is the cleanest single estimate of how much 'improvement' was memorization.
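The partial-prompt test can be sketched in a few lines. This is a minimal illustration, not the study's actual protocol: the 60% split point, the verbatim-match criterion, and the names `split_problem`, `reconstruction_rate`, `memorizing_model`, and `clean_model` are all assumptions made for the example, and the two toy problems are invented.

```python
from typing import Callable, Iterable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't count."""
    return " ".join(text.lower().split())


def split_problem(problem: str, keep_fraction: float = 0.6) -> tuple[str, str]:
    """Split a problem statement into a visible prefix and a hidden remainder (by words)."""
    words = problem.split()
    cut = max(1, int(len(words) * keep_fraction))
    return " ".join(words[:cut]), " ".join(words[cut:])


def reconstruction_rate(
    problems: Iterable[str],
    complete: Callable[[str], str],
    keep_fraction: float = 0.6,
) -> float:
    """Fraction of problems whose hidden remainder the model reproduces verbatim."""
    hits = total = 0
    for problem in problems:
        prefix, hidden = split_problem(problem, keep_fraction)
        if not hidden:  # problem too short to hide anything
            continue
        total += 1
        if normalize(complete(prefix)).startswith(normalize(hidden)):
            hits += 1
    return hits / total if total else 0.0


# Toy stand-ins for a real model (hypothetical, for illustration only).
PROBLEMS = [
    "Find the sum of all positive integers n such that n squared is less than fifty",
    "A triangle has sides of length three four and five what is its area",
]


def memorizing_model(prefix: str) -> str:
    """Behaves like a contaminated model: it has seen the benchmark text."""
    for problem in PROBLEMS:
        if problem.startswith(prefix):
            return problem[len(prefix):]
    return ""


def clean_model(prefix: str) -> str:
    """Behaves like an uncontaminated model: it cannot recall the remainder."""
    return "so we need to compute the answer step by step"


print(reconstruction_rate(PROBLEMS, memorizing_model))
print(reconstruction_rate(PROBLEMS, clean_model))
```

A high rate on an old benchmark paired with a near-zero rate on a post-release one is the signature described above. A real audit would plug an actual model call into `complete` and would likely use a softer similarity measure than verbatim matching.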

The useful twist is that memorization and genuine reasoning activation aren't mutually exclusive: they're measured at different levels and can both be happening at once (see: Can genuine reasoning activation coexist with contaminated benchmarks?). So the right question isn't 'is RLVR fake?' but 'which slice of the number is contamination?' This reframes a lot of the surrounding work. Even when RLVR isn't memorizing, what it adds is often narrower than the benchmark score implies. It improves the *coherence* of reasoning traces (fewer logical jumps between adjacent steps) without guaranteeing the proof is globally valid (see: Does RLVR actually improve mathematical reasoning or just coherence?). And it sharpens sampling efficiency rather than expanding what the model can actually solve: base models still beat RLVR models at high pass@k, that is, when each problem gets many sampled attempts (see: Does RLVR actually expand what models can reason about?).
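To make the pass@k point concrete, here is a small sketch using the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The per-problem counts below are invented purely to show how the curves can cross; they are not results from any paper, and `mean_pass_at_k` is a helper name chosen for this example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are wrong, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems, given each problem's count of correct samples."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts of correct samples out of n = 64 for four problems.
N_SAMPLES = 64
BASE_COUNTS = [4, 2, 1, 3]    # base model: rarely right, but right somewhere on every problem
RLVR_COUNTS = [60, 58, 0, 0]  # RLVR model: reliable on two problems, never right on the others

for k in (1, 8, 64):
    base = mean_pass_at_k(BASE_COUNTS, N_SAMPLES, k)
    rlvr = mean_pass_at_k(RLVR_COUNTS, N_SAMPLES, k)
    print(f"k={k:>2}  base={base:.3f}  rlvr={rlvr:.3f}")
```

In this toy setup the RLVR-style model wins at k=1 because it concentrates probability on answers it already finds, while the base-style model wins at k=64 because it still reaches problems the narrowed model never solves.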

The deepest clue about why memorization can masquerade as reasoning comes from the 'spurious rewards' line: Qwen2.5-Math gains 16–25% on MATH-500 from *random or incorrect* rewards, while Llama and OLMo gain nothing (see: Why do random rewards improve reasoning for some models but not others?). If a meaningless reward signal produces double-digit gains, the gain can't come from teaching new reasoning. It comes from surfacing latent behavior (and, on contaminated data, latent answers) already present from pretraining. The broader synthesis (see: What does reward learning actually do to model reasoning?) makes the pattern explicit: a single training example can suffice for activation, and spurious rewards work nearly as well as correct ones for models with the right pretraining. That is exactly the fingerprint you'd expect if much of the lift is recall plus format selection rather than learning.
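For readers unfamiliar with the setup, the three reward conditions can be sketched as below. This is an illustrative assumption about how such rewards might be defined, not the reward code of any cited study: exact string matching, the 0.5 success probability, and the function names are all choices made for the example.

```python
import random
from typing import Callable

RewardFn = Callable[[str, str], float]


def correct_reward(answer: str, reference: str) -> float:
    """Verifiable reward: 1.0 only when the answer matches the reference."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def inverted_reward(answer: str, reference: str) -> float:
    """Incorrect reward: pays out exactly when the answer is wrong."""
    return 1.0 - correct_reward(answer, reference)


def make_random_reward(p: float = 0.5, seed: int = 0) -> RewardFn:
    """Random reward: pays out with probability p, ignoring the answer entirely."""
    rng = random.Random(seed)

    def reward(answer: str, reference: str) -> float:
        return 1.0 if rng.random() < p else 0.0

    return reward


random_reward = make_random_reward()
for answer in ("42", "41"):
    print(
        answer,
        correct_reward(answer, "42"),
        inverted_reward(answer, "42"),
        random_reward(answer, "42"),
    )
```

The random reward carries no information about correctness, and the inverted one carries the opposite information. Any benchmark gain under either condition therefore has to come from something the optimization process surfaces, not from anything the reward teaches.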

That 'format selection' angle adds one more layer worth knowing: RL doesn't broadly rewrite the model. It amplifies one dominant pretraining format while suppressing alternatives (see: Does RL training collapse format diversity in pretrained models?), and it updates only 5–30% of parameters, in sparse but nearly full-rank subnetworks (see: Does reinforcement learning update only a small fraction of parameters?). A process that narrow is structurally better at *retrieving* what's already there than at building new capability, which is the mechanistic reason memorized benchmark content gets so efficiently re-exposed.
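The two quantities in that claim, how many parameters moved and the rank of the update, are straightforward to measure given a base and a fine-tuned checkpoint. The sketch below uses NumPy on a synthetic weight matrix; the 20% update density, the matrix size, and the helper names `update_fraction` and `update_rank` are assumptions for illustration, and a real measurement would load actual checkpoints instead.

```python
import numpy as np


def update_fraction(base: dict, tuned: dict, tol: float = 0.0) -> float:
    """Fraction of individual parameters whose value changed by more than tol."""
    changed = 0
    total = 0
    for name, w_base in base.items():
        delta = tuned[name] - w_base
        changed += int(np.count_nonzero(np.abs(delta) > tol))
        total += delta.size
    return changed / total


def update_rank(w_base: np.ndarray, w_tuned: np.ndarray) -> int:
    """Numerical rank of the update matrix (tuned minus base)."""
    return int(np.linalg.matrix_rank(w_tuned - w_base))


# Synthetic example: perturb roughly 20% of the entries of a 64x64 matrix.
rng = np.random.default_rng(0)
w0 = rng.normal(size=(64, 64))
mask = rng.random((64, 64)) < 0.2
w1 = w0 + mask * rng.normal(scale=0.01, size=(64, 64))

base_ckpt = {"layer.weight": w0}
tuned_ckpt = {"layer.weight": w1}

print("fraction updated:", update_fraction(base_ckpt, tuned_ckpt))
print("update rank:", update_rank(w0, w1), "of", min(w0.shape))
```

The toy case shows why sparse and nearly full-rank are compatible: an update that touches a scattered minority of entries can still span almost every direction of the matrix, unlike a low-rank update that touches every entry along a few directions.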

The corpus doesn't give a single universal percentage, and it shouldn't, because the contamination share depends on the model and the benchmark. But here is the takeaway you might not have gone looking for: the cleanest way to measure RLVR's *real* contribution is to test it on benchmarks released after the model's training cutoff. When researchers do that, gains shrink dramatically, the 'random rewards work too' effect vanishes, and what survives is modest: better sampling and tidier traces, not expanded reasoning (see: Does RLVR actually expand what models can reason about?).


Sources (8 notes)

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.

Can genuine reasoning activation coexist with contaminated benchmarks?

RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.

Does RLVR actually improve mathematical reasoning or just coherence?

RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

Why do random rewards improve reasoning for some models but not others?

Qwen2.5-Math gains 16-25% MATH-500 improvement from random or incorrect rewards by activating latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst auditing RLVR claims on math benchmarks. The question: how much of reported RLVR gain is real reasoning improvement vs. benchmark-data memorization?

What a curated library found — and when (dated claims, not current truth):
Findings span April 2025–May 2026. A large slice of RLVR gains on popular math benchmarks comes from data contamination, not reasoning:
• Qwen2.5-Math-7B reconstructed 54.6% of MATH-500 from partial prompts yet scored 0.0% on the post-release LiveMathBench; the gap between the two is the cleanest estimate of memorization's share (2025-07).
• On contaminated benchmarks, random or inverted rewards still produce 16–25% gains for Qwen, signaling latent recall + format selection rather than learning (2025-07).
• RLVR improves local trace coherence (fewer logical jumps) without guaranteeing global proof validity (2025-10).
• RL updates only 5–30% of parameters, in sparse but nearly full-rank subnetworks, a mechanism structurally better at retrieval than capability expansion (2025-05).
• When tested on unseen benchmarks, RLVR gains shrink, spurious-reward effects vanish, and surviving improvements are modest — sampling efficiency and tidier traces (2025-07).

Anchor papers (verify; mind their dates):
• arXiv:2507.10532 (2025-07): Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contam.
• arXiv:2504.13837 (2025-04): Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base.
• arXiv:2510.18176 (2025-10): Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains.
• arXiv:2505.11711 (2025-05): Reinforcement Learning Finetunes Small Subnetworks in Large Language Models.

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 54.6% reconstruction claim, the post-release benchmark null result, and the spurious-reward effects: has newer tooling (e.g., robust contamination detection, larger held-out test sets), training regime shifts (curriculum, adversarial data), or model scaling since mid-2025 relaxed or overturned any of these? Distinguish the durable question (what fraction of gain is truly capability vs. retrieval?) from perishable limits (e.g., if model scale or data freshness has moved, does the contamination signature persist?).
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — any paper claiming RLVR *does* expand reasoning boundaries or that memorization is negligible on recent models/benchmarks.
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., does multimodal RLVR or chain-of-thought scaffolding reduce memorization's share? Do post-training ensembles (multiple RL stages) escape the single-format trap?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
