How much RLVR improvement comes from benchmark data memorization?
This explores whether RLVR's reported gains are real reasoning improvements or just the model regurgitating benchmark answers it already memorized — and how to tell the two apart.
The corpus is unusually direct on this: a large share of the headline RLVR improvement on popular math benchmarks is contamination, not reasoning. The sharpest evidence comes from a study in which Qwen2.5-Math-7B reconstructed 54.6% of MATH-500 from partial prompts, meaning the benchmark was already in its pretraining data, yet scored 0.0% on LiveMathBench, a benchmark released *after* the model was trained (see: Does RLVR success on math benchmarks reflect genuine reasoning improvement?). On clean, post-release benchmarks like that one, the picture changes completely: only genuinely correct rewards help, while random or inverted rewards do nothing or actively degrade reasoning. That before/after gap is the cleanest single estimate of how much 'improvement' was memorization.
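The partial-prompt test described above can be sketched roughly as follows. This is a minimal illustration, not the study's actual protocol: the `complete` callable stands in for whatever model call you use, and the 0.6 prefix fraction and the whitespace-insensitive exact-match criterion are assumptions chosen for the example.

```python
from typing import Callable, Iterable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def reconstruction_rate(
    problems: Iterable[str],
    complete: Callable[[str], str],  # hypothetical model wrapper: prefix -> continuation
    prefix_fraction: float = 0.6,    # assumed split point, not taken from the source
) -> float:
    """Fraction of problems whose held-out tail the model reproduces verbatim."""
    hits = 0
    total = 0
    for problem in problems:
        cut = int(len(problem) * prefix_fraction)
        prefix, tail = problem[:cut], problem[cut:]
        if not tail.strip():
            continue  # nothing was held out, so the problem is uninformative
        total += 1
        continuation = complete(prefix)
        # Count a hit only if the continuation begins with the true tail.
        if normalize(continuation).startswith(normalize(tail)):
            hits += 1
    return hits / total if total else 0.0
```

Running the same function on a pre-cutoff benchmark and a post-cutoff benchmark gives two rates to compare. A high rate on the first and a near-zero rate on the second is the contamination signature the paragraph describes.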
The useful twist is that memorization and real activation aren't mutually exclusive: they're measured at different levels and can both be happening at once (see: Can genuine reasoning activation coexist with contaminated benchmarks?). So the right question isn't 'is RLVR fake?' but 'which slice of the number is contamination?' This reframes a lot of the surrounding work. Even when RLVR isn't memorizing, what it adds is often narrower than the benchmark score implies. It improves the *coherence* of reasoning traces (fewer logical jumps between adjacent steps) without guaranteeing the proof is globally valid (see: Does RLVR actually improve mathematical reasoning or just coherence?), and it sharpens sampling efficiency rather than expanding what the model can actually solve: base models still beat RLVR models at high pass@k (see: Does RLVR actually expand what models can reason about?).
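To make the pass@k point concrete, here is a small sketch using the widely used unbiased estimator, pass@k = 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The source does not say which estimator the cited work uses, so treat that choice as an assumption. The per-problem counts below are invented purely for illustration and are not results from any study.

```python
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: List[int], n: int, k: int) -> float:
    """Average pass@k over problems, given each problem's correct-sample count."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Illustrative counts only (64 samples per problem, two problems):
# the base model solves both problems rarely; the RL-tuned model solves
# one problem reliably and the other never.
N_SAMPLES = 64
base_counts = [8, 2]
rl_counts = [48, 0]

for k in (1, 64):
    base_score = mean_pass_at_k(base_counts, N_SAMPLES, k)
    rl_score = mean_pass_at_k(rl_counts, N_SAMPLES, k)
    print(f"k={k}: base={base_score:.3f} rl={rl_score:.3f}")
```

With these made-up counts the tuned model wins at k=1 and the base model wins at k=64, which is the shape of the crossover the paragraph describes: sharper sampling, but no new problems solved.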
The deepest clue about why memorization can masquerade as reasoning comes from the 'spurious rewards' line: Qwen2.5-Math gains 16–25% on MATH-500 from *random or incorrect* rewards, while Llama and OLMo gain nothing (see: Why do random rewards improve reasoning for some models but not others?). If a meaningless reward signal produces double-digit gains, the gain can't be teaching new reasoning. It is surfacing latent behavior (and, on contaminated data, latent answers) already present from pretraining. The broader synthesis (see: What does reward learning actually do to model reasoning?) makes the pattern explicit: a single training example can suffice for activation, and spurious rewards work nearly as well as correct ones for models with the right pretraining. That is exactly the fingerprint you'd expect if much of the lift is recall plus format selection rather than learning.
That 'format selection' angle adds one more layer worth knowing: RL doesn't broadly rewrite the model. It amplifies one dominant pretraining format while suppressing alternatives (see: Does RL training collapse format diversity in pretrained models?), touching only 5–30% of parameters in sparse, nearly full-rank subnetworks (see: Does reinforcement learning update only a small fraction of parameters?). A process that narrow is structurally better at *retrieving* what's already there than at building new capability, which is the mechanistic reason memorized benchmark content gets so efficiently re-exposed.
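One rough way to check the "small fraction of parameters" claim on your own checkpoints is to count how many weights actually differ between the base and the RL-tuned model. The sketch below assumes both checkpoints have been loaded into plain dictionaries of flat float lists. The function name, the dictionary layout, and the tolerance are hypothetical choices for illustration, and the sketch says nothing about the rank of the update.

```python
from typing import Dict, List


def update_fraction(
    base: Dict[str, List[float]],
    tuned: Dict[str, List[float]],
    tol: float = 0.0,  # assumed threshold; a change must exceed this to count
) -> float:
    """Fraction of parameters whose value changed by more than tol."""
    changed = 0
    total = 0
    for name, base_vals in base.items():
        tuned_vals = tuned[name]
        if len(base_vals) != len(tuned_vals):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        for b, t in zip(base_vals, tuned_vals):
            total += 1
            if abs(t - b) > tol:
                changed += 1
    return changed / total if total else 0.0
```

A result in the 0.05 to 0.30 range would be consistent with the figure quoted above. Whether a given tolerance is appropriate depends on the numeric precision of the checkpoints.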
The corpus doesn't give a single universal percentage, and it shouldn't, because the contamination share depends on the model and benchmark. But here is the takeaway you might not have gone looking for: the cleanest way to measure RLVR's *real* contribution is to test it on benchmarks released after the model's training cutoff. When researchers do that, gains shrink dramatically, the 'random rewards work too' effect vanishes, and what survives is modest: better sampling and tidier traces, not expanded reasoning (see: Does RLVR actually expand what models can reason about?).
Sources (8 notes)
Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.
RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.
RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Qwen2.5-Math gains 16–25% on MATH-500 from random or incorrect rewards by activating latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Across seven RL algorithms and ten LLM families, RL updates only a sparse 5–30% of parameters without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.