Can genuine reasoning activation coexist with contaminated benchmarks?
RLVR shows both real behavioral changes and inflated metrics. Can these contradictory findings actually describe the same phenomenon from different angles, and what does that mean for evaluating reasoning improvements?
Two RLVR findings appeared contradictory:
Spurious rewards work: random rewards improve reasoning for some models but not others (see "Why do random rewards improve reasoning for some models but not others?"), suggesting that the reward signal itself matters less than the RL training process, which activates latent pretraining capabilities. This was treated as evidence that RLVR functions as a pretraining catalyst rather than a reasoning teacher.
Benchmark contamination: as examined in "Does RLVR success on math benchmarks reflect genuine reasoning improvement?", gains on math benchmarks may be inflated by contamination, so the metric improvement may reflect data memorization rather than genuine reasoning activation.
The resolution: These findings operate at different measurement levels and can coexist:
Behavioral activation (genuine): RL training with any reward signal activates code reasoning formats and structured thinking patterns that exist in the pretraining data but lie dormant. This is visible in changes to output format, thinking token usage, and exploration behavior, none of which are measurements contaminated by benchmark overlap.
Benchmark improvement (inflated): The metric improvement on contaminated benchmarks is partially or fully attributable to memorization. On clean benchmarks, gains from spurious rewards shrink or disappear, while correct rewards still produce improvements.
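The two-level reading above can be sketched as a small comparison: take the score gain on a possibly contaminated benchmark and check how much of it reappears on a clean one. This is a minimal sketch, not a method from the cited work. The names `Scores` and `unreplicated_gain` are hypothetical, the numbers in the usage section are made up for illustration, and the comparison assumes the two benchmarks are of comparable difficulty, which may not hold in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    """Accuracy (0.0 to 1.0) on one benchmark before and after RL training."""
    base: float
    trained: float

    @property
    def gain(self) -> float:
        # Raw improvement in accuracy attributable to training.
        return self.trained - self.base


def unreplicated_gain(contaminated: Scores, clean: Scores) -> float:
    """Portion of the contaminated-benchmark gain that does not reappear
    on the clean benchmark. Clamped at zero so that a larger clean gain
    is not reported as negative inflation."""
    return max(0.0, contaminated.gain - clean.gain)


if __name__ == "__main__":
    # Illustrative, made-up numbers: NOT results from any paper.
    spurious_contaminated = Scores(base=0.40, trained=0.60)
    spurious_clean = Scores(base=0.40, trained=0.41)
    gap = unreplicated_gain(spurious_contaminated, spurious_clean)
    print(f"gain not reproduced on clean benchmark: {gap:.2f}")
```

A large unreplicated gain under a spurious reward is consistent with memorization driving the metric; it says nothing about whether the behavioral changes are real, which have to be measured separately.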
The practical implication: RLVR research must separate behavioral measurements (how the model's reasoning process changes) from performance measurements (how benchmark scores change). Both are informative; conflating them produces confusion about what RLVR actually does. The one-shot activation finding (a single example triggering an improvement from 36% to 73.6%) may itself need re-evaluation on clean benchmarks.
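One way to keep the two kinds of measurement apart is to report them as separate fields computed from the same set of responses, so that a format change is never read off a score change or vice versa. The sketch below is an assumption-laden illustration: `CODE_PATTERN` is a hypothetical, crude heuristic for "code reasoning format", word count stands in for thinking token usage, and `Report` and `evaluate` are names invented here.

```python
import re
from dataclasses import dataclass

# Hypothetical heuristic: a response counts as "code reasoning" if any line
# starts with a Python-like statement. A real study would need a better detector.
CODE_PATTERN = re.compile(r"^\s*(def |import |print\()", re.MULTILINE)


@dataclass(frozen=True)
class Report:
    code_reasoning_rate: float   # behavioral: share of responses using code format
    mean_response_words: float   # behavioral: rough proxy for thinking length
    accuracy: float              # performance: share of correct final answers


def evaluate(responses: list[str], correct: list[bool]) -> Report:
    """Compute behavioral and performance measurements side by side.
    The behavioral fields never look at correctness labels."""
    if not responses or len(responses) != len(correct):
        raise ValueError("responses and correct must be non-empty and equal length")
    n = len(responses)
    code_rate = sum(bool(CODE_PATTERN.search(r)) for r in responses) / n
    mean_words = sum(len(r.split()) for r in responses) / n
    accuracy = sum(correct) / n
    return Report(code_rate, mean_words, accuracy)


if __name__ == "__main__":
    # Toy inputs for illustration only.
    demo = evaluate(
        ["import math\nprint(math.sqrt(16))", "The answer is 4"],
        [True, False],
    )
    print(demo)
```

Because the behavioral fields depend only on the response text, they can be compared before and after training even on a benchmark whose answers may have leaked into the pretraining data.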
Inquiring lines that use this note as a source (53)
This note is a source for these synthesized inquiries.
- What makes accountability and validity-orientation non-behavioral properties?
- What other hidden biases might aggregate metrics fail to distinguish from reasoning?
- How should we redesign benchmarks to catch conservative bias in reasoning tasks?
- Does good simulation eventually count as genuine realization?
- How much RLVR improvement comes from benchmark data memorization?
- Can clean benchmarks reveal true RLVR reasoning gains?
- How much does ROUGE metric choice inflate hallucination detection claims?
- How does treating synthetic data as empirical evidence contaminate statistical inference?
- Why do current RLVR methods fail to expand reasoning capability beyond base model boundaries?
- Why do benchmark designers treat content effects as confounds?
- Can reasoning benchmarks separate logic from believability?
- Can activation patching reveal which reasoning steps actually matter?
- How do surface correlations between narratives and answers mislead benchmark validity?
- How much do metric choices inflate claims about model capabilities?
- How do weight perturbations reveal what performance benchmarks cannot measure?
- How much reasoning catalyst data is actually needed for improvement?
- Does the replication crisis in psychology predict similar failures in machine behavior research?
- Why do invalid reasoning steps produce nearly the same performance gains?
- Should benchmarks measure trace length or whether constraints were actually satisfied?
- Why does combining reasoning distillation with RLVR outperform either training stage alone?
- What design changes if we separate behavior description from adoption justification goals?
- Does RLVR reward structure create pressure toward traces that look right?
- Are RLVR models worse than non-reasoning models for subjective annotation?
- What role do high-entropy minority tokens play in RLVR?
- How does task contamination differ from test set data leakage?
- Why do benchmark scores rise while reasoning quality declines?
- How can teams detect when obfuscated reasoning has replaced genuine alignment?
- How does tool access change what we measure in reasoning tests?
- Does RLVR expand model capability or reorganize existing capability?
- How do satisfaction scores differ from genuine cognitive improvement?
- Why do benchmark scores not capture the true nature of AI systems?
- How can we detect dishonesty in model outputs separate from capability failures?
- What deployment context determines which benchmark mode actually matters?
- Can benchmark improvements hide degradation of deliberative reasoning?
- Why does accumulated portfolio output not match accumulated worker capability?
- What makes a trajectory score interpretable across different interactive benchmarks?
- Why does medium difficulty outperform both easy and hard RLVR training samples?
- What distinguishes genuine capability gains from coherent but invalid reasoning traces?
- What evaluation methods actually measure reasoning versus execution capability?
- How does RPT compare to learning when versus how to deploy reasoning?
- Why do certain tokens at certain difficulties drive most of RLVR's learning signal?
- How much of MATH-500 improvement comes from data contamination versus real reasoning gains?
- Does RLVR teach new reasoning or activate existing pretraining capabilities?
- What pretraining formats encode latent reasoning strategies that RLVR can surface?
- Does careful reward engineering matter if pretraining determines RLVR effectiveness?
- Can combining SRL with RLVR outperform either method used alone?
- Does CoT reasoning actually cause the outputs that follow it?
- What training regimes confound surface mechanisms with their actual causes?
- Why do AI benchmarks show rapid saturation from near-zero to near-perfect?
- How do open-world evaluations correct distortions that automated benchmarks introduce?
- How do live human evaluations differ from ground-truth benchmarks?
- Why do static benchmarks miss frontier capabilities that open-world tasks reveal?
- What real-world tasks most clearly expose gaps between benchmark performance and actual capability?
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- The Invisible Leash: Why RLVR May Not Escape Its Origin
- Spurious Rewards: Rethinking Training Signals in RLVR
- Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
- From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR
- Do Theory of Mind Benchmarks Need Explicit Human-like Reasoning in Language Models?
- Absolute Zero: Reinforced Self-play Reasoning with Zero Data
- Eliciting Reasoning in Language Models with Cognitive Tools
Original note title
RLVR behavioral activation and benchmark improvement are separable — genuine pretraining activation can coexist with contamination-inflated metrics