SYNTHESIS NOTE

Can genuine reasoning activation coexist with contaminated benchmarks?

RLVR shows both real behavioral changes and inflated metrics. Can these contradictory findings actually describe the same phenomenon from different angles, and what does that mean for evaluating reasoning improvements?

Synthesis note · 2026-02-23 · sourced from Flaws

Two RLVR findings appeared contradictory:

Spurious rewards work: As discussed in "Why do random rewards improve reasoning for some models but not others?", random or otherwise spurious rewards can still improve reasoning for some models. This suggests the reward signal itself matters less than the RL training process, which activates latent pretraining capabilities. This was treated as evidence that RLVR functions as a pretraining catalyst rather than a reasoning teacher.

Benchmark contamination: As discussed in "Does RLVR success on math benchmarks reflect genuine reasoning improvement?", the benchmarks behind these gains may be contaminated, so the metric improvement may be data memorization rather than genuine reasoning activation.

The resolution: These findings operate at different measurement levels and can coexist:

Behavioral activation (genuine): RL training with any reward signal activates code reasoning formats and structured thinking patterns that are present in pretraining data but dormant. This is visible in output format changes, thinking token usage, and exploration behavior changes. None of these measurements is contaminated by benchmark overlap.
Benchmark improvement (inflated): The metric improvement on contaminated benchmarks is partially or fully attributable to memorization. On clean benchmarks, gains from spurious rewards are reduced or eliminated, while correct rewards still produce improvement.
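
The benchmark-level comparison can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from the underlying papers: the names Scores, contamination_gap, and classify are hypothetical, the tolerance of 0.01 is an arbitrary assumption, and accuracies are assumed to be fractions in [0, 1]. It only encodes the distinction drawn above, namely whether a gain seen on a contaminated benchmark is reduced, eliminated, or preserved on a clean one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    """Accuracy in [0, 1] before and after RL training on one benchmark."""
    before: float
    after: float

    @property
    def gain(self) -> float:
        return self.after - self.before


def contamination_gap(contaminated: Scores, clean: Scores) -> float:
    """Part of the contaminated-benchmark gain not reproduced on the clean one."""
    return contaminated.gain - clean.gain


def classify(contaminated: Scores, clean: Scores, tol: float = 0.01) -> str:
    """Label a result by whether its gain survives on the clean benchmark.

    tol is an assumed noise threshold; a real study would derive it from
    run-to-run variance rather than hard-coding it.
    """
    if contaminated.gain <= tol:
        return "no gain"
    if clean.gain <= tol:
        return "gain eliminated on clean benchmark"
    if contamination_gap(contaminated, clean) > tol:
        return "gain reduced on clean benchmark"
    return "gain holds on clean benchmark"
```

Running classify once per reward type (spurious and correct) on the same pair of benchmarks gives the comparison the resolution depends on: the note's claim corresponds to spurious rewards landing in the "reduced" or "eliminated" categories while correct rewards still show a gain on the clean benchmark.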

The practical implication: RLVR research must separate behavioral measurements (how the model's reasoning process changes) from performance measurements (how benchmark scores change). Both are informative; conflating them produces confusion about what RLVR actually does. The one-shot activation finding (a single example triggering an improvement from 36% to 73.6%) may itself need re-evaluation on clean benchmarks.
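
One way to keep the two kinds of measurement from being conflated is to record them in separate structures that never share a field. The sketch below is illustrative only: BehavioralReport, PerformanceReport, CODE_MARKERS, and the helper functions are hypothetical names, detecting "code reasoning format" by substring match is a crude assumed proxy, and whitespace token counts stand in for real thinking-token counts.

```python
from dataclasses import dataclass

# Assumed, simplistic markers of a code-style reasoning format.
CODE_MARKERS = ("def ", "import ", "print(")


@dataclass(frozen=True)
class BehavioralReport:
    code_format_rate: float    # share of outputs containing a code marker
    mean_output_tokens: float  # whitespace-token proxy for reasoning length


@dataclass(frozen=True)
class PerformanceReport:
    benchmark: str
    is_clean: bool   # True if the benchmark is believed free of training overlap
    accuracy: float  # fraction in [0, 1]


def behavioral_report(outputs: list[str]) -> BehavioralReport:
    """Summarize how the model writes, without reference to correctness."""
    if not outputs:
        raise ValueError("outputs must not be empty")
    with_code = sum(any(m in o for m in CODE_MARKERS) for o in outputs)
    tokens = sum(len(o.split()) for o in outputs)
    return BehavioralReport(with_code / len(outputs), tokens / len(outputs))


def behavioral_delta(before: BehavioralReport,
                     after: BehavioralReport) -> dict[str, float]:
    """Change in each behavioral measurement after RL training."""
    return {
        "code_format_rate": after.code_format_rate - before.code_format_rate,
        "mean_output_tokens": after.mean_output_tokens - before.mean_output_tokens,
    }


def clean_results(reports: list[PerformanceReport]) -> list[PerformanceReport]:
    """Keep only performance results measured on clean benchmarks."""
    return [r for r in reports if r.is_clean]
```

Because behavioral_report never sees labels or scores, a change in its output cannot be explained by benchmark memorization, which is the property the note relies on when it treats behavioral activation as genuine.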

Inquiring lines that read this note 58

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Is model self-awareness based on genuine introspection or pattern matching?

What makes accountability and validity-orientation non-behavioral properties?

Why do benchmark improvements fail to reflect actual reasoning quality?

How do we evaluate AI systems when user perception misleads actual performance?

How does memorization interact with learning and generalization?

Can language model hallucination be prevented or only managed?

How much does ROUGE metric choice inflate hallucination detection claims?

Why can't humans reliably detect AI-generated text despite measurable linguistic signatures?

How does treating synthetic data as empirical evidence contaminate statistical inference?

What constrains reinforcement learning's ability to expand model reasoning?

How can identical external performance mask different internal representations?

Why do reasoning models fail at systematic problem-solving and search?

Can activation patching reveal which reasoning steps actually matter?

Do base models contain latent reasoning that training can unlock?

Do corrupted reasoning traces serve as effective supervision signals?

Why do invalid reasoning steps produce nearly the same performance gains?

Does reinforcement learning teach reasoning or just when to reason?

How do evaluation mechanisms prevent error accumulation in autonomous research systems?

Does alignment training create blind spots in detecting genuine safety threats?

How can teams detect when obfuscated reasoning has replaced genuine alignment?

Can single-axis benchmarks accurately predict agent deployment success?

What mechanisms enable AI systems to generate and spread false beliefs?

How can we detect dishonesty in model outputs separate from capability failures?

How does AI adoption affect human skill development and labor equality?

Why does accumulated portfolio output not match accumulated worker capability?

Do reasoning traces faithfully represent or merely mimic actual model reasoning?

What distinguishes genuine capability gains from coherent but invalid reasoning traces?

What actually drives chain-of-thought reasoning improvements in language models?

Does CoT reasoning actually cause the outputs that follow it?

Does externalizing cognitive work and state improve agent reliability?

Do inspectable skill artifacts guarantee the behavior matches the person it claims to ground?

Can genuine reasoning activation coexist with contaminated benchmarks?
