INQUIRING LINE

Can one training example activate mathematical reasoning in RL-trained models?

This explores the surprising finding that a single training example can unlock math reasoning in models trained with reinforcement learning from verifiable rewards (RLVR) — and what that says about whether RL teaches reasoning or merely switches on ability the model already had.


The corpus has a striking direct answer: yes. A single example in RLVR can lift math performance from 36% to 73.6%, and, stranger still, test accuracy keeps climbing for 1,400 steps after training accuracy has already hit 100% (Can a single training example unlock mathematical reasoning?). That a lone example does so much is the tell: the model isn't learning math from that example. It's being switched on.

That reframing is the through-line across the collection. Several independent lines of work converge on the idea that base models already carry reasoning ability in latent form, and training merely elicits it. One synthesis catalogs five separate mechanisms — RL steering, critique fine-tuning, decoding tweaks, sparse-autoencoder feature steering, and RLVR — that all surface reasoning already sitting in base-model activations, concluding the bottleneck is elicitation, not capability (Do base models already contain hidden reasoning ability?). A related strand argues RL teaches *when* to reason rather than *how*: a hybrid model recovered 91% of the performance gains using only 12% of the tokens, suggesting RL acts as a deployment-timing optimizer (Does RL post-training create reasoning or just deploy it?, Does RL teach reasoning or just when to use it?). Pulled together, the 'one example' result stops looking like a fluke and starts looking like a prediction of the activation view (What does reward learning actually do to model reasoning?).

Here's the part you might not expect: if RL is mostly flipping a switch, then the *quality* of the reward signal should matter less than its existence, and that's exactly what shows up. Spurious rewards work nearly as well as correct ones for models with the right pretraining (What does reward learning actually do to model reasoning?). But that same finding carries a warning. On contaminated benchmarks, RLVR's apparent gains turn out to be memorization, not reasoning: Qwen2.5-Math-7B reconstructed 54.6% of MATH-500 from partial prompts yet scored 0.0% on the post-release LiveMathBench, and on clean data only genuinely correct rewards helped (Does RLVR success on math benchmarks reflect genuine reasoning improvement?). So 'one example activates reasoning' and 'the gains are just memorization' aren't contradictions; they describe what activation does and doesn't buy you.

The corpus also pushes back on the tidy 'RL only elicits, never expands' story. Prolonged RL with KL control, policy resetting, and non-mathematical tasks can discover genuinely novel strategies that base models can't reach at any sampling budget — outperforming them across all pass@k levels (Can reinforcement learning discover reasoning strategies base models cannot?). And activation alone doesn't guarantee correctness: RLVR measurably improves the coherence between adjacent reasoning steps without making the overall proof valid — locally smooth, globally wrong (Does RLVR actually improve mathematical reasoning or just coherence?). The single example wakes the reasoning up; it doesn't make the reasoning true.
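The difference between sharpening what a model can already do and expanding it is easiest to see in how pass@k behaves as k grows. The sketch below uses the commonly cited unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), for n samples of which c are correct. The notes here don't say how their pass@k figures were computed, so treat the estimator choice as an assumption, and the sample counts as made up purely for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples, drawn
    without replacement from n generations with c correct, is correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k draws include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts for a single problem, 100 samples per model (illustrative only).
N = 100
base_correct = 2    # base model: rarely right, but not never
tuned_correct = 60  # after RL: right most of the time

for k in (1, 10, 100):
    base = pass_at_k(N, base_correct, k)
    tuned = pass_at_k(N, tuned_correct, k)
    print(f"k={k:>3}  base={base:.3f}  tuned={tuned:.3f}")
```

In this toy case the tuned model's lead at k=1 disappears by k=100, which is the pattern the elicitation view predicts. A problem with zero correct base samples scores 0 at every k, and only there would a nonzero tuned score point to something the base model could not reach by sampling.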

If you want to follow the thread further, the corpus branches into how to make that activated reasoning useful: curriculum approaches that run imitation first to create rollouts worth sharpening (Does sequencing imitation then exploration training improve reasoning?), verifier-free reward signals for domains where answers can't be checked (Can reasoning improvement work without answer verification?), and the discovery that binary correctness rewards quietly wreck calibration by rewarding confident guessing (Does binary reward training hurt model calibration?). The one-example result is the door; behind it is a whole debate about whether we're growing reasoning or just learning to find the light switch.


Sources (11 notes)

Can a single training example unlock mathematical reasoning?

A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Does RL teach reasoning or just when to use it?

Pre-training acquires reasoning capability; RL teaches efficient deployment. A hybrid model combining base reasoning with thinking model steering recovered 91% of performance gains using only 12% of tokens, suggesting RL acts as a deployment optimizer rather than a capability creator.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.

Can reinforcement learning discover reasoning strategies base models cannot?

RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.

Does RLVR actually improve mathematical reasoning or just coherence?

RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Can reasoning improvement work without answer verification?

VeriFree bypasses answer verification entirely by using the conditional probability of reference answers given generated reasoning traces as both reward signal and training weight. This approach matches or surpasses verifier-based methods on MMLU-Pro, GPQA, and SuperGPQA without rule-based or model-based verifiers.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
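To make the calibration point concrete, here is a minimal sketch of a correctness reward with a Brier-style penalty added. It assumes the combined reward is the 0/1 correctness outcome minus the squared gap between the model's stated confidence and that outcome; the note above doesn't spell out the exact form, and the function names are hypothetical.

```python
def binary_reward(correct: bool) -> float:
    """Plain correctness reward: stated confidence never enters."""
    return 1.0 if correct else 0.0


def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness minus a Brier penalty on stated confidence (assumed form)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2


def expected_calibrated_reward(p_correct: float, confidence: float) -> float:
    """Average reward when the answer is right with probability p_correct."""
    return (p_correct * calibrated_reward(True, confidence)
            + (1.0 - p_correct) * calibrated_reward(False, confidence))


def best_confidence(p_correct: float, steps: int = 100) -> float:
    """Grid-search the stated confidence that maximizes expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_calibrated_reward(p_correct, q))


# A wrong answer earns 0 under the binary reward however confident it was;
# with the penalty, the confident miss costs more than the hedged one.
for q in (0.1, 0.5, 0.9):
    print(q, binary_reward(False), round(calibrated_reward(False, q), 2))

# The confidence that maximizes expected reward tracks the true success rate.
print(best_confidence(0.3))
```

Because the squared-error penalty is a proper scoring rule, the expected reward peaks when stated confidence equals the true chance of being right, so under this assumed form honest confidence is never traded away for accuracy.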

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about mathematical reasoning activation in RL-trained LLMs. The core question: does a single training example genuinely activate dormant reasoning, or does the effect dissolve under scrutiny as models improve, reward signals grow more sophisticated, or evaluation hardens?

What a curated library found — and when (findings span Feb 2024–Dec 2025, treat as dated claims):
• One RLVR example lifts math performance 36% → 73.6%; test accuracy climbs 1,400 steps after training hits 100% (arXiv:2504.20571, ~Apr 2025).
• Five independent mechanisms — RL steering, critique fine-tuning, decoding, sparse autoencoders, RLVR — all surface pre-existing latent reasoning; bottleneck is elicitation not capability (synthesis, ~2025).
• RL teaches *when* to reason, not *how*; hybrid model recovered 91% gains with 12% tokens (arXiv:2510.07364, ~Oct 2025).
• On contaminated benchmarks, RLVR gains are memorization: one model scored 0% on clean data after 54.6% partial-prompt reconstruction (arXiv:2507.10532, ~Jul 2025).
• Prolonged RL with KL control and policy reset discovers genuinely novel strategies inaccessible to base models at any sampling budget (arXiv:2505.24864, ~May 2025).

Anchor papers (verify; mind their dates):
- arXiv:2504.20571 (Apr 2025): One-example RLVR spike.
- arXiv:2510.07364 (Oct 2025): When vs. how framing.
- arXiv:2507.10532 (Jul 2025): Contamination and memorization.
- arXiv:2505.24864 (May 2025): Prolonged RL novelty discovery.

Your task:
(1) RE-TEST EACH CONSTRAINT. For the one-example claim, does scaling, improved pretraining, or post-hoc verifiers since Apr 2025 suggest activation now requires *multiple* examples or stronger reward signals? Does the 'when not how' thesis hold under newer model scales? On contaminated benchmarks: have evaluation standards tightened to isolate genuine reasoning from memorization, or does the 54.6% reconstruction → 0% clean gap persist? For prolonged RL: does it still discover novel strategies, or do modern base models already contain those paths? Separate the durable question from perishable limitations; cite what resolved each.
(2) Surface the strongest work from the last ~6 months that contradicts or supersedes the activation narrative — does any recent paper argue RL *does* add capability, not just flip a switch?
(3) Propose 2 research questions that assume the regime may have moved: one testing whether activation is *necessary* (can memorization + calibration defeat reasoning gain?) and one testing *sufficiency* (does activation without verifier feedback scale to harder domains?).

Cite arXiv IDs; flag anything you cannot ground in a real paper.
