INQUIRING LINE

One training example doubled an AI's math accuracy, then kept improving even after the model had fully memorized it.

Can a single correct example seed exponential improvement in mathematical reasoning?

This explores whether one correct example can trigger outsized gains in math reasoning — and the corpus suggests the gain is real but is better understood as *unlocking* latent ability than as compounding new skill from a single seed.

The most direct evidence says yes: a single training example in RLVR (reinforcement learning with verifiable rewards) lifted math accuracy from 36% to 73.6%, and, strikingly, test accuracy kept climbing for 1,400 steps after training accuracy had already hit 100% (Can a single training example unlock mathematical reasoning?). That 'post-saturation generalization' is the closest thing here to your 'exponential' intuition: improvement that continues long after the model has nothing left to memorize from the example itself. A parallel result reaches the same place by a different road: fine-tuning on critiques of one problem's varied solutions unlocks comparable reasoning, with no reinforcement learning needed (Can a single problem unlock reasoning through solution critique?).
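One way to make 'post-saturation generalization' measurable is to ask how long test accuracy keeps setting new bests after training accuracy saturates. The following is a minimal sketch under stated assumptions, not the paper's procedure: the log format, the function name `post_saturation_span`, and the saturation threshold are all hypothetical, and the log is assumed to be sorted by step.

```python
from typing import List, Optional, Tuple

# Hypothetical log format: (step, train_accuracy, test_accuracy), sorted by step.
LogEntry = Tuple[int, float, float]


def post_saturation_span(log: List[LogEntry], saturated: float = 1.0) -> Optional[int]:
    """Steps between train saturation and the last new best in test accuracy.

    Returns None if training accuracy never reaches `saturated`.
    Returns 0 if test accuracy never improves after saturation.
    """
    # First step at which training accuracy reaches the saturation threshold.
    sat_step = next((step for step, train, _ in log if train >= saturated), None)
    if sat_step is None:
        return None

    # Best test accuracy seen up to and including the saturation step.
    best = max(
        (test for step, _, test in log if step <= sat_step),
        default=float("-inf"),
    )

    # Walk the remaining entries and remember the last step that set a new best.
    last_improving_step = sat_step
    for step, _, test in log:
        if step > sat_step and test > best:
            best = test
            last_improving_step = step

    return last_improving_step - sat_step
```

Given a log in which training accuracy saturates at some step and test accuracy last sets a new best 1,400 steps later, this would return 1400. A span of 0 would suggest that the example was simply memorized.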

But read the two together and the word that fits better than 'seed' is *activation*. Both papers frame the single example as a trigger for capability the base model already had, not as material the model learns from. The example isn't teaching multiplication — it's flipping a switch on a circuit that was already wired. This matters because it predicts where the trick stops working: you can only activate what's latent. A model that genuinely couldn't do the math wouldn't generalize for 1,400 steps from one problem.

The corpus then gets skeptical in a useful way. Several notes argue that big benchmark jumps can be mirages. RLVR's gains on contaminated benchmarks turn out to be largely memorization — one model reconstructed 54.6% of MATH-500 from partial prompts yet scored 0.0% on a clean post-release benchmark — and notably, *only correct rewards* improved clean performance, while random or inverted rewards did nothing or hurt (Does RLVR success on math benchmarks reflect genuine reasoning improvement?). So the 'single correct example' result survives this critique on one count (correctness of the signal does matter) but should make you ask whether the headline number reflects reasoning or leakage. Relatedly, RLVR can improve the *coherence* of reasoning traces — fewer logical jumps between adjacent steps — without making the overall proof valid (Does RLVR actually improve mathematical reasoning or just coherence?), and supervised fine-tuning can raise answer accuracy while *degrading* the quality of inferential steps by 38.9%, producing right answers through post-hoc rationalization (Does supervised fine-tuning improve reasoning or just answers?).
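The partial-prompt reconstruction check behind the contamination finding can be sketched roughly as follows. This is an illustration under assumptions, not the cited study's protocol: the `complete` callable stands in for whatever model interface is available, the 60% prefix fraction is an arbitrary default, and verbatim prefix matching is a deliberately strict stand-in for whatever similarity metric a real study would use.

```python
from typing import Callable, Iterable


def reconstruction_rate(
    problems: Iterable[str],
    complete: Callable[[str], str],  # hypothetical model interface: prefix -> continuation
    prefix_fraction: float = 0.6,    # assumed cut point, not taken from the source
) -> float:
    """Fraction of problems whose hidden tail the model reproduces verbatim."""
    hits = 0
    total = 0
    for text in problems:
        cut = int(len(text) * prefix_fraction)
        prefix, tail = text[:cut], text[cut:].strip()
        if not tail:
            continue  # nothing left to reconstruct, so skip this problem
        total += 1
        # Count a hit only if the continuation begins with the exact hidden tail.
        if complete(prefix).strip().startswith(tail):
            hits += 1
    return hits / total if total else 0.0
```

A high rate on an older benchmark combined with a near-zero score on a benchmark released after the model's training cutoff is the pattern the note describes as contamination.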

Here's the part you didn't know you wanted to know: a cluster of findings suggests reasoning training often teaches *form*, not inference. Logically invalid chain-of-thought exemplars perform nearly as well as valid ones (Does logical validity actually drive chain-of-thought gains?), and deliberately corrupted traces teach as well as correct ones — sometimes generalizing *better* out of distribution (Do reasoning traces need to be semantically correct?). Traces seem to act as computational scaffolding rather than meaningful steps. This sits in productive tension with the single-example result, which insists the example must be *correct*. The reconciliation: when you supply many traces, the model latches onto their shape; when you supply one activating signal under verifiable reward, correctness is what tells the model which latent behavior to switch on.

So the honest answer to 'exponential improvement from one example' is: real, large, and continuing past saturation — but it's the discharge of stored potential, not compounding growth from scratch, and it's bounded by what the base model already contains. If you want to go further on those limits, the corpus also notes that more reasoning isn't free — accuracy peaks then *declines* past a thinking-token threshold (Does more thinking time always improve reasoning accuracy?), and that knowledge and reasoning live in different network layers, which is why activating reasoning helps math but can quietly damage knowledge-heavy domains (Why does reasoning training help math but hurt medical tasks?).
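If accuracy really does peak and then decline with more thinking tokens, the practical move is to sweep budgets and pick the peak rather than the maximum budget. A minimal sketch, assuming a sweep stored as a mapping from token budget to accuracy; the function name `best_thinking_budget` is hypothetical, and the two data points are the only ones the note on thinking time reports, with "~16K" rounded to 16,000 for illustration.

```python
from typing import Dict, Tuple


def best_thinking_budget(sweep: Dict[int, float]) -> Tuple[int, float]:
    """Return (token_budget, accuracy) for the most accurate budget in a sweep.

    Ties go to the smaller budget, since extra thinking tokens cost more.
    """
    if not sweep:
        raise ValueError("sweep is empty")
    # Sort key: highest accuracy first, then smallest budget.
    budget = min(sweep, key=lambda b: (-sweep[b], b))
    return budget, sweep[budget]


# The two points reported in the note: ~1,100 tokens at 87.3%, ~16K tokens at 70.3%.
reported = {1_100: 87.3, 16_000: 70.3}
budget, accuracy = best_thinking_budget(reported)
```

With only two reported points this cannot locate the true peak; a real sweep would need intermediate budgets, which the source does not give.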

Sources (9 notes)

Can a single training example unlock mathematical reasoning?

A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.

Can a single problem unlock reasoning through solution critique?

Critique Fine-Tuning achieves reasoning activation comparable to RLVR using only one problem and teacher-generated critiques of varied solutions, with no reinforcement learning. This demonstrates that exposure to correct versus incorrect reasoning on a specific problem is the sufficient activation signal.

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.

Does RLVR actually improve mathematical reasoning or just coherence?

RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Why does reasoning training help math but hurt medical tasks?

Two-phase inference model shows knowledge retrieval operates in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains (2.58 match)
Reinforcement Learning for Reasoning in Large Language Models with One Training Example (2.52 match)
LLMs can implicitly learn from mistakes in-context (2.51 match)
Spurious Rewards: Rethinking Training Signals in RLVR (2.51 match)
The Invisible Leash: Why RLVR May Not Escape Its Origin (2.50 match)
Eliciting Reasoning in Language Models with Cognitive Tools (2.48 match)
Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces! (1.75 match)
Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens (1.74 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mathematical reasoning researcher probing whether single-example seeding genuinely unlocks exponential capability gains or merely activates latent capacity that decays under scrutiny.

What a curated library found — and when (dated claims, not current truth): These findings span July 2023 to October 2025.
• One correct training example in RLVR lifted math accuracy from 36% to 73.6%, with test accuracy climbing for 1,400 steps *after* training saturated at 100% — suggesting post-saturation generalization, not memorization (2025-04, arXiv:2504.20571).
• Critique fine-tuning on a single problem's varied solutions unlocks comparable reasoning gains without reinforcement learning (2025-06, arXiv:2506.03295).
• RLVR on contaminated benchmarks is primarily memorization: one model reconstructed 54.6% of MATH-500 from partial prompts yet scored 0% on clean post-release benchmarks; only *correct* rewards improved clean performance (2025-07, arXiv:2507.10532).
• RLVR improves trace *coherence* (fewer logical jumps) without guaranteeing trace *validity* (2025-10, arXiv:2510.18176).
• Logically invalid chain-of-thought exemplars and deliberately corrupted traces perform nearly as well as valid ones, sometimes generalizing better out-of-distribution (2023-07 & 2025-05, arXiv:2307.10573, arXiv:2505.13775).
• Accuracy peaks then declines beyond a critical thinking-token threshold (2025-06, arXiv:2506.04210).

Anchor papers (verify; mind their dates):
• arXiv:2504.20571 (Apr 2025) — RLVR with one training example
• arXiv:2506.03295 (Jun 2025) — Critique fine-tuning on one problem
• arXiv:2507.10532 (Jul 2025) — Contamination and memorization in RLVR
• arXiv:2510.18176 (Oct 2025) — RLVR trace validity vs. coherence

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For the single-example activation result, determine whether newer model scales, improved reward signals (DPO, grounded reward models, outcome supervision), or chain-of-thought compression have since *extended* the regime where one example suffices, or *narrowed* it (e.g., larger models need more diverse signals). Separately, probe whether the memorization critique (arXiv:2507.10532) definitively rules out the 1,400-step generalization as genuine reasoning, or whether "post-saturation" gains on *held-out* test splits remain robust. State where each constraint still holds.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last 6 months. Does any recent work show that single-example activation fails on harder domains, or that reasoning-layer steering requires multiple diverse signals?
(3) **Propose 2 research questions that ASSUME the regime may have moved:** (a) If one *correct* example activates reasoning, what is the minimum *diversity* of exemplar types (across problem families, difficulty, solution strategies) needed for that activation to generalize beyond the training domain? (b) Can activation steering + early-stopping (before the thinking-token cliff) preserve reasoning gains while avoiding the accuracy-decline penalty?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
