Can a single training example unlock mathematical reasoning?

Explores whether one example is enough to dramatically improve math problem-solving in language models, and whether learning continues after perfect memorization.

Synthesis note · 2026-02-22 · sourced from RLVR

In reinforcement learning with verifiable rewards (RLVR), a single training example is enough to produce a dramatic improvement in mathematical reasoning: MATH500 accuracy for Qwen2.5-Math-1.5B jumps from 36.0% to 73.6%. That matches the result of training on the 1.2k-example DeepScaleR subset, and two examples slightly exceed both (74.8%). The pattern replicates across model families (Qwen, Llama, DeepSeek), RL algorithms (GRPO, PPO), and different math examples.
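
To make the setup concrete, here is a minimal sketch of what a verifiable reward and a GRPO-style group-relative advantage could look like for one training problem. The function names, the exact-match answer check, and the use of the population standard deviation are assumptions for illustration, not the implementation behind the reported numbers; the sampled answers are toy strings.

```python
import statistics


def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    """Binary outcome reward: 1.0 if the final answer matches exactly, else 0.0."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one group of rollouts for the same prompt.

    Assumption: population standard deviation plus a small eps for stability.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rollouts for the single training problem (toy values).
reference = "12.8"
sampled_answers = ["12.8", "13", "12.8", "7"]
rewards = [verifiable_reward(a, reference) for a in sampled_answers]
advantages = group_relative_advantages(rewards)

# Once every rollout is correct, all rewards are equal and the advantages vanish.
saturated = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

In this sketch correct rollouts get positive advantages and incorrect ones negative. When the group is uniformly correct the advantages are all zero, so under this formulation the reward term alone gives no gradient on a fully solved example.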

The most striking phenomenon is post-saturation generalization: training accuracy on the single example rapidly reaches 100%, yet test accuracy keeps improving for roughly 1,400 more training steps. The model has perfectly memorized its one example but keeps getting better at unseen problems. Even after it eventually overfits, when its outputs on the training example become "incomprehensible multilingual gibberish mixed with correct solutions", test performance stays strong and test outputs stay interpretable.
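
One way to measure this effect from logged curves is to find the first step at which training accuracy hits 100% and compare it with the step at which test accuracy peaks. The sketch below assumes accuracy is logged at discrete steps; the function name is hypothetical and the curves are synthetic toy values invented for illustration, not results from the source.

```python
def post_saturation_window(steps, train_acc, test_acc):
    """Return (saturation_step, best_test_step, gap), or None if never saturated."""
    # First logged step where the training example is solved perfectly.
    saturation_step = next(
        (s for s, acc in zip(steps, train_acc) if acc >= 1.0), None
    )
    if saturation_step is None:
        return None
    # Step with the highest test accuracy (earliest one on ties).
    best_index = max(range(len(steps)), key=lambda i: test_acc[i])
    best_test_step = steps[best_index]
    return saturation_step, best_test_step, best_test_step - saturation_step


# Synthetic toy curves (illustrative only).
steps = [0, 100, 200, 400, 800, 1600]
train_acc = [0.20, 1.00, 1.00, 1.00, 1.00, 1.00]
test_acc = [0.30, 0.40, 0.45, 0.50, 0.55, 0.53]

result = post_saturation_window(steps, train_acc, test_acc)
```

A positive gap means test accuracy kept improving after the training example was already solved, which is the signature described above.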

This finding is the extreme case of the question "Do base models already contain hidden reasoning ability?". One example does not teach reasoning; it provides the minimal activation signal for the RL optimization process to reshape the sampling distribution. The entropy loss term encourages diverse exploration of outputs, while the single training example acts as "implicit regularization": it penalizes explorations that fail on the learned example, and so serves as a check on that exploration.
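
The interplay between the two terms can be sketched as a loss that subtracts a weighted entropy bonus from the policy-gradient loss. This is a generic entropy-regularized objective, not necessarily the exact formulation used in the work summarized here; the coefficient value and function names are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_loss(policy_loss, probs, entropy_coef=0.001):
    """Total loss = policy-gradient loss minus a weighted entropy bonus.

    entropy_coef is a hypothetical value chosen for illustration.
    """
    return policy_loss - entropy_coef * entropy(probs)


peaked = softmax([5.0, 0.0, 0.0, 0.0])  # confident, low-entropy distribution
flat = softmax([0.0, 0.0, 0.0, 0.0])    # uniform, high-entropy distribution

loss_peaked = regularized_loss(0.0, peaked)
loss_flat = regularized_loss(0.0, flat)
```

With the policy term held equal, the flatter distribution gets the lower loss, so the entropy term pushes toward more varied samples, while the reward on the single example penalizes the samples that get it wrong.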

Cross-domain generalization also emerges: training on a single example from one mathematical subdomain improves performance on problems from other subdomains. Self-reflection increases spontaneously during training: words like "rethink," "recheck," and "recalculate" appear more often, suggesting the model develops metacognitive behaviors from a single data point.
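
A simple way to track the self-reflection signal is to count how many sampled responses contain one of these words. The sketch below matches word stems so that inflected forms such as "rechecking" also count; the stem list, function name, and sample responses are assumptions for illustration, not the measurement procedure used in the source.

```python
import re

# Stems for the reflection words named above (inflected forms also match).
REFLECTION_STEMS = ("rethink", "recheck", "recalculat")
REFLECTION_PATTERN = re.compile(
    r"\b(?:" + "|".join(REFLECTION_STEMS) + r")\w*", re.IGNORECASE
)


def reflection_frequency(responses):
    """Fraction of responses containing at least one reflection word."""
    if not responses:
        return 0.0
    hits = sum(1 for text in responses if REFLECTION_PATTERN.search(text))
    return hits / len(responses)


# Toy responses standing in for samples from an early and a late checkpoint.
early = ["The answer is 4.", "So x = 2."]
late = [
    "Let me recheck the sum. The answer is 4.",
    "Rethinking the setup, x = 2.",
    "So y = 3.",
]

early_rate = reflection_frequency(early)
late_rate = reflection_frequency(late)
```

Comparing this rate across checkpoints shows whether reflective phrasing is rising during training.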

Set against the note "Can models improve themselves on tasks without verifiable answers?", the 1-shot result pushes the minimum viable dataset even further: not 1,000 demonstrations, but one.

Inquiring lines that read this note 37

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How do training data properties shape reasoning capability development?

How do training priors constrain what context information can override?

Can prompting inject entirely new knowledge into language models?

How does example difficulty affect learning efficiency in language models?

How can process reward models supervise complex reasoning traces?

Can solution traces substitute for process-level reward signals in math reasoning?

How do neural networks separate factual knowledge from reasoning abilities?

Why do medical and mathematical tasks require fundamentally different model capabilities?

Why do continual learning scenarios trigger catastrophic forgetting and interference?

Do reasoning traces faithfully represent or merely mimic actual model reasoning?

Does logical trace coherence guarantee valid mathematical reasoning?

What properties determine whether reward signals teach genuine reasoning?

What information do numerical rewards fail to provide for reasoning tasks?

Why do benchmark improvements fail to reflect actual reasoning quality?

Does reinforcement learning teach reasoning or just when to reason?

Why does finetuning cause catastrophic forgetting of model capabilities?

How tight should a textual learning rate be before it prevents skill escape?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

Why does the order of training examples matter for what models learn?

How does memorization interact with learning and generalization?

What is the theoretical capacity limit before memorization saturates?

Related concepts in this collection 4

Do base models already contain hidden reasoning ability?
Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
1-shot RLVR is the most extreme confirmation

Can models improve themselves on tasks without verifiable answers?
Most self-improvement methods require verifiable correctness signals like math or code. Can models improve on open-ended instruction tasks where right answers aren't automatically checkable? And what minimal training is needed to unlock this?
1-shot pushes the frontier far beyond 1000

Does RL teach reasoning or just when to use it?
Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
post-saturation generalization shows the learning continues beyond the data

Does reflection in reasoning models actually correct errors?
When reasoning models reflect on their answers, do they genuinely fix mistakes, or merely confirm what they already decided? Understanding this matters for designing better training and inference strategies.
1-shot RLVR spontaneously increases self-reflection frequency
