SYNTHESIS NOTE

Can a single training example unlock mathematical reasoning?

Explores whether one example is enough to dramatically improve math problem-solving in language models, and whether learning continues after perfect memorization.

Synthesis note · 2026-02-22 · sourced from RLVR

In RLVR (reinforcement learning with verifiable rewards), a single training example is enough to produce a dramatic improvement in mathematical reasoning: MATH500 accuracy for Qwen2.5-Math-1.5B jumps from 36.0% to 73.6%, matching the result of training on the 1.2k-example DeepScaleR subset. Two examples slightly exceed both, at 74.8%. The pattern replicates across model families (Qwen, Llama, DeepSeek), RL algorithms (GRPO, PPO), and different math examples.

The most striking phenomenon is post-saturation generalization: training accuracy on the single example rapidly reaches 100%, yet test accuracy continues to improve for approximately 1,400 more training steps. The model has perfectly memorized its one example but keeps getting better at unseen problems. Even after eventual overfitting, when outputs on the training example become "incomprehensible multilingual gibberish mixed with correct solutions," test performance stays strong and test outputs remain interpretable.
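
One way to make post-saturation generalization measurable is to find the step at which training accuracy first saturates and then compare test accuracy at that step with the best test accuracy reached afterwards. The sketch below assumes a hypothetical log format of (step, train_acc, test_acc) tuples; the function name and the numbers in the usage example are invented for illustration and are not results from the source.

```python
def post_saturation_gain(log, threshold=1.0):
    """Summarize test improvement after training accuracy saturates.

    `log` is a hypothetical list of (step, train_acc, test_acc) tuples,
    sorted by step. Returns None if training accuracy never reaches
    `threshold`.
    """
    # Index of the first logged step where training accuracy saturates.
    sat = next((i for i, (_, train, _) in enumerate(log) if train >= threshold), None)
    if sat is None:
        return None

    sat_step, _, sat_test = log[sat]
    # Best test accuracy at or after the saturation step.
    best_step, _, best_test = max(log[sat:], key=lambda row: row[2])

    return {
        "saturation_step": sat_step,
        "best_step": best_step,
        "steps_after_saturation": best_step - sat_step,
        "test_gain": best_test - sat_test,
    }


# Toy numbers, made up purely to exercise the function.
toy_log = [(0, 0.2, 0.30), (100, 1.0, 0.50), (500, 1.0, 0.60), (1000, 1.0, 0.58)]
summary = post_saturation_gain(toy_log)
print(summary)
```

A large steps_after_saturation paired with a positive test_gain is the signature described above: the training example is already solved, yet held-out accuracy keeps rising.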

This finding is the extreme case of the claim explored in "Do base models already contain hidden reasoning ability?". One example is not teaching reasoning; it provides the minimal activation signal for the RL optimization process to reshape the sampling distribution. The entropy loss component encourages diverse output exploration, while the single training example acts as "implicit regularization": it punishes explorations that fail on the learned data, thereby providing verification for exploration.
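
The interplay between the entropy term and the single example can be sketched with a GRPO-style group-normalized advantage. This is a simplified illustration under stated assumptions, not the source's implementation: rewards are binary (1 for a verified-correct rollout, 0 otherwise), clipping and any KL penalty are omitted, and the names group_advantages, surrogate_loss, and the coefficient beta are hypothetical.

```python
import math


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def surrogate_loss(rewards, logprobs, entropies, beta=0.001):
    """Policy-gradient term minus an entropy bonus (simplified, no clipping).

    rewards:   one binary reward per rollout
    logprobs:  summed log-probability of each rollout under the policy
    entropies: mean token entropy of each rollout
    beta:      assumed entropy coefficient (hypothetical value)
    """
    advs = group_advantages(rewards)
    pg_term = -sum(a * lp for a, lp in zip(advs, logprobs)) / len(rewards)
    entropy_bonus = sum(entropies) / len(entropies)
    return pg_term - beta * entropy_bonus


# All rollouts correct: every advantage is zero, so only the entropy
# bonus contributes to the loss and pushes toward more diverse outputs.
print(group_advantages([1, 1, 1, 1]))

# One exploratory rollout fails: it receives a negative advantage,
# which is the "punishment" that keeps exploration tied to correctness.
print(group_advantages([1, 1, 1, 0]))
```

Under these assumptions, once the model solves its one example on every rollout the reward signal carries no gradient, and the entropy bonus is what keeps the policy moving. As soon as exploration produces a wrong answer on the training example, reward variance returns and the failing rollout is pushed down.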

Cross-domain generalization also emerges: a single math example improves performance on problems from different mathematical subdomains. Self-reflection frequency increases spontaneously during training, with words like "rethink," "recheck," and "recalculate" appearing more frequently — the model develops metacognitive behaviors from a single data point.
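
A rough way to track this kind of self-reflection signal is to count the named words in sampled responses at each checkpoint. The sketch below is a minimal assumption-laden version: it matches only the three exact words mentioned above (so inflected forms such as "rechecking" are not counted), and the function names are hypothetical.

```python
import re
from collections import Counter

# The three reflection words named in the note.
REFLECTION_WORDS = ("rethink", "recheck", "recalculate")


def reflection_counts(text):
    """Count exact occurrences of each reflection word in one response."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t in REFLECTION_WORDS)
    return {word: counts[word] for word in REFLECTION_WORDS}


def reflection_rate(responses):
    """Fraction of responses containing at least one reflection word."""
    if not responses:
        return 0.0
    hits = sum(1 for r in responses if any(reflection_counts(r).values()))
    return hits / len(responses)


print(reflection_counts("Let me recheck. Recheck again and rethink."))
print(reflection_rate(["I should recalculate the sum.", "The answer is 4."]))
```

Computing reflection_rate over a fixed set of test prompts at successive checkpoints would give a curve that can be compared against training steps.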

Building on "Can models improve themselves on tasks without verifiable answers?", the 1-shot result pushes the minimum viable dataset even further: not 1,000 demonstrations, but one.

Original note title

one training example is sufficient to activate mathematical reasoning in rlvr — post-saturation generalization continues after training accuracy reaches 100 percent