SYNTHESIS NOTE

Why do larger models learn rare tasks better?

Does model size enable learning of infrequent, complex tasks through greater representational capacity, or through some other mechanism? Understanding this matters for deciding whether scaling or data design is the more efficient lever.

Synthesis note · 2026-06-03 · sourced from Reinforcement Learning

The standard explanation for why larger models acquire capabilities that smaller ones lack is expressivity: bigger models can represent functions that smaller ones cannot. The paper summarized here argues that the real cause is usually different. A phenomenological argument shows that power-law scaling already implies a regime in which a smaller model fails to learn part of a data mixture that a larger model learns, even with infinite training data. The gap is therefore not about whether a solution is representable.

The proposed mechanism is reduced interference, traced through a controlled synthetic mixture and validated by pretraining OLMo models (4M–4B parameters) on tasks of varying frequency and complexity. Smaller models face a data-induced competition over neurons: they allocate resources to high-frequency, low-complexity tasks and learn solutions that perform poorly on rare, complex tasks, even when an expressible solution exists. A larger model avoids this because, once enough capacity is allocated to the common tasks, the gradient updates for those tasks become weak and stop overwriting the rare-task features that accumulate slowly over training.
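The interference mechanism can be illustrated with a toy sketch. It is not the paper's experiment: the two linear-regression tasks, the 0.9/0.1 frequencies, and the helper names task_gradient, mixture_gradient and cosine are all assumptions chosen for illustration. With one shared weight, the common and rare tasks pull in opposite directions and the frequency-weighted update follows the common task. With a separate weight available for each task, the gradients are orthogonal, and once the common task is fit its gradient vanishes, leaving only the rare-task signal.

```python
import numpy as np

def task_gradient(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y) ** 2) with respect to w."""
    residual = X @ w - y
    return X.T @ residual / len(y)

def mixture_gradient(w, tasks, freqs):
    """Frequency-weighted sum of the per-task gradients."""
    return sum(freqs[name] * task_gradient(w, X, y) for name, (X, y) in tasks.items())

def cosine(a, b):
    """Cosine similarity between two non-zero gradient vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical task frequencies in the training mixture.
freqs = {"common": 0.9, "rare": 0.1}

# "Small" model: a single shared weight must serve both tasks,
# and the two tasks want opposite values for it.
small_tasks = {
    "common": (np.array([[1.0]]), np.array([1.0])),
    "rare": (np.array([[1.0]]), np.array([-1.0])),
}
w_small = np.zeros(1)
small_cos = cosine(task_gradient(w_small, *small_tasks["common"]),
                   task_gradient(w_small, *small_tasks["rare"]))
small_update = mixture_gradient(w_small, small_tasks, freqs)

# "Large" model: each task reads its own input feature, so each can use its own weight.
large_tasks = {
    "common": (np.array([[1.0, 0.0]]), np.array([1.0])),
    "rare": (np.array([[0.0, 1.0]]), np.array([-1.0])),
}
w_large = np.zeros(2)
large_cos = cosine(task_gradient(w_large, *large_tasks["common"]),
                   task_gradient(w_large, *large_tasks["rare"]))

# Once the common task is fit (w[0] = 1), its gradient is zero and the
# mixture gradient contains only the rare-task term.
w_fit = np.array([1.0, 0.0])
update_after_fit = mixture_gradient(w_fit, large_tasks, freqs)

print("small model gradient cosine:", small_cos)      # -1.0: direct conflict
print("small model mixture gradient:", small_update)  # negative, so a descent step moves w toward the common target
print("large model gradient cosine:", large_cos)      # 0.0: no overlap
print("large model gradient after common task is fit:", update_after_fit)
```

The sketch only shows the direction of the argument. It says nothing about how neurons are actually allocated in a trained transformer.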

The key implication challenges the "just scale parameters" reflex: understanding scaling requires looking beyond expressivity to learning dynamics, specifically how task frequency and complexity interact with capacity. It also suggests a cheaper lever, intentional data-mixture design: simply up-weighting the frequency of a target rare task may teach it more efficiently than scaling model size. This connects to the note "What limits reasoning capability beyond math and code?": both relocate capability from model size toward data composition.
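Up-weighting a rare task amounts to rescaling its sampling weight and renormalizing the mixture. A minimal sketch, assuming the mixture is a plain dictionary of sampling probabilities; the task names, the weights, the factor of 10, and the upweight helper are hypothetical and do not come from the paper.

```python
def upweight(mixture, task, factor):
    """Return a new mixture with `task` scaled by `factor` and all weights renormalized."""
    if task not in mixture:
        raise KeyError(f"unknown task: {task}")
    if factor <= 0:
        raise ValueError("factor must be positive")
    scaled = dict(mixture)          # copy so the original mixture is left untouched
    scaled[task] *= factor
    total = sum(scaled.values())
    return {name: weight / total for name, weight in scaled.items()}

# Hypothetical pretraining mixture: two common tasks and one rare task.
mixture = {"common_a": 0.60, "common_b": 0.39, "rare": 0.01}
new_mixture = upweight(mixture, "rare", factor=10)

for name in mixture:
    print(f"{name}: {mixture[name]:.4f} -> {new_mixture[name]:.4f}")
```

Because the weights are renormalized, the rare task's share rises from 1% to roughly 9% rather than exactly 10%, and the common tasks shrink proportionally. Whether that trade is cheaper than a larger model is the empirical question the note raises.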

Inquiring lines that read this note 6

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

Does pretraining data size matter less than base model scale for finetuning?

How does example difficulty affect learning efficiency in language models?

How do task frequency and complexity interact with model capacity during training?

What determines success in training models on multiple tasks?

How can AI systems learn from failures without cascading errors?

Do rare cultural concepts fail predictably as model scale increases?

What makes weaker teacher models effective for stronger student training?

How does student capacity limit what it can learn from teachers?

Related concepts in this collection 3

What limits reasoning capability beyond math and code? Can scaling reasoning to open-ended domains like economics and social sciences be solved by better training methods, or does the real bottleneck lie elsewhere? This explores what actually constrains broader reasoning.
both shift the lever from model size to data composition
Why aren't bigger models better for generating diverse outputs? When generating many unique outputs within a fixed budget, does model size actually matter? Exploring whether the conventional wisdom of using larger models holds for diversity-focused tasks.
another non-monotonic, capacity-vs-task account that resists "bigger is simply better"
Do base models already contain hidden reasoning ability? Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
reframes emergence as access/interference rather than absent capability

Original note title

larger models learn rare tasks through reduced interference not greater expressivity — capacity weakens common-task gradients so they stop overwriting rare-task features
