SYNTHESIS NOTE

Why do larger models learn rare tasks better?

Does model size enable learning of infrequent, complex tasks through greater representational capacity, or through some other mechanism? Understanding this matters for deciding whether scaling or data design is the more efficient lever.

Synthesis note · 2026-06-03 · sourced from Reinforcement Learning

The standard story for why larger models acquire capabilities smaller ones lack is expressivity: bigger models can represent functions smaller ones cannot. This paper argues the real cause is usually different. A phenomenological argument shows that power-law scaling already implies a regime in which a smaller model fails to learn part of a data mixture that a larger model learns successfully, even with infinite training data. The gap is therefore not about whether a solution is representable.

The mechanism is reduced interference, traced through a controlled synthetic mixture and validated by pretraining OLMo models (4M–4B) on tasks of varying frequency and complexity. Smaller models face a data-induced competition over neurons: they allocate resources to high-frequency, low-complexity tasks and learn solutions that perform poorly on rare, complex tasks, even when an expressible solution exists. A larger model circumvents this: once enough capacity is allocated to common tasks, the gradient updates for those tasks become weak, so they stop overwriting the rare-task features that accumulate slowly over training.
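The claim that common-task gradients weaken once those tasks are fit can be illustrated with a small toy calculation. This is only a sketch under stated assumptions, not the paper's experimental setup: it assumes a squared-error loss with unit-norm features, so each task's contribution to the expected gradient has magnitude equal to its sampling frequency times its absolute residual. The function name `gradient_shares` and all numbers are hypothetical, and the toy does not model capacity or neuron competition itself.

```python
def gradient_shares(freqs, residuals):
    """Fraction of the expected gradient magnitude contributed by each task.

    Assumption (illustrative only): squared-error loss with unit-norm
    features, so a task's gradient magnitude is freq * |residual|.
    """
    magnitudes = [f * abs(r) for f, r in zip(freqs, residuals)]
    total = sum(magnitudes)
    if total == 0.0:
        # Every task is fit exactly, so there is no gradient to share out.
        return [0.0 for _ in magnitudes]
    return [m / total for m in magnitudes]


# Hypothetical mixture: one common task (99%) and one rare task (1%).
freqs = [0.99, 0.01]

# Early in training both tasks are unlearned, so the common task dominates.
early = gradient_shares(freqs, residuals=[1.0, 1.0])

# Once the common task is nearly fit, its residual (and gradient) is tiny,
# and the rare task supplies most of the remaining update signal.
late = gradient_shares(freqs, residuals=[0.001, 1.0])

print("early shares (common, rare):", early)
print("late shares  (common, rare):", late)
```

In this toy, the rare task's share of the update grows only after the common task's residual shrinks, which mirrors the description above of rare-task features surviving once common-task updates become weak.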

The key implication overturns the "just scale parameters" reflex: understanding scaling requires looking beyond expressivity to learning dynamics, that is, how task frequency and complexity interact with capacity. It also suggests a cheaper lever: intentional data-mixture design. Simply up-weighting the frequency of a target rare task may teach it more efficiently than scaling model size. This connects to the note "What limits reasoning capability beyond math and code?": both relocate the source of capability from model size toward data composition.
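Up-weighting a rare task in a data mixture can be sketched as a simple reweight-and-renormalize step. The helper below is a minimal illustration, not a method taken from the paper: the function name `upweight_task`, the task names, and the factor of 10 are all hypothetical, and it assumes the mixture is given as sampling probabilities per task.

```python
def upweight_task(mixture, task, factor):
    """Return a new mixture with one task's sampling weight multiplied.

    `mixture` maps task name -> sampling probability (hypothetical format).
    Weights are renormalized so the result sums to 1.
    """
    if task not in mixture:
        raise KeyError(f"unknown task: {task}")
    if factor <= 0:
        raise ValueError("factor must be positive")
    scaled = {
        name: weight * (factor if name == task else 1.0)
        for name, weight in mixture.items()
    }
    total = sum(scaled.values())
    return {name: weight / total for name, weight in scaled.items()}


# Hypothetical mixture: the rare task is seen in 1% of samples.
base = {"common": 0.99, "rare": 0.01}

# Multiply the rare task's weight by 10, then renormalize.
boosted = upweight_task(base, "rare", factor=10.0)
print(boosted)
```

Because of the renormalization, a 10x factor raises the rare task's share to 0.1 / 1.09, slightly less than ten times the original 1%. Whether a given factor is enough to teach the task is an empirical question that this sketch does not address.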

Original note title

larger models learn rare tasks through reduced interference not greater expressivity — capacity weakens common-task gradients so they stop overwriting rare-task features