SYNTHESIS NOTE

Does depth matter more than width for tiny language models?

Explores whether deep-and-thin architectures outperform wide-and-shallow ones at sub-billion scales, and why this might contradict larger-model scaling laws.

Synthesis note · 2026-05-03 · sourced from Mobile

Kaplan et al.'s scaling laws report that, at a fixed parameter count, performance depends only weakly on model shape, so the split between depth and width is usually treated as a secondary choice. MobileLLM demonstrates that this guidance breaks at the sub-billion-parameter scale relevant for on-device deployment. A deep-and-thin model structure outperforms balanced or wide-and-shallow alternatives; coupled with embedding sharing and grouped-query attention, it produces 2.7 percent and 4.3 percent accuracy boosts over preceding 125M and 350M state-of-the-art models respectively. The reason offered is that depth captures abstract concepts by composing simpler features into hierarchical representations through more layers. At small scale the model has fewer raw parameters to spend, so making each one work harder through compositional depth pays back more than spreading them across wider layers.
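To make the depth-versus-width trade concrete, the sketch below holds a parameter budget roughly fixed and shows how much width each depth can afford. It assumes the standard approximation of about 12 · n_layers · d_model² non-embedding parameters for a transformer with a 4x feed-forward expansion. The 120M budget, the layer counts, and the function names are illustrative assumptions, not values taken from the source.

```python
import math


def non_embedding_params(n_layers: int, d_model: int, ffn_mult: int = 4) -> int:
    """Approximate non-embedding parameter count of a plain transformer stack.

    Assumes 4 * d^2 for the attention projections (Q, K, V, output) and
    2 * d * (ffn_mult * d) for a two-matrix feed-forward block. Biases,
    layer norms, and embeddings are ignored.
    """
    attention = 4 * d_model * d_model
    feed_forward = 2 * d_model * (ffn_mult * d_model)
    return n_layers * (attention + feed_forward)


def width_for_budget(budget: int, n_layers: int, multiple: int = 64) -> int:
    """Width (rounded to a multiple) that roughly fits the budget at a given depth.

    Inverts the 12 * n_layers * d^2 approximation, so it assumes ffn_mult = 4.
    """
    raw_width = math.sqrt(budget / (12 * n_layers))
    return max(multiple, round(raw_width / multiple) * multiple)


BUDGET = 120_000_000  # hypothetical non-embedding budget, for illustration only

# Hypothetical shapes: same budget, different depth.
for name, n_layers in [("wide-and-shallow", 6), ("balanced", 12), ("deep-and-thin", 30)]:
    d_model = width_for_budget(BUDGET, n_layers)
    total = non_embedding_params(n_layers, d_model)
    print(f"{name:>17}: layers={n_layers:2d} d_model={d_model:4d} params={total:,}")
```

All three shapes land within a few percent of the same budget, so any accuracy gap between them comes from how the parameters are arranged rather than how many there are. That is the comparison the deep-and-thin claim rests on.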

This matters because it shows that scaling laws are regime-dependent rather than universal. The Kaplan results were derived from larger models where width and depth are both abundant; at the small scale where mobile deployment lives, the trade-offs reverse. The implication is that the architectural recipe for on-device LLMs is genuinely different from the recipe for cloud-scale LLMs: not just smaller, but structurally different. The related note "Can architecture choices improve inference efficiency without sacrificing accuracy?" makes the same point at the inference-economics layer: vanilla scaling laws say nothing about deployment regimes.

The deeper lesson is methodological: scaling laws should always be qualified by the regime in which they were derived, and recommendations for sub-billion-parameter design should not be extrapolated downward from billion-plus-parameter studies. The right architecture for a 350M parameter model is not a scaled-down version of a 70B parameter model; it is a deep-and-thin model derived from the constraints of the small-scale regime. The related note "Can parallel architectures solve inherently sequential problems?" gives a complementary reason to favor depth: some computations require sequential composition that width cannot supply at any scale.
