Does depth matter more than width for tiny language models?
Explores whether deep-and-thin architectures outperform wide-and-shallow ones at sub-billion scales, and why this might contradict larger-model scaling laws.
Kaplan et al.'s scaling laws report that, at a fixed parameter count, loss depends only weakly on model shape such as depth versus width, which suggests that architecture matters little next to parameter count, data, and compute. MobileLLM demonstrates that this guidance breaks at the sub-billion-parameter scale relevant for on-device deployment. A deep-and-thin model structure outperforms balanced or wide-and-shallow alternatives, and the resulting MobileLLM models (which also use embedding sharing and grouped-query attention) achieve 2.7 percent and 4.3 percent accuracy boosts over the preceding 125M and 350M state-of-the-art models respectively. The reason offered is that depth captures abstract concepts by composing simpler features into hierarchical representations through more layers. At small scale the model has fewer raw parameters to spend, so making each one work harder through compositional depth pays back more than spreading them across wider layers.
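To make "deep-and-thin versus wide-and-shallow at the same budget" concrete, here is a minimal sketch that solves for the width each depth can afford under a fixed non-embedding parameter budget. It relies on the common estimate N ≈ 12 · n_layers · d_model², which assumes a feed-forward width of 4 · d_model and standard multi-head attention; real models that change the feed-forward or attention design will deviate from it. The budget, the three depths, the rounding to multiples of 64, and the function names are all illustrative assumptions, not configurations taken from MobileLLM.

```python
# Illustrative sketch only: shapes, budget, and helper names are assumptions.

def non_embedding_params(n_layers: int, d_model: int) -> int:
    """Approximate non-embedding parameter count of a standard transformer.

    Uses N ~= 12 * n_layers * d_model**2 (attention ~4*d^2 plus
    feed-forward ~8*d^2 per layer, assuming d_ff = 4 * d_model).
    """
    return 12 * n_layers * d_model ** 2


def width_for_budget(budget: int, n_layers: int, multiple_of: int = 64) -> int:
    """Largest d_model (rounded down to a multiple of `multiple_of`)
    whose estimated non-embedding parameter count fits within `budget`."""
    d = int((budget / (12 * n_layers)) ** 0.5)
    return max(multiple_of, (d // multiple_of) * multiple_of)


if __name__ == "__main__":
    budget = 120_000_000  # hypothetical non-embedding budget
    depths = {"deep-and-thin": 30, "balanced": 12, "wide-and-shallow": 6}
    for name, n_layers in depths.items():
        d_model = width_for_budget(budget, n_layers)
        n_params = non_embedding_params(n_layers, d_model)
        print(f"{name:>16}: layers={n_layers:2d} d_model={d_model:4d} "
              f"params~{n_params / 1e6:.1f}M")
```

All three shapes land within a few percent of the same budget, so any accuracy difference between them would come from how the parameters are arranged rather than how many there are. The estimate ignores embedding parameters, which take up a comparatively large share of a small model.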
This matters because it shows that scaling laws are regime-dependent rather than universal. The Kaplan results were derived from larger models where width and depth are both abundant; at the small scale where mobile deployment lives, the trade-offs reverse. The implication is that the architectural recipe for on-device LLMs is genuinely different from the recipe for cloud-scale LLMs: not just smaller, but structurally different. The related note "Can architecture choices improve inference efficiency without sacrificing accuracy?" makes the same point at the inference-economics layer: vanilla scaling laws say nothing about deployment regimes.
The deeper lesson is methodological: scaling laws should always be qualified by the regime in which they were derived, and recommendations for sub-billion-parameter design should not be extrapolated downward from billion-plus-parameter studies. The right architecture for a 350M-parameter model is not a scaled-down version of a 70B-parameter model; it is a deep-and-thin model derived from the constraints of the small-scale regime. The related note "Can parallel architectures solve inherently sequential problems?" gives a complementary reason to favor depth: some computations require sequential composition that width cannot supply at any scale.
Inquiring lines that use this note as a source (97)
This note is a source for these synthesized inquiries.
- Why do negative item weights matter more than model depth?
- What structural constraints matter more than model depth for CF?
- Why does frame-activation matter more than word-by-word composition?
- Can input augmentation and rephrasing compensate for smaller model limitations?
- How do embedding dimension limits constrain what concept models can represent?
- Do modern architectures in NLP and vision rely on dot products intentionally?
- What compression explains why syntax fits in low-dimensional subspaces?
- How does nesting optimization levels improve on traditional network depth?
- Does architectural discovery follow an empirical scaling law like neural networks?
- Can adaptive compute allocation at sub-token granularity improve cross-lingual robustness?
- Why are polysemantic features concentrated in early neural network layers?
- Do larger models develop more abstract features than smaller ones?
- Does scaling model size solve compositional generalization problems?
- Does the model learn depth-wise drift as an explicit strategy?
- How does inference compute substitution affect the training parameter scaling trade-off?
- How do sub-token and architecture-level compute optimization strategies compare?
- Why do hierarchical architectures better implement the deep research definition?
- How does weight sharing compound the advantages of deeper model designs?
- Why do scaling laws fail to predict optimal architectures at small parameter counts?
- Can sequential computation through depth solve problems that parallel width cannot?
- Why do power-law distributions make standard ML infrastructure assumptions fail?
- Does compositional generalization emerge suddenly or improve smoothly with scale?
- Can smaller models actually perform well on specific downstream tasks?
- Can lower embedding dimensions alone solve the diversity problem without attention mechanisms?
- Can neural networks implement genuine algorithms or only statistical pattern matching?
- How do you measure the depth of political representation inside a language model?
- Why do large language models still have systematic blind spots with complex structures?
- Which linguistic abilities are learnable from human-sized data exposure?
- Why does depth outperform width for sub-billion parameter models?
- What mobile hardware constraints force the sub-billion parameter regime?
- How do conditional scaling laws incorporate hardware into architecture choices?
- How does adjacent layer sharing differ from non-adjacent weight reuse?
- Can smaller specialist models outperform large generalist models on domain tasks?
- Why does adjusted compression performance degrade as models scale larger?
- Does bidirectional attention improve language models as universal encoders?
- Do language models encode deep syntactic structure or only surface-level patterns?
- Can encoder models match human conceptual structure better than larger language models?
- How does the Ladder of Scales approach reduce search costs across model sizes?
- Can depth scaling and breadth scaling unlock independent capability axes?
- Why does exploration quality matter more than learner network depth?
- How much do structural inductive biases matter compared to training data volume?
- What scaling laws govern autonomous architecture discovery in AI systems?
- Can transformers reason beyond fixed architectural depth limits?
- Why do smaller models favor code formats while larger models prefer natural language?
- Why do language models reproduce human EPA structure despite different architecture?
- Can bounded-depth transformers solve inherently sequential problems?
- Do small models show different parameter efficiency patterns than large models?
- How should tiny language models be architected differently than large ones?
- Which architectural choices matter most when a model must fit one billion parameters?
- How do parallel sampling and sequential depth compare as scaling dimensions?
- Why might diverse smaller models with routing beat one giant model?
- Can scaling alone create compositional generalization without explicit binding mechanisms?
- What inductive biases help networks segregate entities from raw inputs?
- Does directional knowledge failure indicate shallow pattern matching over deep representation?
- Why does weight sparsity reduce superposition and force disentangled representations?
- What sparse high-rank patterns does the deep tower fail to capture?
- What makes a small surgical wide component sufficient with a capable deep model?
- How does training distribution shape what language models understand best?
- What makes modernized N-gram embeddings composable with transformer architectures?
- How do corpus statistics shape the abstraction hierarchy in language model representations?
- Can small transformers trained on similarity maps replace dense retrievers entirely?
- Why does scaling data and model size improve compositional generalization?
- Why should deep learning theory prioritize average-case over worst-case analysis?
- Do pretrained language models carry reusable computational scaffolding for length handling?
- Which hyperparameter theories best explain universal behaviors across neural networks?
- What solvable idealized settings reveal fundamental phenomena in realistic deep learning?
- Can deterministic recurrent depth achieve the computational benefits of stochastic reasoning?
- Why do hybrid memory and compute sparsity outperform pure parameter scaling?
- What are the scaling law differences between vision and language learning?
- Does sparsity enforce compositional structure or merely amplify existing modularity?
- Why do language models plateau at constraint satisfaction regardless of scale?
- Can width-scaling replace depth-scaling on inherently sequential problems?
- How should benchmark design account for task-dependent sparsity tolerance differences?
- How do models develop dense representations for familiar training data?
- Why do vision and language have different optimal scaling curves?
- What does a human-parseable framework for deep learning look like?
- Do generic kernel-decay assumptions alone explain coarse-to-fine spectral ordering?
- Can latent recurrence overcome the trainability costs of depth?
- What architectural alternatives can capture compositional structure beyond pooled cosine?
- How does modality-specific sparsity enable capacity flexibility that dense models cannot provide?
- Why should scaling laws be understood as properties of data distribution rather than training in general?
- What geometric structure do language models actually use during inference?
- Do KANs maintain their advantages in deep architectures and large-scale training?
- Why do naive pruning and quantization destroy LLM performance so easily?
- Why does recursion on latent state drive generalization better than hierarchy?
- What makes recurrent depth enable compositional generalization across tasks?
- Why is latent-level prediction more sample-efficient than token-level prediction?
- What makes looped latent computation more efficient than scaling attention capacity?
- Can intentional data-mixture design replace model scaling for rare task learning?
- Can a two-layer network outgeneralize billion-parameter models through recursion alone?
- What power-law scaling patterns emerge when consistency models are trained at scale?
- What empirical evidence supports the Learning Law on real language models?
- Why do small specialized models match frontier multimodal models on screen tasks?
- Can spiking sparsity replace weight quantization as a primary efficiency lever?
- Why does architecture matter more than training compute for inference efficiency?
- Can attention linearity achieve similar efficiency gains as weight quantization?
- Can architectural changes reduce representational inequality in unified generators?
Related concepts in this collection (4)
- What actually limits language models on mobile phones?
  Is the shift toward smaller LLMs driven by quality trade-offs, or by hard physical constraints on device memory and battery life? This note examines whether sub-billion models are a practical necessity rather than a compromise.
  extends: same MobileLLM source; this note answers WHY sub-billion is the regime, depth-vs-width answers HOW to design within it
- Does recomputing weights cost less than moving them on mobile?
  Explores whether mobile hardware's memory bottleneck makes it cheaper to recompute transformer blocks than to fetch their weights twice, and whether this trades accuracy for efficiency.
  extends: same MobileLLM paper; depth wins partly because depth-with-shared-weights can be deeper than depth-with-distinct-weights at fixed parameter count; the two design moves compound
- Can architecture choices improve inference efficiency without sacrificing accuracy?
  Standard scaling laws optimize training efficiency but ignore inference cost. This explores whether architectural variables like hidden size and attention configuration can unlock inference gains without trading off model accuracy under fixed training budgets.
  extends: both reject regime-blind scaling laws; this note shows depth-width trade-offs flip in the small regime; conditional scaling laws formalize how architecture variables modulate the law
- Can parallel architectures solve inherently sequential problems?
  Complexity theory suggests some problems like reasoning and planning are fundamentally sequential. Can parallel architectures like Transformers overcome this limitation, or do we need fundamentally different computational approaches?
  extends: gives a theoretical reason to prefer depth (serial composition) over width (parallel breadth) for capability-bounded models
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
- Hierarchical Reasoning Model
- Scaling Laws for Neural Language Models
- Nested Learning: The Illusion of Deep Learning Architectures
- Bigger is not always better: The importance of human-scale language modeling for psycholinguistics
- Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models
- Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs
- Large Language Diffusion Models
Original note title
depth beats width for sub-billion parameter LLMs — contradicting Kaplan scaling laws because deep-and-thin captures abstract concepts better at small scale