How should tiny language models be architected differently than large ones?
This explores how the design choices for very small (sub-billion-parameter) language models should diverge from the scale-everything playbook used for frontier models — not just shrinking, but rethinking shape, training, and role.
This reads the question as asking what genuinely changes when you build a tiny model rather than a large one, and the corpus suggests the answer starts with physics, not preference. Sub-billion-parameter models exist because phones cannot sustain anything bigger: a 7B model drains a 50 kJ battery in under two hours, while a 350M model can run conversational AI all day on the same device, so DRAM and battery, not a willingness to compromise on quality, dictate the size class (see "What actually limits language models on mobile phones?"). Once that constraint is fixed, the interesting design questions begin.
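The battery claim can be checked with back-of-envelope arithmetic. The sketch below is illustrative only: the 50 kJ capacity and the two model sizes come from the text, while the energy cost of roughly 0.1 J per token per billion parameters and the decoding rate of 10 tokens per second are assumptions introduced here to make the numbers concrete.

```python
# Only the 50 kJ battery and the two model sizes come from the text.
# The energy-per-token figure and the decoding rate are ASSUMPTIONS.
BATTERY_JOULES = 50_000
JOULES_PER_TOKEN_PER_BILLION = 0.1  # assumption
TOKENS_PER_SECOND = 10              # assumption


def hours_of_decoding(params_billion: float) -> float:
    """Hours of continuous decoding before the battery is empty."""
    joules_per_token = JOULES_PER_TOKEN_PER_BILLION * params_billion
    tokens = BATTERY_JOULES / joules_per_token
    return tokens / TOKENS_PER_SECOND / 3600


for name, size in [("7B", 7.0), ("350M", 0.35)]:
    print(f"{name}: {hours_of_decoding(size):.1f} h of continuous decoding")
```

Under these assumptions the 7B model lasts just under two hours of nonstop generation, and the 350M model lasts well over a day, which is consistent with the figures quoted above. Real usage is intermittent, so wall-clock battery life would be longer in both cases.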
The sharpest architectural finding is that the scaling intuitions from large models break down. For frontier models, the Kaplan-style view that the split between width and depth matters little at a fixed parameter count holds well enough; at the tiny scale it does not. MobileLLM shows deep-and-thin architectures gaining 2.7–4.3% accuracy over balanced designs at 125M–350M, because stacking more layers lets the model compose abstract concepts through depth rather than spread parameters across width (see "Does depth matter more than width for tiny language models?"). So the first concrete answer: spend your scarce parameter budget on layers, not channels.
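To see what "layers, not channels" means at a fixed budget, the sketch below computes how wide a model can be for a given depth. It is a rough illustration, not MobileLLM's configuration: the per-block estimate (four attention projections plus a feed-forward layer with 4× expansion, no biases or norms), the 100M non-embedding budget, and the two depths are all assumptions chosen for the example.

```python
import math


def block_params(d_model: int, ffn_mult: int = 4) -> int:
    """Approximate weights in one transformer block (no biases, no norms).

    Attention: 4 * d^2 (Q, K, V and output projections).
    Feed-forward: 2 * ffn_mult * d^2 (up and down projections).
    """
    return 4 * d_model**2 + 2 * ffn_mult * d_model**2


def width_for_budget(budget: int, n_layers: int, ffn_mult: int = 4) -> int:
    """Largest d_model such that n_layers blocks fit within the budget."""
    per_d_squared = n_layers * (4 + 2 * ffn_mult)
    return math.isqrt(budget // per_d_squared)


BUDGET = 100_000_000  # hypothetical non-embedding parameter budget

for depth in (12, 30):  # illustrative "balanced" vs "deep-and-thin" depths
    width = width_for_budget(BUDGET, depth)
    total = depth * block_params(width)
    print(f"{depth} layers -> d_model <= {width}, ~{total / 1e6:.1f}M params")
```

Because block cost grows with the square of the width, going from 12 to 30 layers at the same budget only narrows the model by a factor of about 1.6. Depth is comparatively cheap to buy, which is part of why the deep-and-thin trade is available at all.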
The second shift is in how you train the model, not just how you shape it. A small model trained only to imitate outputs (SFT) tends to fail on rigid, format-sensitive tasks like function calling. Training it with DPO on a large teacher's correct *and* incorrect examples, so that it has explicit negative examples to push away from, lets small models match much larger ones on logical and mathematical function-calling tasks (see "Can small models match large models on function calling?"). The lesson is that tiny models lean harder on distillation and preference signals to recover what they cannot brute-force from scale.
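The role of the negative example is easiest to see in the DPO objective itself. The sketch below computes the standard per-pair DPO loss from summed log-probabilities of a chosen (correct) and a rejected (incorrect) completion. The value `beta=0.1` and the log-probabilities in the usage lines are made up for illustration, and none of it reflects the specific training setup behind the result described above.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumption: a commonly used default, not from the text
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    Each margin is how much more likely the policy makes a completion than
    the frozen reference model does, measured in log-probability.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    return softplus(-logits)  # equals -log(sigmoid(logits))


# Hypothetical log-probabilities for one function-calling pair.
untrained = dpo_loss(-2.0, -2.0, -2.0, -2.0)  # policy equals reference
improved = dpo_loss(-1.0, -5.0, -2.0, -2.0)   # correct call up, bad call down
print(f"untrained: {untrained:.3f}, improved: {improved:.3f}")
```

The loss falls both when the correct call becomes more likely and when the malformed call becomes less likely. SFT only provides the first of those two signals, which is one way to read why the negative examples help on format-sensitive outputs.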
The third shift is about role, and it reframes the whole question. You may not need one tiny model to do everything. Most agentic work consists of repetitive, well-defined language tasks that a small model handles at 10–30× lower cost, which makes heterogeneous systems (small models by default, large ones called only when needed) the economically rational architecture (see "Can small language models handle most agent tasks?"). So 'architect the tiny model differently' partly means architect the *system* differently: build a fleet, not a monolith.
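A small-by-default system can be sketched as a router with an escalation rule. Everything in the sketch below is hypothetical: the `route` function, the idea that the small model returns a usable confidence score, the 0.8 threshold, and the 20% escalation rate are illustrative assumptions. Only the 10–30× cost ratio comes from the text.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Route:
    answer: str
    used_large: bool


def route(
    task: str,
    small: Callable[[str], Tuple[str, float]],  # returns (answer, confidence)
    large: Callable[[str], str],
    threshold: float = 0.8,  # hypothetical escalation threshold
) -> Route:
    """Try the small model first; escalate only when it is unsure."""
    answer, confidence = small(task)
    if confidence >= threshold:
        return Route(answer, used_large=False)
    return Route(large(task), used_large=True)


def blended_cost(escalation_rate: float, cost_ratio: float) -> float:
    """Average cost per task relative to always calling the large model (1.0).

    Every task pays for the small model; escalated tasks also pay for the
    large one.
    """
    return 1.0 / cost_ratio + escalation_rate


# Stub models standing in for real ones.
result = route("extract the date", lambda t: ("2024-01-01", 0.95), lambda t: "n/a")
print(result)

for ratio in (10, 30):
    print(f"{ratio}x cheaper small model, 20% escalated: "
          f"{blended_cost(0.2, ratio):.2f} of large-only cost")
```

In this simple cost model, the escalation rate quickly dominates the total once the small model is an order of magnitude cheaper. The economics therefore depend less on the exact cost ratio than on how much of the workload the small model can genuinely handle.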
What won't save a tiny model is also worth knowing. Several ceilings the corpus identifies are scale-independent: constraint satisfaction plateaus at ~55–60% regardless of parameter count (see "Do larger language models solve constrained optimization better?"), and models pattern-match instead of running real iterative computation no matter how big they get (see "Do large language models actually perform iterative optimization?"). Going small costs you little on tasks where going large already gains nothing. And for the one place size usually matters most, long context, the bottleneck turns out to be the compute needed to consolidate context into state, not raw memory (see "Is long-context bottleneck really about memory or compute?"), while decoupled neural-memory designs like Titans handle 2M+ tokens by separating cheap long-term storage from expensive attention (see "Can neural memory modules scale language models beyond attention limits?"). For a tiny model, that points to a parting design principle: stop trying to make attention do everything, and move long-horizon work into separate, cheaper modules.
Sources (8 notes)
Smartphones' DRAM budgets and battery capacity make sub-billion-parameter models the only sustainable option for mobile deployment. A 7B model drains a 50kJ battery in under two hours, while a 350M model can run conversational AI for a full day on the same device.
MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.
Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.
Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.
Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.
Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.