INQUIRING LINE

How should tiny language models be architected differently than large ones?

This explores how the design choices for very small (sub-billion-parameter) language models should diverge from the scale-everything playbook used for frontier models — not just shrinking, but rethinking shape, training, and role.


This reads the question as asking what genuinely changes when you build a tiny model rather than a large one, and the corpus suggests the answer starts with physics, not preference. Sub-billion-parameter models exist because phones cannot sustainably host anything bigger: a 7B model drains a 50 kJ battery in under two hours, while a 350M model can run conversational AI all day on the same device, so DRAM and battery limits, not a quality compromise, dictate the size class (What actually limits language models on mobile phones?). Once that constraint is fixed, the interesting design questions begin.
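The battery figures can be sanity-checked with a few lines of arithmetic. This is a rough sketch, not a measurement: only the 50 kJ battery and the two model sizes come from the text, while the energy cost of about 0.1 J per token per billion parameters and the decode rate of 10 tokens per second are assumptions chosen for illustration.

```python
# Back-of-the-envelope battery life for on-device decoding.
BATTERY_JOULES = 50_000          # 50 kJ battery, from the text
JOULES_PER_TOKEN_PER_B = 0.1     # ASSUMPTION: ~0.1 J/token per billion parameters
TOKENS_PER_SECOND = 10           # ASSUMPTION: steady conversational decode rate


def hours_of_decoding(params_billions: float) -> float:
    """Hours of continuous decoding before the battery is empty."""
    watts = params_billions * JOULES_PER_TOKEN_PER_B * TOKENS_PER_SECOND
    return BATTERY_JOULES / watts / 3600


for name, size in [("7B", 7.0), ("350M", 0.35)]:
    print(f"{name}: {hours_of_decoding(size):.1f} h")
# prints:
# 7B: 2.0 h
# 350M: 39.7 h
```

Under these assumptions the 7B model lasts just under two hours and the 350M model well over a day, which matches the shape of the claim above. Different hardware or decode rates would move the numbers but not the 20× gap, since energy here scales linearly with parameter count.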

The sharpest architectural finding is that a scaling intuition from large models stops holding. For frontier models, the Kaplan-style view is that the width-vs-depth split matters little at a fixed parameter count; at the tiny scale it matters a lot. MobileLLM shows deep-and-thin designs gaining 2.7–4.3% accuracy over balanced designs at 125M–350M, because stacking more layers lets the model compose abstract concepts through depth rather than spread parameters across width (Does depth matter more than width for tiny language models?). So the first concrete answer: spend your scarce parameter budget on layers, not channels.
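To make "layers, not channels" concrete, the sketch below compares two shapes that cost roughly the same number of parameters. It uses the common approximation of 12·d² non-embedding parameters per Transformer layer (4·d² for the attention projections plus 8·d² for an MLP with 4× expansion). The `Shape` class and both configurations are hypothetical illustrations, not MobileLLM's exact architecture, and embeddings are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    """A hypothetical Transformer shape: number of layers and hidden width."""
    layers: int
    width: int

    def block_params(self) -> int:
        # Approximate non-embedding parameters:
        # 4*d^2 (Q, K, V, O projections) + 8*d^2 (MLP with 4x expansion) per layer.
        return 12 * self.layers * self.width ** 2


# ASSUMPTION: illustrative shapes chosen to land on nearly the same budget.
deep_thin = Shape(layers=30, width=576)
shallow_wide = Shape(layers=12, width=912)

for name, shape in [("deep-thin", deep_thin), ("shallow-wide", shallow_wide)]:
    print(f"{name}: {shape.layers} layers x {shape.width} wide "
          f"= {shape.block_params() / 1e6:.1f}M params")
```

Both shapes come out near 119–120M parameters, so the choice between them is purely about how the budget is arranged. Because cost grows with the square of width but only linearly with depth, trimming width modestly buys a large number of extra layers.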

The second shift is in how you train the model, not just how you shape it. A small model left to imitate outputs (SFT) tends to fail on rigid, format-sensitive tasks like function calling. But training it via DPO on a large teacher's correct *and* incorrect examples, which gives it explicit negative examples to push away from, lets small models match much larger ones on logical and mathematical function-calling tasks (Can small models match large models on function calling?). The lesson is that tiny models lean harder on distillation and preference signal to recover what they can't brute-force from scale.
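The role of the negative example is visible in the DPO objective itself. Below is a minimal sketch of the per-pair DPO loss on scalar sequence log-probabilities, written in plain Python rather than a training framework. The function name, argument names, the `beta` value, and the example numbers are all illustrative assumptions; in practice the log-probabilities would come from the small policy model and a frozen reference copy.

```python
import math


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss from sequence log-probabilities.

    'chosen' is the teacher's correct function call, 'rejected' the incorrect one.
    """
    # How much more the policy favors each answer than the reference does.
    chosen_gain = policy_chosen - ref_chosen
    rejected_gain = policy_rejected - ref_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)) == log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# ASSUMPTION: made-up log-probabilities for one (correct, incorrect) pair.
untrained = dpo_loss(-6.0, -6.0, ref_chosen=-6.0, ref_rejected=-6.0)
improved = dpo_loss(-5.0, -9.0, ref_chosen=-6.0, ref_rejected=-6.0)
print(f"untrained: {untrained:.3f}, improved: {improved:.3f}")
```

The loss falls both when the correct call becomes more likely and when the incorrect call becomes less likely, relative to the reference. SFT only has the first lever, which is one way to read why it underperforms on rigid output formats.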

The third shift is about role, and it reframes the whole question. You may not need one tiny model to do everything. Most agentic work consists of repetitive, well-defined language tasks that a small model handles at 10–30× lower cost, which makes heterogeneous systems (small models by default, large ones called only when needed) the economically rational architecture (Can small language models handle most agent tasks?). So 'architect the tiny model differently' partly means architect the *system* differently: build a fleet, not a monolith.
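One way to picture "small by default, large only when needed" is a router that tries the small model first and escalates when a cheap check rejects the draft. The sketch below is a toy under stated assumptions: `route`, `relative_cost`, the stub models, and the acceptance check are all hypothetical, and the 20× cost ratio is simply a value inside the 10–30× range quoted above.

```python
from typing import Callable, Tuple


def route(task: str,
          small: Callable[[str], str],
          large: Callable[[str], str],
          accept: Callable[[str], bool]) -> Tuple[str, str]:
    """Try the small model first; escalate to the large one if the draft is rejected."""
    draft = small(task)
    if accept(draft):
        return draft, "small"
    return large(task), "large"


def relative_cost(escalation_rate: float, cost_ratio: float) -> float:
    """Expected cost per task relative to always calling the large model.

    Every task pays for one small call (cost 1); escalated tasks also pay
    for a large call (cost `cost_ratio`).
    """
    return (1 + escalation_rate * cost_ratio) / cost_ratio


# ASSUMPTION: stub models; the small one gives up on tasks marked "hard".
small_model = lambda task: "" if "hard" in task else f"small:{task}"
large_model = lambda task: f"large:{task}"
non_empty = lambda draft: bool(draft)

print(route("format a date", small_model, large_model, non_empty))
print(route("hard planning problem", small_model, large_model, non_empty))
print(f"{relative_cost(escalation_rate=0.1, cost_ratio=20):.2f}")
```

With a 10% escalation rate and a 20× cost ratio, the fleet costs about 15% of an always-large system in this model. The savings depend heavily on how cheaply and reliably the acceptance check can be done, which this sketch does not address.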

What won't save a tiny model is also worth knowing. Several ceilings the corpus identifies are scale-independent: constraint satisfaction plateaus at ~55–60% regardless of parameter count (Do larger language models solve constrained optimization better?), and models pattern-match instead of running real iterative computation no matter how big they get (Do large language models actually perform iterative optimization?). Going small costs you little on tasks where going large already gains nothing. And for the one place size usually matters most, long context, the bottleneck turns out to be the compute needed to consolidate context into state, not raw memory capacity (Is long-context bottleneck really about memory or compute?), while decoupled neural-memory designs like Titans scale to 2M+ tokens by separating compressed long-term memory from quadratic short-term attention (Can neural memory modules scale language models beyond attention limits?). For a tiny model, that points to a parting design principle: stop trying to make attention do everything, and move long-horizon work into separate, cheaper modules.


Sources (8 notes)

What actually limits language models on mobile phones?

Smartphones' DRAM budgets and battery capacity make sub-billion-parameter models the only sustainable option for mobile deployment. A 7B model drains a 50kJ battery in under two hours, while a 350M model can run conversational AI for a full day on the same device.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Can small language models handle most agent tasks?

SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher auditing tiny-model architecture claims against the current frontier. The question remains open: **How should sub-billion-parameter language models be architected differently than large ones?**

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026 and include:
- Depth beats width for tiny models: 2.7–4.3% gains at 125M–350M params by stacking layers over channels, inverting Kaplan scaling intuitions (~2024).
- DPO training on teacher errors lets small models match larger ones on function-calling and reasoning tasks (~2024).
- Heterogeneous agentic systems—routing simple tasks to small models (10–30× cost savings)—are economically rational (~2025).
- Constraint-satisfaction plateaus at 55–60% regardless of scale; iterative computation remains out of reach (~2024–2025).
- Long-context bottleneck is compute to consolidate context into state; neural-memory modules (Titans) decouple cheap storage from expensive attention (~2025).

Anchor papers (verify; mind their dates):
- arXiv:2402.14905 (MobileLLM, Feb 2024)
- arXiv:2410.18890 (Small-model function-calling via DPO, Oct 2024)
- arXiv:2506.02153 (Small models in agentic systems, Jun 2025)
- arXiv:2501.00663 (Titans, Jan 2025)

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For depth-vs-width, DPO effectiveness, agentic routing, and constraint-satisfaction ceilings: have newer scaling laws, training methods (RL beyond DPO), or multi-agent orchestration patterns since resolved or shifted these limits? Separate the durable question (small-model efficiency under hardware constraints) from perishable findings (specific architectural trade-offs). Cite what resolved each.
(2) **Surface strongest contradicting work from the last ~6 months.** If recent papers show wide-shallow models regaining parity, or constraint-satisfaction broken by novel reasoning architectures, name them and explain the disagreement.
(3) **Propose 2 research questions that assume the regime may have moved:** e.g., "Do small models trained on synthetic multi-step reasoning data sidestep the depth-vs-width inversion?" or "Can adaptive compute-optimal routing at inference time make monolithic tiny models viable again?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
