SYNTHESIS NOTE

What actually limits language models on mobile phones?

Is the shift toward smaller LLMs driven by quality trade-offs, or by hard physical constraints on device memory and battery life? This note examines whether sub-billion models are a practical necessity rather than a compromise.

Synthesis note · 2026-05-03 · sourced from Mobile

The push to build sub-billion-parameter LLMs is often framed as a quality-cost trade-off, but MobileLLM's framing is sharper: at current device specs, larger models simply cannot run sustainably on phones. Modern smartphones have 6 to 12 GB of DRAM (the iPhone 15 has 6 GB, the Pixel 8 Pro has 12 GB), and any single app should use no more than 10 percent of it because memory is shared with the OS and other apps. An 8-bit-quantized LLaMA 7B, at roughly 7 GB of weights alone, far exceeds this budget. Energy is the second binding constraint: at roughly 0.1 joules per token per billion parameters, a 7B model consumes 0.7 J/token, so a fully charged iPhone with about 50 kJ of energy can sustain that model for less than two hours of conversation at 10 tokens per second, with every 64 tokens draining 0.2 percent of the battery.
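The memory and battery arithmetic above can be checked with a short sketch. It uses only the figures quoted in this note (10 percent app budget, 0.1 J per token per billion parameters, a 50 kJ battery, 10 tokens per second). The function names are hypothetical, and the simplifications are assumptions: decimal gigabytes, weights only (no activations or KV cache), and the whole battery spent on inference.

```python
GB = 1e9  # assumption: decimal gigabytes; the note does not distinguish GB from GiB


def weight_bytes(params: float, bits_per_weight: int = 8) -> float:
    """Approximate size of the weights alone (ignores activations and KV cache)."""
    return params * bits_per_weight / 8


def app_budget_bytes(dram_gb: float, app_fraction: float = 0.10) -> float:
    """DRAM one app may use under the 10 percent rule of thumb quoted in the note."""
    return dram_gb * GB * app_fraction


def hours_of_decoding(
    params: float,
    battery_joules: float = 50_000.0,   # ~50 kJ, figure from the note
    joules_per_token_per_b: float = 0.1,  # figure from the note
    tokens_per_second: float = 10.0,
) -> float:
    """Hours of continuous decoding if the whole battery went to inference."""
    joules_per_token = joules_per_token_per_b * params / 1e9
    total_tokens = battery_joules / joules_per_token
    return total_tokens / tokens_per_second / 3600


llama_7b = 7e9
for dram_gb in (6, 12):
    fits = weight_bytes(llama_7b) <= app_budget_bytes(dram_gb)
    print(f"{dram_gb} GB phone: budget {app_budget_bytes(dram_gb) / GB:.1f} GB, "
          f"7B int8 weights {weight_bytes(llama_7b) / GB:.1f} GB, fits={fits}")

print(f"7B battery life at 10 tok/s: {hours_of_decoding(llama_7b):.2f} h")
```

Under these assumptions the budget is 0.6 GB on a 6 GB phone and 1.2 GB on a 12 GB phone, so an 8-bit model must stay near or below roughly one billion parameters before activations and cache are counted.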

These numbers reframe sub-billion LLMs as the only practical regime for mobile deployment rather than as a compromise. A 350M 8-bit model at 0.035 J/token can support conversational use for a full day on the same battery, and a 125M model can run at 50 tokens per second on-device versus 3 to 6 tokens per second for LLaMA 7B running through MLC Chat. The decoding speed advantage compounds the energy advantage: faster generation means the system spends less time in a high-power inference state.
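To make the comparison concrete, the sketch below estimates how many tokens a single charge can fund at each model size. It assumes energy scales linearly with parameter count at the note's 0.1 J per token per billion parameters and that the full 50 kJ battery is available for inference; both are simplifications, and the helper names are hypothetical.

```python
def joules_per_token(params: float, joules_per_token_per_b: float = 0.1) -> float:
    """Energy per generated token, assuming linear scaling with parameter count."""
    return joules_per_token_per_b * params / 1e9


def tokens_per_charge(params: float, battery_joules: float = 50_000.0) -> float:
    """Tokens one full charge could fund if all energy went to inference."""
    return battery_joules / joules_per_token(params)


# Model sizes mentioned in the note.
models = {"125M": 125e6, "350M": 350e6, "7B": 7e9}

for name, params in models.items():
    print(f"{name:>4}: {joules_per_token(params):.4f} J/token, "
          f"{tokens_per_charge(params):,.0f} tokens per charge")

# Ratio of token budgets between the 350M and 7B models.
ratio = tokens_per_charge(models["350M"]) / tokens_per_charge(models["7B"])
print(f"350M funds about {ratio:.0f}x as many tokens as 7B")
```

Because the assumed scaling is linear, the token budget ratio equals the inverse of the parameter ratio: a 350M model gets about 20 times the tokens per charge of a 7B model.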

The macro-scale argument is also striking: deploying GPT-4-class models for the daily AI usage of every individual would require around 100 million H100 GPUs at 60 TFLOPS each, equivalent to roughly 160 Meta-scale companies. Mobile inference is not just a UX preference; it is the energy-feasible path to ubiquitous LLM use. The constraint flips the design question: instead of "how big can we make this model," the right question becomes how a model under one billion parameters should be architected for the regime it must run in. That is the question the related notes "Does depth matter more than width for tiny language models?" and "Does recomputing weights cost less than moving them on mobile?" both answer.

Inquiring lines that read this note 9

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

Can model routing outperform monolithic scaling as an efficiency strategy?

Can routing enable heterogeneous SLM-first architectures at scale?

When does architectural design matter more than raw model capacity?

How does example difficulty affect learning efficiency in language models?

Can smaller models actually perform well on specific downstream tasks?

How should personalization be implemented to improve AI assistant effectiveness?

How do input length constraints reshape personalization system design choices?

Why do continual learning scenarios trigger catastrophic forgetting and interference?

Why does recomputing weights cost less than moving them on phones?

Related concepts in this collection 5

Does depth matter more than width for tiny language models?
Explores whether deep-and-thin architectures outperform wide-and-shallow ones at sub-billion scales, and why this might contradict larger-model scaling laws.
extends: this note establishes WHY sub-billion is the operative regime; depth-vs-width answers HOW to architect within it

Does recomputing weights cost less than moving them on mobile?
Explores whether mobile hardware's memory bottleneck makes it cheaper to recompute transformer blocks than to fetch their weights twice, and whether this trades accuracy for efficiency.
extends: weight sharing is the design move that addresses the DRAM bandwidth constraint named here

Can small language models handle most agent tasks?
Explores whether smaller, cheaper models are actually sufficient for the repetitive, scoped work that dominates deployed agent systems, rather than relying on large models by default.
extends: complementary economic argument from the agent side: even where the device has compute headroom, SLMs are economically and architecturally preferable for most subtasks

Can architecture choices improve inference efficiency without sacrificing accuracy?
Standard scaling laws optimize training efficiency but ignore inference cost. This explores whether architectural variables like hidden size and attention configuration can unlock inference gains without trading off model accuracy under fixed training budgets.
extends: gives the formal frame for inference-cost-aware scaling; the DRAM and battery facts here are precisely the variables that conditional scaling laws should incorporate

Can inference compute replace scaling up model size?
Explores whether smaller models given more thinking time during inference can match larger models. Matters because it reshapes deployment economics and compute allocation strategies.
extends: opens an escape hatch for the constraint, since small models can recover capability at inference via test-time compute, partially neutralizing the parameter ceiling

Original note title

sub-billion parameter LLMs are forced by mobile DRAM and battery constraints not by quality preference — a 7B model drains a phone in under two hours
