What actually limits language models on mobile phones?
Is the shift toward smaller LLMs driven by quality trade-offs, or by hard physical constraints on device memory and battery life? This note examines whether sub-billion models are a practical necessity rather than a compromise.
The push to build sub-billion-parameter LLMs is often framed as a quality-cost trade-off, but MobileLLM's framing is sharper: at current device specs, larger models simply cannot run sustainably on phones. Modern smartphones have 6 to 12 GB of DRAM (the iPhone 15 has 6 GB, the Pixel 8 Pro has 12 GB), and any single app should use no more than 10 percent of it, because memory is shared with the OS and other apps. That leaves a budget of roughly 0.6 to 1.2 GB, while an 8-bit-quantized LLaMA 7B needs about 7 GB for its weights alone. Energy is the second binding constraint: at roughly 0.1 joules per token per billion parameters, a 7B model consumes 0.7 J/token, so a fully charged iPhone with about 50 kJ of energy can sustain that model for less than two hours of conversation at 10 tokens per second, with every 64 tokens draining 0.2 percent of the battery.
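The two constraints above reduce to a few lines of arithmetic. The sketch below is a back-of-the-envelope check, not a measurement: it uses only the figures quoted in this note (10 percent app budget, 0.1 J per token per billion parameters, a 50 kJ battery, 10 tokens per second). It assumes decimal gigabytes and counts weights only, ignoring KV cache and activations. The function names are illustrative.

```python
GB = 1e9  # decimal gigabytes (assumption; keeps the arithmetic simple)


def weight_bytes(params: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone (no KV cache, no activations)."""
    return params * bits_per_weight / 8


def fits_app_budget(params: float, bits_per_weight: int, dram_gb: float,
                    app_fraction: float = 0.10) -> bool:
    """True if the weights fit in the share of DRAM one app may use."""
    budget = dram_gb * GB * app_fraction
    return weight_bytes(params, bits_per_weight) <= budget


def hours_per_charge(params: float,
                     battery_joules: float = 50_000,
                     joules_per_token_per_billion: float = 0.1,
                     tokens_per_second: float = 10.0) -> float:
    """Hours of continuous decoding on one charge, assuming energy per
    token scales linearly with parameter count (the note's rule of thumb)."""
    joules_per_token = joules_per_token_per_billion * params / 1e9
    tokens = battery_joules / joules_per_token
    return tokens / tokens_per_second / 3600


if __name__ == "__main__":
    for dram_gb in (6, 12):
        ok = fits_app_budget(7e9, 8, dram_gb)
        print(f"8-bit 7B fits the app budget on a {dram_gb} GB phone: {ok}")
    print(f"7B hours per charge: {hours_per_charge(7e9):.2f}")
```

Under these assumptions the 7B model misses the memory budget on both the 6 GB and the 12 GB phone, and the battery estimate lands just under two hours, matching the figure quoted above.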
These numbers reframe sub-billion LLMs as the only practical regime for mobile deployment rather than as a compromise. A 350M 8-bit model at 0.035 J/token can support conversational use for a full day on the same battery, and a 125M model can run at 50 tokens per second on-device, versus 3 to 6 tokens per second for LLaMA 7B running through MLC Chat. The decoding-speed advantage compounds the energy advantage: faster generation means the system spends less time in a high-power inference state.
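To make the comparison concrete, the sketch below applies the same linear energy rule of thumb across model sizes and adds the time needed to decode a fixed-length reply at the quoted speeds. The 50 kJ battery, 0.1 J per token per billion parameters, and the token rates come from this note. The 64-token reply length and the helper names are illustrative assumptions, and real power draw will vary with hardware and runtime.

```python
BATTERY_JOULES = 50_000             # approximate full iPhone charge, per this note
JOULES_PER_TOKEN_PER_BILLION = 0.1  # rule of thumb quoted in this note


def joules_per_token(params: float) -> float:
    """Energy per generated token, assumed linear in parameter count."""
    return JOULES_PER_TOKEN_PER_BILLION * params / 1e9


def hours_of_decoding(params: float, tokens_per_second: float = 10.0) -> float:
    """Hours of continuous generation one charge can sustain."""
    tokens = BATTERY_JOULES / joules_per_token(params)
    return tokens / tokens_per_second / 3600


def seconds_for_reply(reply_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time the device spends decoding one reply."""
    return reply_tokens / tokens_per_second


if __name__ == "__main__":
    for name, params in [("125M", 125e6), ("350M", 350e6), ("7B", 7e9)]:
        print(f"{name}: {joules_per_token(params):.4f} J/token, "
              f"{hours_of_decoding(params):.1f} h at 10 tokens/s")

    reply = 64  # illustrative reply length in tokens
    print(f"125M at 50 tokens/s: {seconds_for_reply(reply, 50):.2f} s per reply")
    print(f"7B at 3 to 6 tokens/s: {seconds_for_reply(reply, 6):.1f} to "
          f"{seconds_for_reply(reply, 3):.1f} s per reply")
```

Under these assumptions the 350M model sustains well over 24 hours of continuous decoding at 10 tokens per second, which is what backs the "full day" claim. The 64-token reply finishes in about a second on the 125M model, against ten seconds or more on the 7B model.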
The macro-scale argument is also striking: deploying GPT-4-class models for the daily AI usage of every individual would require around 100 million H100 GPUs at 60 TFLOPS each, equivalent to roughly 160 Meta-scale companies. Mobile inference is not just a UX preference; it is the energy-feasible path to ubiquitous LLM use. The constraint flips the design question. Instead of "how big can we make this model," the right question becomes the one that two related notes, "Does depth matter more than width for tiny language models?" and "Does recomputing weights cost less than moving them on mobile?", both answer: how should a model under one billion parameters be architected for the regime it must run in?
Inquiring lines that use this note as a source (9)
This note is a source for these synthesized inquiries.
- Can routing enable heterogeneous SLM-first architectures at scale?
- What constraints force mobile deployments to operate in the sub-billion parameter regime?
- Does the optimal model size depend on what capabilities you actually need?
- Can smaller models actually perform well on specific downstream tasks?
- How do input length constraints reshape personalization system design choices?
- What mobile hardware constraints force the sub-billion parameter regime?
- How should tiny language models be architected differently than large ones?
- Why does recomputing weights cost less than moving them on phones?
- Which architectural choices matter most when a model must fit one billion parameters?
Related concepts in this collection (5)
- Does depth matter more than width for tiny language models?
  Explores whether deep-and-thin architectures outperform wide-and-shallow ones at sub-billion scales, and why this might contradict larger-model scaling laws.
  extends: this note establishes WHY sub-billion is the operative regime; depth-vs-width answers HOW to architect within it
- Does recomputing weights cost less than moving them on mobile?
  Explores whether mobile hardware's memory bottleneck makes it cheaper to recompute transformer blocks than to fetch their weights twice, and whether this trades accuracy for efficiency.
  extends: weight sharing is the design move that addresses the DRAM bandwidth constraint named here
- Can small language models handle most agent tasks?
  Explores whether smaller, cheaper models are actually sufficient for the repetitive, scoped work that dominates deployed agent systems, rather than relying on large models by default.
  extends: complementary economic argument from the agent side: even where the device has compute headroom, SLMs are economically and architecturally preferable for most subtasks
- Can architecture choices improve inference efficiency without sacrificing accuracy?
  Standard scaling laws optimize training efficiency but ignore inference cost. This explores whether architectural variables like hidden size and attention configuration can unlock inference gains without trading off model accuracy under fixed training budgets.
  extends: gives the formal frame for inference-cost-aware scaling; the DRAM and battery facts here are precisely the variables that conditional scaling laws should incorporate
- Can inference compute replace scaling up model size?
  Explores whether smaller models given more thinking time during inference can match larger models. Matters because it reshapes deployment economics and compute allocation strategies.
  extends: opens an escape hatch for the constraint: small models can recover capability at inference via test-time compute, partially neutralizing the parameter ceiling
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
- GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
- Can Large Language Models Reason and Optimize Under Constraints?
- How Many Instructions Can LLMs Follow at Once?
- Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning
- Small Language Models are the Future of Agentic AI
- From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities
Original note title
sub-billion parameter LLMs are forced by mobile DRAM and battery constraints not by quality preference — a 7B model drains a phone in under two hours