INQUIRING LINE

What mobile hardware constraints force the sub-billion parameter regime?

This explores the actual physical limits — memory and power — that make sub-billion-parameter models the only practical option on a phone, rather than a quality compromise developers settle for.


The corpus is blunt about it: the constraints are DRAM budget and battery capacity, not a preference for smaller, weaker models. A 7-billion-parameter model drains a typical 50 kJ phone battery in under two hours, while a 350M model can run conversational AI for a full day on the same hardware (see "What actually limits language models on mobile phones?"). Once you frame it as energy-per-token rather than accuracy-per-parameter, the sub-billion ceiling stops looking arbitrary and starts looking like physics.
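The battery arithmetic behind those figures can be sketched in a few lines. The 50 kJ capacity comes from the text above; the per-token energy costs (0.7 J/token for a 7B model, 0.035 J/token for a 350M model) and the 10 tokens/s decoding rate are assumptions in the spirit of the figures reported in the MobileLLM paper (arXiv:2402.14905), so verify them before relying on the result.

```python
BATTERY_JOULES = 50_000          # 50 kJ, from the text above
TOKENS_PER_SECOND = 10           # assumed decoding rate

# Assumed energy cost per generated token, in joules (illustrative).
ENERGY_PER_TOKEN = {"7B": 0.7, "350M": 0.035}


def hours_of_generation(joules_per_token: float,
                        battery_joules: float = BATTERY_JOULES,
                        tokens_per_second: float = TOKENS_PER_SECOND) -> float:
    """Hours of continuous decoding before the battery is empty."""
    watts = joules_per_token * tokens_per_second   # J/s drawn while decoding
    return battery_joules / watts / 3600


for name, cost in ENERGY_PER_TOKEN.items():
    print(f"{name}: {hours_of_generation(cost):.1f} h of continuous generation")
```

Under these assumptions the 7B model lasts just under two hours of nonstop decoding, while the 350M model lasts well over a day. The 20x gap follows directly from the 20x difference in energy per token.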

What's interesting is that the same memory wall reshapes how the model should be built, not just how big it is. On a phone the bottleneck is moving weights through memory, not computing with them, so MobileLLM found that running one transformer block twice costs less latency than fetching a second block's weights from memory, buying accuracy with zero extra parameters (see "Does recomputing weights cost less than moving them on mobile?"). The hardware constraint flips an intuition: compute is the cheap resource on-device, memory movement is the expensive one.
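A toy model of that trade-off, assuming decoding is purely memory-bound so that latency tracks the bytes of weights fetched from DRAM. The block size, layer counts, and one-byte-per-weight figure are hypothetical, and the accounting ignores activations, the KV cache, and compute time. It only illustrates why running a cached block twice adds depth without adding weight traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A transformer block reduced to the one thing that matters here: its size."""
    name: str
    weight_bytes: int


def weight_traffic(schedule: list[Block]) -> int:
    """Bytes fetched from DRAM for one forward pass.

    Assumes on-chip cache holds exactly one block: running the same block
    again immediately is free, switching to another block costs a full fetch.
    """
    fetched = 0
    cached = None
    for block in schedule:
        if block is not cached:
            fetched += block.weight_bytes
            cached = block
    return fetched


BLOCK_BYTES = 10_000_000  # hypothetical: 10M weights at one byte each
unique_blocks = [Block(f"block{i}", BLOCK_BYTES) for i in range(15)]

# Baseline: 15 distinct blocks, each run once.
baseline = list(unique_blocks)
# Adjacent-block sharing: each block run twice in a row (30 layers deep).
shared = [b for b in unique_blocks for _ in range(2)]
# Same 30-layer depth with 30 distinct blocks, i.e. double the parameters.
doubled = [Block(f"block{i}", BLOCK_BYTES) for i in range(30)]

for label, schedule in [("baseline", baseline), ("shared", shared), ("doubled", doubled)]:
    print(f"{label}: depth={len(schedule)}, bytes fetched={weight_traffic(schedule):,}")
```

In this simplified accounting the shared schedule reaches the depth of the doubled model while fetching no more weight bytes than the baseline.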

That same pressure overturns a piece of scaling orthodoxy. At the 125M–350M scale, going deep-and-thin beats spreading the same parameter budget across width, yielding 2.7–4.3% accuracy gains by composing abstract concepts layer by layer (see "Does depth matter more than width for tiny language models?"). The Kaplan scaling laws that hold for datacenter models don't govern the phone regime: when parameters are capped by DRAM, you spend them differently.
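To make "the same parameter budget" concrete, the sketch below uses the common approximation that a transformer's non-embedding parameter count is about 12 · n_layers · d_model². The two configurations are hypothetical and are not the MobileLLM architectures; they only show that very different shapes can hold identical weight counts.

```python
def approx_params(n_layers: int, d_model: int) -> int:
    """Approximate non-embedding parameters: 4*d^2 attention + 8*d^2 MLP per layer.

    Assumes a standard block with a 4x feed-forward expansion; embeddings,
    biases, and norms are ignored.
    """
    return 12 * n_layers * d_model ** 2


wide_and_shallow = approx_params(n_layers=12, d_model=768)
deep_and_thin = approx_params(n_layers=27, d_model=512)

print(f"12 layers x 768 wide: {wide_and_shallow:,} parameters")
print(f"27 layers x 512 wide: {deep_and_thin:,} parameters")
```

Both shapes come to roughly 85M parameters under this approximation, so choosing between them is purely a question of how to spend a fixed DRAM budget.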

The deeper lesson the corpus offers is that 'just make the model smaller' isn't the only escape hatch. You can move intelligence off the parameter axis entirely: spend more compute at inference time instead of more weights, which lets small models match larger ones on hard prompts (see "Can inference compute replace scaling up model size?"). Or keep the small model on-device and route only the genuinely hard queries to a bigger model elsewhere, cutting cost by 40–50% by predicting query difficulty before generation (see "Can routers select the right model before generation happens?"). The mobile constraint, read this way, isn't just a size limit: it's a forcing function pushing capability out of raw parameter count and into architecture, inference-time compute, and where computation physically happens.
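The routing idea can be sketched as a threshold on a difficulty score computed before any tokens are generated. Everything here is hypothetical: `estimate_difficulty` is a stand-in heuristic, not the learned router from RouteLLM or Hybrid-LLM, and the threshold and model names are placeholders.

```python
from typing import Callable

# Hypothetical cue words that hint at a hard query.
HARD_MARKERS = ("prove", "derive", "step by step", "debug", "optimize")


def estimate_difficulty(query: str) -> float:
    """Toy difficulty score in [0, 1]; a real router would be a trained classifier."""
    text = query.lower()
    length_score = min(len(text.split()) / 50, 1.0)
    marker_score = 1.0 if any(m in text for m in HARD_MARKERS) else 0.0
    return 0.5 * length_score + 0.5 * marker_score


def route(query: str,
          threshold: float = 0.4,
          scorer: Callable[[str], float] = estimate_difficulty) -> str:
    """Pick exactly one model before generation starts."""
    return "remote-large" if scorer(query) >= threshold else "on-device-small"


for q in ["What time is it in Tokyo?",
          "Prove that the sum of two even integers is even, step by step."]:
    print(f"{route(q):>15}  <-  {q}")
```

Because the decision is made from the query alone, only one model ever runs per request, which is what keeps latency below that of cascades that generate first and escalate afterwards.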


Sources (5 notes)

What actually limits language models on mobile phones?

Smartphones' DRAM budgets and battery capacity make sub-billion-parameter models the only sustainable option for mobile deployment. A 7B model drains a 50kJ battery in under two hours, while a 350M model can run conversational AI for a full day on the same device.

Does recomputing weights cost less than moving them on mobile?

MobileLLM shows that on memory-bound mobile hardware, sharing weights between adjacent transformer blocks by recomputing one block twice uses less latency than fetching separate weights, gaining accuracy with no parameter increase.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Can inference compute replace scaling up model size?

Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.

Can routers select the right model before generation happens?

RouteLLM and Hybrid-LLM both achieve 40-50% cost reduction by routing to a single model based on query difficulty prediction, not response evaluation. Single-model routing minimizes latency compared to ensemble or cascade alternatives.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mobile-AI systems researcher. The question remains live: What physical hardware constraints actually force the sub-billion-parameter ceiling on phones, and can those constraints be circumvented rather than accepted?

What a curated library found — and when (findings span Feb 2024–Mar 2026, treat as dated claims):
• DRAM and battery capacity, not quality preference, impose the sub-billion ceiling: a 7B model drains a 50 kJ phone battery in <2 hrs; a 350M model runs all day (2024-02).
• On mobile, memory movement—not compute—is the bottleneck; recomputing a transformer block twice is cheaper than fetching a second block's weights, decoupling parameter count from accuracy (2024-02).
• At 125M–350M scale, deep-and-thin architectures beat width-spreading by 2.7–4.3%, contradicting datacenter scaling laws once parameters are DRAM-capped (2024-02).
• Test-time compute can substitute for parameter scaling on hard prompts; intelligent routing predicts query difficulty pre-generation, cutting inference cost 40–50% (2024-04, 2024-12).
• Recent work (2025–2026) explores test-time reasoning depth, parameter selectivity, and recursive inference as alternatives to growing model size.

Anchor papers (verify; mind their dates):
• arXiv:2402.14905 (Feb 2024) — MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
• arXiv:2404.14618 (Apr 2024) — Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing
• arXiv:2502.05171 (Feb 2025) — Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
• arXiv:2512.24601 (Dec 2025) — Recursive Language Models

Your task:
(1) RE-TEST EACH CONSTRAINT. For the Feb 2024 DRAM and battery claims: have newer mobile chips (e.g., Snapdragon X, Apple Neural Engine 2025+), quantization breakthroughs (sub-int8), or memory-mapping techniques since relaxed those limits? Does the memory-movement bottleneck still dominate compute on current flagship hardware, or has it shifted? Assess whether depth-beating-width still holds at 125M–350M or if new training regimes have flattened that advantage. Separate the durable physics (mobile DRAM is finite) from the perishable engineering (current DRAM footprints).
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—esp. around test-time compute, recursive inference (arXiv:2512.24601), or parameter selectivity (arXiv:2508.21741). Do any of these make the sub-billion ceiling obsolete, or do they require offline pre-computation that breaks the mobile constraint?
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) If test-time compute becomes tractable on-device via hardware acceleration (TPU/GPU in-phone), does sub-billion still dominate, or does per-token latency and energy become the new ceiling? (b) Can hybrid architectures—tiny on-device + lightweight server calls—let developers exceed the sub-billion Pareto front without violating latency or privacy SLOs?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
