INQUIRING LINE

Why does recomputing weights cost less than moving them on phones?

This explores why, on a phone, it's cheaper to run a neural network block twice than to load a fresh set of weights for the second block — and what that reveals about where mobile AI actually spends its budget.


This explores why recomputing weights can beat moving them on a phone — a result that only makes sense once you see that mobile hardware is bottlenecked by memory traffic, not by arithmetic. MobileLLM's block-wise weight sharing is the direct answer: by reusing one transformer block's weights and simply running that block twice, the model skips the costly step of fetching a separate set of weights from memory. The math of the extra pass is nearly free; the data movement it avoids is what was expensive. The result is a small accuracy gain with zero increase in parameter count Does recomputing weights cost less than moving them on mobile?.

The reason this trade works is that phones are memory- and battery-bound in ways that desktops and servers are not. DRAM budgets and battery capacity, not a preference for smaller or worse models, are what force mobile models below a billion parameters — a 7B model can drain a 50kJ battery in under two hours, while a 350M model runs conversational AI all day What actually limits language models on mobile phones?. When energy and memory bandwidth are the scarce resource, every weight you fetch costs more than the compute you'd spend regenerating its effect. That inverts the usual server intuition where compute is the thing you ration.

What's quietly interesting here is that this is one instance of a broader pattern: on constrained hardware, the architecture itself — not just the parameter count — is a lever you can tune for efficiency. Folding architectural variables like hidden size, MLP-to-attention ratio, and attention grouping into scaling laws lets you optimize specifically for inference, yielding large throughput gains while improving accuracy Can architecture choices improve inference efficiency without sacrificing accuracy?. Block sharing is a hand-designed version of the same move: spend the cheap resource (compute) to conserve the expensive one (memory traffic).

The deeper takeaway is that "cost" in machine learning is never absolute — it's set by whichever resource is scarcest on your hardware. The same logic shows up where sparsity lets you train bigger models within a fixed compute budget rather than paying quality for speed Does sparse attention trade off quality for speed?, and where persistent agents shift the meaningful cost denominator from tokens to finished artifacts once context can be cached and reused Do persistent agents really cost less per token?. Recomputing-over-moving on a phone is the same insight read through the lens of memory bandwidth: optimize for the bottleneck you actually have, not the one the textbook assumes.


Sources 5 notes

Does recomputing weights cost less than moving them on mobile?

MobileLLM shows that on memory-bound mobile hardware, sharing weights between adjacent transformer blocks by recomputing one block twice uses less latency than fetching separate weights, gaining accuracy with no parameter increase.

What actually limits language models on mobile phones?

Smartphones' DRAM budgets and battery capacity make sub-billion-parameter models the only sustainable option for mobile deployment. A 7B model drains a 50kJ battery in under two hours, while a 350M model can run conversational AI for a full day on the same device.

Can architecture choices improve inference efficiency without sacrificing accuracy?

Augmenting scaling laws with hidden size, MLP-to-attention ratio, and GQA configuration enables architecture optimization for inference. Optimized models achieved up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets.

Does sparse attention trade off quality for speed?

The Sparse Frontier benchmark shows that at equivalent compute cost, larger sparse-attention models outperform smaller dense models on long-context tasks. Sparsity lets you train bigger models within the same budget, making it Pareto-improving rather than a pure trade-off.

Do persistent agents really cost less per token?

A 115-day case study found 82.9% of tokens were cache reads. When context persists and reuses, the meaningful cost denominator becomes completed artifacts, not individual tokens.

Research prompt for your LLMexpand ↓

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mobile ML systems analyst. The question remains open: under what hardware constraints does recomputing weights cost less than fetching them, and how far can this principle be pushed?

What a curated library found — and when (dated claims, not current truth): These findings span 2022–2026; treat them as constraints to re-test, not settled facts.
• Block-wise weight sharing (reusing one transformer block, running it twice) saves memory bandwidth at negligible compute cost on mobile; accuracy gains without parameter inflation (2024-02, MobileLLM).
• Sub-billion-parameter models are forced by DRAM budget (~50kJ battery drain in <2h for 7B; 350M runs all day) and memory bandwidth scarcity, not quality preference (2024-02).
• Conditional scaling laws incorporating hidden size, MLP-to-attention ratio, and attention grouping optimize inference efficiency within fixed memory/compute budgets (2025-10).
• Sparse attention trade-offs and persistent agent caching shift the cost denominator from per-token to per-artifact (2025-04, 2026-05).

Anchor papers (verify; mind their dates):
• arXiv:2402.14905 (MobileLLM, 2024-02)
• arXiv:2510.18245 (Scaling Laws Meet Model Architecture, 2025-10)
• arXiv:2504.17768 (The Sparse Frontier, 2025-04)
• arXiv:2605.26870 (Persistent AI Agents, 2026-05)

Your task:
(1) RE-TEST EACH CONSTRAINT. For each claim above, determine whether newer training methods (e.g., knowledge distillation, quantization advances), inference harnesses (SDKs with smarter memory caching), or architectural innovations (e.g., mixture-of-experts gating, dynamic sparsity) have relaxed the memory-bandwidth bottleneck or shifted the recompute–fetch trade-off. Separate the durable question ("when is compute cheaper than bandwidth?") from the perishable limitation ("block sharing is the only viable tactic"). Cite what resolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — e.g., any paper showing that aggressive prefetching, on-device quantized caches, or hardware-aware compilation have made memory movement competitive with recompute again, or vice versa.
(3) Propose 2 research questions that ASSUME the regime may have shifted: (a) Can multi-core cache coherence or async memory pipelines on modern phones (Snapdragon 8 Gen 4, A18) narrow the compute–memory gap? (b) Does the recompute–fetch trade flip again if models are trained end-to-end for on-device sparsity patterns?

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Next inquiring lines