How do conditional scaling laws incorporate hardware into architecture choices?
This explores a specific 2025 result — adding architecture variables to scaling laws so you can predict and optimize for inference cost on real hardware, not just training compute — and how that idea connects to a broader shift toward hardware- and memory-aware design.
Conditional scaling laws fold hardware realities into architecture decisions by adding architectural knobs to the usual compute-vs-accuracy curve, so you can predict what a model will actually cost to run, not just to train. The clearest demonstration in the corpus augments standard scaling laws with variables such as hidden size, the ratio of MLP to attention, and grouped-query attention (GQA) configuration. Once those variables are in the law, you can search for the architecture that is cheapest to serve at a fixed training budget; the optimized models reach up to 2.1% higher accuracy and 42% more throughput than a LLaMA-3.2 baseline trained on the same budget (see: Can architecture choices improve inference efficiency without sacrificing accuracy?). The key reframing is that training compute and inference compute are not the same resource, and optimizing for one does not optimize for the other.
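The search described above can be sketched in a few lines. This is a minimal illustration, not the method from the cited work: the functional form of `predicted_loss`, every constant in it, the candidate grid, and the use of KV-cache bytes per token as a serving-cost proxy are all assumptions made for the example, and the MLP-to-attention ratio is treated here as a parameter ratio.

```python
import math
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Arch:
    d_model: int      # hidden size
    mlp_ratio: float  # MLP parameters / attention parameters (assumed definition)
    n_heads: int      # query heads
    n_kv_heads: int   # key/value heads (GQA when fewer than n_heads)


def attn_params(a: Arch) -> int:
    """Attention parameters per layer: Q and O are d x d, K and V are d x d_kv."""
    head_dim = a.d_model // a.n_heads
    d_kv = a.n_kv_heads * head_dim
    return 2 * a.d_model * a.d_model + 2 * a.d_model * d_kv


def n_layers(a: Arch, n_params: float) -> int:
    """Depth implied by a fixed parameter budget (embeddings ignored)."""
    per_layer = attn_params(a) * (1.0 + a.mlp_ratio)
    return max(1, round(n_params / per_layer))


def kv_bytes_per_token(a: Arch, n_params: float, bytes_per_value: int = 2) -> int:
    """KV-cache bytes added per generated token: K and V for every layer."""
    head_dim = a.d_model // a.n_heads
    return 2 * n_layers(a, n_params) * a.n_kv_heads * head_dim * bytes_per_value


def predicted_loss(a: Arch, n_params: float, n_tokens: float) -> float:
    """Hypothetical conditional law: a power-law base term scaled by shape penalties."""
    base = 1.7 + 400.0 / n_params**0.34 + 410.0 / n_tokens**0.28
    penalty = (
        0.01 * math.log(a.mlp_ratio / 2.0) ** 2      # made-up optimum at ratio 2
        + 0.01 * math.log(a.d_model / 2048) ** 2     # made-up optimum at d = 2048
        + 0.002 * math.log(a.n_heads / a.n_kv_heads) # made-up cost of sharing KV heads
    )
    return base * (1.0 + penalty)


def cheapest_arch(n_params: float, n_tokens: float, tolerance: float = 0.01) -> Arch:
    """Among shapes within `tolerance` of the best predicted loss, pick the cheapest to serve."""
    candidates = [
        Arch(d, r, 32, kv)
        for d, r, kv in product((1536, 2048, 3072), (1.0, 2.0, 4.0), (4, 8, 32))
    ]
    losses = {a: predicted_loss(a, n_params, n_tokens) for a in candidates}
    best = min(losses.values())
    admissible = [a for a in candidates if losses[a] <= best * (1.0 + tolerance)]
    return min(admissible, key=lambda a: kv_bytes_per_token(a, n_params))


if __name__ == "__main__":
    choice = cheapest_arch(n_params=1e9, n_tokens=1e11)
    print(choice, kv_bytes_per_token(choice, 1e9))
```

The design point is the two-stage selection: the law filters out shapes that would lose too much accuracy at the given training budget, and the serving-cost proxy picks among the survivors. In practice the penalty terms would be fitted to trained models, and the cost proxy would be replaced with measured throughput on the target hardware.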
That training/inference split is the deeper current running underneath the question. Snell et al. showed that inference compute can substitute for raw parameter count: a smaller model that 'thinks longer' can match a bigger one on some prompts, with the benefit depending on prompt difficulty, which means the two are tradeable, not independent (see: Can inference compute replace scaling up model size?). Conditional scaling laws are essentially the design-time version of that insight: if inference is a first-class cost, then the architecture should be chosen to minimize it, and the scaling law is the tool that lets you do the accounting before you commit.
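The trade can be made concrete with a rough FLOP count. The sketch below uses the common approximation that a forward pass costs about 2 FLOPs per parameter per token, and it ignores the attention term that grows with context length; the function names and the example model sizes are hypothetical.

```python
def forward_flops(n_params: float, n_tokens: float) -> float:
    """Approximate forward-pass cost: about 2 FLOPs per parameter per token."""
    return 2.0 * n_params * n_tokens


def break_even_samples(n_large: float, n_small: float, tokens_per_answer: float) -> int:
    """How many answers the small model can sample for the cost of one large-model answer."""
    budget = forward_flops(n_large, tokens_per_answer)
    per_sample = forward_flops(n_small, tokens_per_answer)
    return int(budget // per_sample)


if __name__ == "__main__":
    # A 1B-parameter model against a 14B-parameter model, 512 tokens per answer.
    print(break_even_samples(14e9, 1e9, 512))  # 14
```

Under these assumptions the small model gets 14 attempts (or 14 times as many reasoning tokens) for the inference cost of one large-model answer. Whether that extra budget closes the accuracy gap is the empirical question, and this arithmetic does not answer it.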
Where hardware enters most concretely is at the device edge, where the bottleneck stops being FLOPs and becomes memory movement. MobileLLM is the sharpest case: on memory-bound mobile chips, recomputing a transformer block twice is cheaper than fetching a second block's weights from memory, so weight-sharing buys accuracy at almost no latency cost. That choice only makes sense once you let the hardware's memory bandwidth dictate the architecture (see: Does recomputing weights cost less than moving them on mobile?). The same work found that for sub-billion-parameter models, deep-and-thin beats wide-and-balanced, directly contradicting the original Kaplan scaling laws, a reminder that scaling laws are conditional on regime and substrate, not universal (see: Does depth matter more than width for tiny language models?).
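A back-of-the-envelope latency model shows why recomputing can beat fetching. The sketch below is a simplification under stated assumptions: latency is the sum of weight-fetch time and compute time, a block's weights stay in on-chip memory between two consecutive passes, and the bandwidth, peak-FLOPs, and block-size figures are hypothetical rather than measurements from any device.

```python
def block_latency(
    params: float,
    passes: int,
    distinct_blocks: int,
    bytes_per_param: int = 2,     # fp16 weights (assumed)
    bandwidth: float = 50e9,      # bytes/s from DRAM (hypothetical)
    peak_flops: float = 2e12,     # FLOP/s (hypothetical)
) -> float:
    """Seconds to decode one token through `passes` block executions.

    Weights are fetched once per distinct block; each pass costs about
    2 FLOPs per parameter.
    """
    fetch = distinct_blocks * params * bytes_per_param / bandwidth
    compute = passes * 2 * params / peak_flops
    return fetch + compute


if __name__ == "__main__":
    p = 10e6  # parameters in one transformer block (hypothetical)
    single = block_latency(p, passes=1, distinct_blocks=1)
    shared = block_latency(p, passes=2, distinct_blocks=1)    # same block run twice
    separate = block_latency(p, passes=2, distinct_blocks=2)  # two different blocks
    print(f"single:   {single * 1e3:.3f} ms")
    print(f"shared:   {shared * 1e3:.3f} ms")
    print(f"separate: {separate * 1e3:.3f} ms")
```

With these numbers the fetch term is roughly 40 times the compute term, so running the same block twice costs only slightly more than running it once and about half as much as running two distinct blocks. On compute-bound hardware the ratio flips and the argument no longer holds, which is the sense in which the architecture choice is conditional on the substrate.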
The corpus also suggests the frontier of 'what to scale' is itself moving off of parameters. One thread argues memory architecture has overtaken parameter count as the primary scaling dimension, with hybrid sparsity laws governing the returns (see: Has memory architecture replaced parameter count as the scaling frontier?). Another finds a U-shaped law where balancing O(1) N-gram lookup against Mixture-of-Experts computation beats spending everything on either, so lookup and compute become complementary axes you allocate across (see: Can lookup memory and computation work together better than either alone?). Both are conditional-scaling-law thinking generalized: the law gains new variables (memory, lookup, sparsity), and the optimal point depends on what the hardware makes cheap.
The thing worth carrying away: a scaling law isn't a law of nature, it's a budget model, and a 'conditional' one simply admits more of reality's knobs (attention configuration, memory bandwidth, lookup vs. compute) into the optimization. The payoff isn't a bigger model; it's a model shaped to the machine it has to run on. If you want to see how far this 'compute as a tunable axis' idea travels, the test-time-scaling taxonomy work shows the same accounting being applied to inference-time search and reasoning rather than to architecture (see: How do internal and external test-time scaling compare?).
Sources (7 notes)
Augmenting scaling laws with hidden size, MLP-to-attention ratio, and GQA configuration enables architecture optimization for inference. Optimized models achieved up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets.
Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, with the size of the benefit depending on prompt difficulty. This reveals that pretraining and inference compute are not independent resources.
MobileLLM shows that on memory-bound mobile hardware, sharing weights between adjacent transformer blocks by computing one block twice incurs less latency than fetching separate weights, gaining accuracy with no parameter increase.
MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
Three converging signals in late-2025 research—taxonomy maturation, memory-aware test-time scaling loops, and hybrid sparsity laws—show that returns from restructuring memory now exceed returns from adding parameters. The design bottleneck has shifted from compute to memory structure.
Engram combines O(1) N-gram lookup with Mixture-of-Experts routing, revealing a U-shaped scaling law where balanced allocation to both mechanisms outperforms either alone. Gains appear largest in reasoning and code rather than pure retrieval.
Research shows test-time scaling methods split into internal (training models for autonomous reasoning) and external (inference-time search and verification). They complement rather than compete; internal builds capability while external extracts performance from existing capability.