INQUIRING LINE


Training a cheap model and running a cheap model are two different problems — and now there's a formula for both.

How do conditional scaling laws incorporate hardware into architecture choices?

This explores a specific 2025 result — adding architecture variables to scaling laws so you can predict and optimize for inference cost on real hardware, not just training compute — and how that idea connects to a broader shift toward hardware- and memory-aware design.

Conditional scaling laws fold hardware realities into architecture decisions. The core move is to add architectural knobs to the usual compute-vs-accuracy curve so you can predict what a model will actually cost to run, not just to train. The clearest demonstration in the corpus augments standard scaling laws with variables like hidden size, the ratio of MLP to attention, and grouped-query attention (GQA) configuration. Once those are in the law, you can search for the architecture that's cheapest to serve at a fixed training budget, yielding up to 2.1% higher accuracy and 42% more throughput than a LLaMA-3.2 baseline trained on the same budget (see "Can architecture choices improve inference efficiency without sacrificing accuracy?"). The key reframing is that training compute and inference compute are not the same resource, and optimizing for one doesn't optimize for the other.
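The search described above can be sketched in a few lines. This is a toy under stated assumptions, not the fitted law from the cited work: the functional form of `toy_loss`, its constants (Chinchilla-style placeholders), the fixed head dimension, the parameter-count formula, and the KV-cache cost proxy are all hypothetical choices made for illustration.

```python
import math
from itertools import product

HEAD_DIM = 64  # ASSUMPTION: fixed attention head dimension

def n_layers(n_params, hidden, mlp_ratio, kv_heads):
    """Layers that fit a parameter budget (embeddings ignored for simplicity)."""
    attn = 2 * hidden * hidden + 2 * hidden * kv_heads * HEAD_DIM  # Q,O + K,V
    mlp = mlp_ratio * attn  # MLP size expressed relative to attention (assumed)
    return max(1, round(n_params / (attn + mlp)))

def toy_loss(n_params, n_tokens, hidden, mlp_ratio, kv_heads):
    """HYPOTHETICAL conditional law: power-law base times architecture penalties."""
    base = 1.7 + 400.0 / n_params**0.34 + 410.0 / n_tokens**0.28
    shape = 0.02 * math.log(hidden / 2048) ** 2 + 0.02 * math.log(mlp_ratio / 2.0) ** 2
    gqa = 0.01 / kv_heads  # assumed: fewer KV heads cost a little accuracy
    return base * (1.0 + shape + gqa)

def kv_bytes_per_token(n_params, hidden, mlp_ratio, kv_heads, bytes_per_value=2):
    """KV-cache bytes per context token: a crude proxy for decode-time memory traffic."""
    layers = n_layers(n_params, hidden, mlp_ratio, kv_heads)
    return 2 * layers * kv_heads * HEAD_DIM * bytes_per_value

def cheapest_architecture(n_params, n_tokens, baseline, tolerance=0.005):
    """Cheapest-to-serve config whose predicted loss stays within tolerance of the baseline."""
    grid = product([1024, 1536, 2048, 3072], [1.0, 2.0, 3.0, 4.0], [2, 4, 8, 16])
    budget = toy_loss(n_params, n_tokens, *baseline) * (1.0 + tolerance)
    feasible = [c for c in grid if toy_loss(n_params, n_tokens, *c) <= budget]
    return min(feasible, key=lambda c: kv_bytes_per_token(n_params, *c))

baseline = (2048, 2.0, 8)  # hypothetical (hidden, mlp_ratio, kv_heads) reference config
best = cheapest_architecture(1e9, 2e10, baseline)
print("baseline:", baseline, "cheapest within tolerance:", best)
```

The point of the sketch is the shape of the procedure: once the law takes architecture variables as inputs, choosing an architecture becomes a constrained search over predicted loss and a serving-cost proxy, done before any training run is paid for.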

That training/inference split is the deeper current running underneath the question. Snell et al. showed inference compute can substitute for raw parameter count: a smaller model that 'thinks longer' can match a bigger one on hard prompts, which means the two are tradeable, not independent (see "Can inference compute replace scaling up model size?"). Conditional scaling laws are essentially the design-time version of that insight: if inference is a first-class cost, then the architecture should be chosen to minimize it, and the scaling law is the tool that lets you do the accounting before you commit.

Where hardware enters most concretely is at the device edge, where the bottleneck stops being FLOPs and becomes memory movement. MobileLLM is the sharpest case: on memory-bound mobile chips, recomputing a transformer block twice is cheaper than fetching a second block's weights from memory, so weight-sharing buys accuracy at no latency cost. That choice only makes sense once you let the hardware's memory bandwidth dictate the architecture (see "Does recomputing weights cost less than moving them on mobile?"). The same work found that for sub-billion-parameter models, deep-and-thin beats wide-and-balanced, contradicting the view from the original Kaplan et al. scaling laws that model shape has little effect. It is a reminder that scaling laws are conditional on regime and substrate, not universal (see "Does depth matter more than width for tiny language models?").
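The recompute-versus-fetch argument can be checked with back-of-the-envelope arithmetic. The sketch below uses a simple additive latency model for a single decoded token. The device numbers, the block size, the figure of roughly 2 FLOPs per weight, and above all the assumption that a shared block's weights stay in on-chip memory between its two executions are illustrative assumptions, not measurements from MobileLLM.

```python
def block_times(params, peak_flops, bandwidth, bytes_per_param=2):
    """Return (compute_seconds, fetch_seconds) for one block and one token."""
    compute_s = 2 * params / peak_flops             # ~2 FLOPs per weight (multiply + add)
    fetch_s = params * bytes_per_param / bandwidth  # stream the weights from DRAM once
    return compute_s, fetch_s

def two_pass_latency(params, peak_flops, bandwidth, shared):
    """Latency of two consecutive block executions under a simple additive model."""
    compute_s, fetch_s = block_times(params, peak_flops, bandwidth)
    # ASSUMPTION: with sharing, the weights stay on-chip for the second pass.
    fetches = 1 if shared else 2
    return fetches * fetch_s + 2 * compute_s

# Hypothetical mobile-class device: 1 TFLOP/s peak compute, 50 GB/s memory bandwidth.
PARAMS, FLOPS, BW = 10_000_000, 1e12, 50e9
shared = two_pass_latency(PARAMS, FLOPS, BW, shared=True)
distinct = two_pass_latency(PARAMS, FLOPS, BW, shared=False)
print(f"shared: {shared * 1e6:.0f} us, distinct: {distinct * 1e6:.0f} us")
```

Under these assumed numbers the weight fetch dominates the compute by a wide margin, so running the same block a second time adds little to latency, while a second distinct block roughly doubles it. On a compute-bound device the comparison would flip, which is why the choice is conditional on the hardware.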

The corpus also suggests the frontier of 'what to scale' is itself moving off of parameters. One thread argues memory architecture has overtaken parameter count as the primary scaling dimension, with hybrid sparsity laws governing the returns (see "Has memory architecture replaced parameter count as the scaling frontier?"). Another finds a U-shaped law where balancing O(1) N-gram lookup against Mixture-of-Experts computation beats spending everything on either, so lookup and compute become complementary axes you allocate across (see "Can lookup memory and computation work together better than either alone?"). Both are conditional-scaling-law thinking generalized: the law gains new variables (memory, lookup, sparsity), and the optimal point depends on what the hardware makes cheap.
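A U-shaped allocation curve falls out naturally whenever two mechanisms each show diminishing returns in their share of a fixed budget. The toy below illustrates only that qualitative shape. The loss function, its constants, and the resulting location of the minimum are hypothetical and are not taken from the Engram results.

```python
def toy_allocation_loss(lookup_frac, a=1.0, b=1.0, alpha=0.5, floor=0.05):
    """HYPOTHETICAL loss when a fixed budget is split between lookup and compute.

    Each mechanism contributes a term that shrinks with its budget share,
    so starving either one is expensive and the curve is U-shaped.
    """
    compute_frac = 1.0 - lookup_frac
    return a / (floor + lookup_frac) ** alpha + b / (floor + compute_frac) ** alpha

def best_split(steps=101):
    """Grid-search the lookup share in [0, 1] that minimizes the toy loss."""
    fracs = [i / (steps - 1) for i in range(steps)]
    return min(fracs, key=toy_allocation_loss)

split = best_split()
print(f"best lookup share in this toy: {split:.2f}")
print(f"all-lookup loss: {toy_allocation_loss(1.0):.3f}, "
      f"all-compute loss: {toy_allocation_loss(0.0):.3f}, "
      f"balanced loss: {toy_allocation_loss(split):.3f}")
```

Changing `a`, `b`, or `alpha` moves the minimum, which mirrors the point of the paragraph: where the optimum sits depends on how cheap each mechanism is on the target hardware.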

The thing worth carrying away: a scaling law isn't a law of nature, it's a budget model, and a 'conditional' one simply admits more of reality's knobs (attention configuration, memory bandwidth, lookup vs. compute) into the optimization. The payoff isn't a bigger model; it's a model shaped to the machine it has to run on. If you want to see how far this 'compute as a tunable axis' idea travels, the test-time-scaling taxonomy work shows the same accounting being applied to inference-time search and reasoning rather than to architecture (see "How do internal and external test-time scaling compare?").

Sources 7 notes

Can architecture choices improve inference efficiency without sacrificing accuracy?

Augmenting scaling laws with hidden size, MLP-to-attention ratio, and GQA configuration enables architecture optimization for inference. Optimized models achieved up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets.

Can inference compute replace scaling up model size?

Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.

Does recomputing weights cost less than moving them on mobile?

MobileLLM shows that on memory-bound mobile hardware, sharing weights between adjacent transformer blocks by recomputing one block twice uses less latency than fetching separate weights, gaining accuracy with no parameter increase.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Has memory architecture replaced parameter count as the scaling frontier?

Three converging signals in late-2025 research—taxonomy maturation, memory-aware test-time scaling loops, and hybrid sparsity laws—show that returns from restructuring memory now exceed returns from adding parameters. The design bottleneck has shifted from compute to memory structure.


Can lookup memory and computation work together better than either alone?

Engram combines O(1) N-gram lookup with Mixture-of-Experts routing, revealing a U-shaped scaling law where balanced allocation to both mechanisms outperforms either alone. Gains appear largest in reasoning and code rather than pure retrieval.

How do internal and external test-time scaling compare?

Research shows test-time scaling methods split into internal (training models for autonomous reasoning) and external (inference-time search and verification). They complement rather than compete; internal builds capability while external extracts performance from existing capability.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst evaluating whether conditional scaling laws—which fold hardware constraints into architecture optimization—remain predictive or have been superseded by newer training/inference paradigms. The question: do scaling laws that incorporate architectural variables (hidden size, MLP/attention ratio, GQA) still govern hardware-efficient design, or have recent advances in test-time compute, reasoning models, and agent orchestration dissolved the architecture-level constraint?

What a curated library found — and when (dated claims, not current truth):
- Conditional scaling laws augmented with architectural knobs predict inference cost; optimal architecture can yield 2.1% accuracy gain and 42% throughput over LLaMA-3.2 baseline at fixed training budget (arXiv:2510.18245, ~2025–2026).
- On memory-bound devices, weight-sharing and deep-thin architectures beat wide-balanced designs; training and inference compute are distinct resources requiring separate optimization (arXiv:2402.14905, 2024).
- Memory architecture and lookup-versus-compute are emerging as primary scaling axes alongside parameters; sparsity laws now govern returns (arXiv:2601.07372, arXiv:2602.xxxx, ~2026).
- Test-time compute (reasoning depth, search, reflection) can substitute for model parameters, reframing the optimization frontier away from static architecture toward dynamic allocation (arXiv:2502.05171, arXiv:2507.01951, 2025).

Anchor papers (verify; mind their dates):
- arXiv:2402.14905 (MobileLLM, 2024): hardware-aware sub-billion-parameter design.
- arXiv:2510.18245 (Scaling Laws Meet Model Architecture, 2025–2026): conditional laws with architectural variables.
- arXiv:2601.07372 (Conditional Memory via Lookup, 2026): memory and sparsity as scaling dimensions.
- arXiv:2502.05171 (Test-Time Scaling, 2025): inference-time reasoning as parameter substitute.

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above, ask: (a) have newer models (o1, o3, Claude 3.5+) with test-time reasoning made the static architecture choice less binding? (b) Do recent agentic and multi-turn frameworks (retrieval, memory caching, tool use) reframe what 'architecture' means? (c) Has a new evaluation harness or training recipe (e.g., RL for reasoning, on-device inference SDKs) relaxed the memory-bandwidth bottleneck? Plainly state where each constraint still holds and what dissolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—papers that argue parameter scaling, instruction-tuning, or orchestration dynamics have trumped hardware-conditional architecture design.
(3) Propose 2 research questions that ASSUME the regime may have shifted: e.g., "Do scaling laws for agentic inference (where tool-use and retrieval dominate) behave differently than those for dense forward-pass?" and "Does fine-tuning or RL post-training break the hardware-architecture coupling found in base-model scaling?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
