INQUIRING LINE

Two models trained on the same compute budget can differ by up to 42% in inference throughput: the blueprint matters more than the budget.

Why does architecture matter more than training compute for inference efficiency?

This explores why how a model is built (its architecture and the reasoning protocol baked in during training) can shape inference efficiency more decisively than simply throwing more compute at it — whether at train time or test time.

The cleanest case for the claim comes from rewriting the scaling laws themselves. When architectural variables such as hidden size, the MLP-to-attention ratio, and the grouped-query attention (GQA) configuration are folded into the prediction, the optimized models reach up to 42% greater throughput and up to 2.1% higher accuracy than LLaMA-3.2 under the *same* training budget (Can architecture choices improve inference efficiency without sacrificing accuracy?). Compute didn't change; the shape did. The same lesson shows up at the small end, where MobileLLM finds that deep-and-thin networks gain 2.7–4.3% accuracy over balanced designs at the 125M–350M scale, because stacking layers lets the model compose abstract concepts rather than just spreading parameters across width (Does depth matter more than width for tiny language models?).
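To make the architectural variables concrete, here is a minimal sketch of how two of them change inference cost at a fixed depth and hidden size. The `DecoderShape` class and every numeric value in it are hypothetical illustrations, not configurations from the cited work. The formulas are the standard back-of-envelope counts for a decoder-only transformer with a plain two-matrix MLP, ignoring biases and embeddings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderShape:
    """Hypothetical decoder-only transformer shape (illustrative values only)."""
    n_layers: int
    hidden_size: int
    n_heads: int
    n_kv_heads: int   # equals n_heads for standard multi-head attention
    mlp_ratio: float  # MLP intermediate size divided by hidden size

    @property
    def head_dim(self) -> int:
        return self.hidden_size // self.n_heads

    def kv_cache_bytes_per_token(self, bytes_per_value: int = 2) -> int:
        # One key vector and one value vector per KV head, per layer.
        return 2 * self.n_layers * self.n_kv_heads * self.head_dim * bytes_per_value

    def attention_params_per_layer(self) -> int:
        # Q and output projections are hidden x hidden; K and V shrink with n_kv_heads.
        kv_width = self.n_kv_heads * self.head_dim
        return 2 * self.hidden_size * self.hidden_size + 2 * self.hidden_size * kv_width

    def mlp_params_per_layer(self) -> int:
        # Up and down projections of a plain two-matrix MLP, biases ignored.
        return 2 * self.hidden_size * int(self.mlp_ratio * self.hidden_size)


# Same depth and width; only the number of KV heads differs.
mha = DecoderShape(n_layers=16, hidden_size=2048, n_heads=32, n_kv_heads=32, mlp_ratio=4.0)
gqa = DecoderShape(n_layers=16, hidden_size=2048, n_heads=32, n_kv_heads=8, mlp_ratio=4.0)

for name, shape in (("MHA", mha), ("GQA", gqa)):
    ratio = shape.mlp_params_per_layer() / shape.attention_params_per_layer()
    print(name, shape.kv_cache_bytes_per_token(), "bytes/token,",
          f"MLP-to-attention parameter ratio {ratio:.2f}")
```

In this toy comparison, cutting the KV heads from 32 to 8 shrinks the per-token KV cache by 4x without changing depth or width. That is the kind of lever a compute-only scaling law cannot see.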

The deeper reason architecture wins is that compute is only as useful as the structure that spends it. A non-reasoning model still can't catch a reasoning model, however large its inference budget, because training instills a protocol that makes each extra token productive. The gap is about deployment mechanism and training structure, not raw capability (Can non-reasoning models catch up with more compute?). So "more compute" is not a free lever: without the right learned structure, the tokens are wasted. That's also why test-time compute can substitute for parameter scaling on hard prompts in the first place: pretraining and inference compute aren't independent resources, and they trade against each other through the model's structure (Can inference compute replace scaling up model size?).

Where it gets interesting is that some architectures break ceilings that no amount of compute can buy past. Fixed-depth transformers sit inside a complexity class (the AC0/TC0 ceiling) that caps what they can compute per forward pass. The Hierarchical Reasoning Model (HRM), a hierarchical recurrent model with just 27M parameters trained on 1,000 samples, escapes that ceiling by running fast detailed computation under slow abstract planning, and reaches near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely (Can recurrent hierarchies achieve reasoning that transformers cannot?). Energy-Based Transformers make a related move, treating inference as gradient-descent energy minimization, and gain 29% more from inference compute than the Transformer++ baseline (Can energy minimization unlock reasoning without domain-specific training?). The architecture changes what a unit of compute can even reach.
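The idea of inference as energy minimization can be illustrated with a deliberately tiny sketch. The quadratic `energy` function below is a hand-written stand-in for a learned energy, and the function names and step size are assumptions made for illustration. It is not the Energy-Based Transformer itself, only the refinement loop the paragraph above describes: start from a guess and descend the energy, with the number of steps acting as the inference-compute knob.

```python
def energy(x: float, y: float) -> float:
    """Toy energy: low when y is close to 3*x + 1 (a stand-in for a learned energy)."""
    return (y - (3.0 * x + 1.0)) ** 2


def energy_grad_y(x: float, y: float) -> float:
    """Analytic gradient of the toy energy with respect to the prediction y."""
    return 2.0 * (y - (3.0 * x + 1.0))


def infer(x: float, steps: int, lr: float = 0.1, y_init: float = 0.0) -> float:
    """Refine a prediction by gradient descent on the energy; steps is the compute knob."""
    y = y_init
    for _ in range(steps):
        y -= lr * energy_grad_y(x, y)
    return y


x = 2.0
for steps in (1, 5, 20):
    y = infer(x, steps)
    print(f"steps={steps:2d}  prediction={y:.4f}  energy={energy(x, y):.6f}")
```

Each extra step buys a lower-energy prediction, so the amount of compute per answer becomes a dial rather than a constant fixed by the depth of a single forward pass.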

The efficiency story isn't only about the network graph, though. It's also about *when* and *how* compute is spent, which is itself an architectural choice. Allocating the same inference budget adaptively (less on easy prompts, more on hard ones) beats a bigger model spending uniformly (Can we allocate inference compute based on prompt difficulty?), and a model can be trained to route itself between extended reasoning and direct answers rather than always paying the full cost (Can models learn when to think versus respond quickly?). Even the reasoning trace can be slimmed: likelihood-preserving pruning shows that models implicitly rank tokens by functional importance, so removing grammar and meta-discourse while keeping the symbolic computation yields students that beat ones trained on chains compressed by a frontier model (Which tokens in reasoning chains actually matter most?).
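A toy calculation shows why adaptive allocation can win. The sketch below assumes a best-of-N setting in which each sample on a prompt succeeds independently with a known probability. The probabilities, the budget, and the function names are all hypothetical, and real systems have to estimate difficulty rather than read it off. It compares one model under two allocation policies, so it illustrates the mechanism only and not the comparison against a larger model.

```python
import heapq


def solve_prob(p: float, n: int) -> float:
    """Chance that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p) ** n


def uniform_allocation(probs: list[float], budget: int) -> list[int]:
    """Give every prompt the same number of samples (any remainder goes unused)."""
    return [budget // len(probs)] * len(probs)


def adaptive_allocation(probs: list[float], budget: int) -> list[int]:
    """Greedily hand each sample to the prompt with the largest marginal gain."""
    counts = [0] * len(probs)
    # Max-heap via negated gains; the gain of sample n+1 is p * (1 - p) ** n.
    heap = [(-p, i) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    for _ in range(budget):
        _, i = heapq.heappop(heap)
        counts[i] += 1
        next_gain = probs[i] * (1.0 - probs[i]) ** counts[i]
        heapq.heappush(heap, (-next_gain, i))
    return counts


def expected_solved(probs: list[float], counts: list[int]) -> float:
    """Expected number of prompts solved under a given allocation."""
    return sum(solve_prob(p, n) for p, n in zip(probs, counts))


# Hypothetical per-sample success rates: two easy prompts, two hard ones.
probs = [0.9, 0.8, 0.2, 0.1]
budget = 16
uniform = uniform_allocation(probs, budget)
adaptive = adaptive_allocation(probs, budget)
print("uniform ", uniform, round(expected_solved(probs, uniform), 4))
print("adaptive", adaptive, round(expected_solved(probs, adaptive), 4))
```

Because each additional sample on a prompt has a smaller marginal gain than the one before it, the greedy policy moves samples away from easy prompts, where they are nearly redundant, toward hard prompts, where they still change the outcome.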

The thread running through all of these: compute is a quantity, architecture is a structure, and structure decides the exchange rate. A frozen backbone paired with a small auxiliary model that generates soft thoughts keeps its pre-trained knowledge while gaining continuous reasoning (Can continuous reasoning avoid forgetting in instruction-tuned models?), and small models fine-tuned on the *right* signal, preference pairs built from a large teacher's correct and incorrect outputs, reach high accuracy on logical and mathematical function-calling tasks (Can small models match large models on function calling?). In every case the win comes not from spending more, but from spending into a shape that converts each token of compute into more useful work.

Sources 11 notes

Can architecture choices improve inference efficiency without sacrificing accuracy?

Augmenting scaling laws with hidden size, MLP-to-attention ratio, and GQA configuration enables architecture optimization for inference. Optimized models achieved up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Can inference compute replace scaling up model size?

Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.

Can recurrent hierarchies achieve reasoning that transformers cannot?

The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.

Can energy minimization unlock reasoning without domain-specific training?

Energy-Based Transformers assign energy values to input-prediction pairs and use gradient descent minimization for inference, yielding 35% higher training scaling rates and 29% more inference-compute gains than Transformer++, while generalizing better on out-of-distribution data without domain-specific scaffolding.

Can we allocate inference compute based on prompt difficulty?

Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Can continuous reasoning avoid forgetting in instruction-tuned models?

SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

When More Thinking Hurts: Overthinking in LLM Test-Time Compute Scaling (3.34 match, arXiv)
Hierarchical Reasoning Model (2.58 match, arXiv)
Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs (2.58 match, arXiv)
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity (2.54 match, arXiv)
Reasoning Models Can Be Effective Without Thinking (2.54 match, arXiv)
Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking (2.49 match, arXiv)
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning (1.70 match, arXiv)
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training (1.69 match, arXiv)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tracking inference-efficiency claims in LLM architecture. The question remains open: does architectural choice genuinely decouple inference efficiency from training compute, or have newer models/methods blurred that boundary?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat them as perishable snapshots:
• Rewriting scaling laws to include architecture (hidden size, MLP–attention ratio, GQA) yields 42% speedup + 2.1% accuracy gain under identical training budget (~2025).
• Deep-and-thin networks outperform wide ones at 125M–350M parameters; hierarchical recurrence with 27M params escapes AC0/TC0 computational ceiling, solving tasks where chain-of-thought fails (~2025–2026).
• Energy-based transformers achieve 29% inference-compute gains via gradient-descent energy minimization; adaptive per-prompt budget allocation and learned reasoning routing (thinking vs. quick answer) beat uniform spending (~2025).
• Reasoning traces can be pruned by ranking tokens' functional importance; small models trained on preference pairs (with negatives) match much larger ones on structured tasks (~2025–2026).
• Soft chain-of-thought delegates continuous thought to frozen backbones (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2510.18245 — Scaling Laws Meet Model Architecture (2025).
• arXiv:2507.02092 — Energy-Based Transformers (2025).
• arXiv:2506.21734 — Hierarchical Reasoning Model (2025).
• arXiv:2601.03066 — Functional Importance of Reasoning Tokens (2026).

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 42% speedup claim and deep-vs-wide trade-off: have subsequent scaling laws, dense-vs-sparse methods, or newer training paradigms (e.g., RL on test-time compute) since relaxed or *recombined* the efficiency–accuracy frontier? Separate durable insight (architecture constrains what compute can reach) from perishable empirical claim (specific speedup numbers). Does adaptive routing or pruning still hold across frontier scales, or does it break under distribution shift?
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Have any papers argued architecture is a *proxy* for training data quality, sample efficiency, or implicit regularization—collapsing the architecture/compute distinction?
(3) Propose 2 research questions assuming the regime may have moved: (a) Can unified scaling laws now predict inference efficiency *without* explicit architectural parameters, by folding structure into learned representations? (b) Does the architectural ceiling (e.g., AC0/TC0) dissolve under very long-horizon training or multi-agent composition?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
