Why does architecture matter more than training compute for inference efficiency?
This note explores why the way a model is built (its architecture and the reasoning protocol instilled during training) can govern inference efficiency more than raw compute, whether that compute is spent on training or at test time. The cleanest case comes from rewriting the scaling laws themselves: when architectural variables such as hidden size, the MLP-to-attention ratio, and the grouped-query attention (GQA) configuration are folded into the prediction, the optimized models reach up to 42% higher throughput and up to 2.1% higher accuracy than LLaMA-3.2 under the *same* training budget (see "Can architecture choices improve inference efficiency without sacrificing accuracy?"). Compute didn't change; the shape did. The same lesson shows up at the small end, where MobileLLM finds that deep-and-thin networks gain 2.7–4.3% accuracy over balanced designs at the 125M–350M scale, because stacking layers lets the model compose abstract concepts rather than just spreading parameters across width (see "Does depth matter more than width for tiny language models?").
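To make "compute didn't change; the shape did" concrete, here is a minimal accounting sketch. It assumes a plain decoder-only transformer with a two-matrix MLP and ignores embeddings, biases, and norms. The `Shape` class, the helper names, and all three configurations are hypothetical illustrations, not the models from the cited notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    """Hypothetical decoder-only transformer shape (illustrative values only)."""
    n_layers: int
    d_model: int
    n_heads: int
    n_kv_heads: int   # fewer KV heads than query heads means grouped-query attention
    mlp_ratio: float  # MLP hidden size divided by d_model

    @property
    def head_dim(self) -> int:
        return self.d_model // self.n_heads


def block_params(s: Shape) -> int:
    """Approximate non-embedding parameter count (no biases, no norms)."""
    kv_dim = s.n_kv_heads * s.head_dim
    attn = 2 * s.d_model * s.d_model + 2 * s.d_model * kv_dim  # Q and O, then K and V
    mlp = int(2 * s.d_model * s.mlp_ratio * s.d_model)         # up and down projections
    return s.n_layers * (attn + mlp)


def kv_cache_bytes(s: Shape, seq_len: int, bytes_per_value: int = 2) -> int:
    """KV-cache size for one sequence: keys and values at every layer and position."""
    return 2 * s.n_layers * s.n_kv_heads * s.head_dim * seq_len * bytes_per_value


# Three assumed shapes chosen so the parameter counts come out equal.
mha = Shape(n_layers=16, d_model=1024, n_heads=16, n_kv_heads=16, mlp_ratio=4.0)
gqa = Shape(n_layers=16, d_model=1024, n_heads=16, n_kv_heads=4, mlp_ratio=4.75)
deep_thin = Shape(n_layers=64, d_model=512, n_heads=8, n_kv_heads=8, mlp_ratio=4.0)

for name, s in [("mha", mha), ("gqa", gqa), ("deep_thin", deep_thin)]:
    print(name, block_params(s), kv_cache_bytes(s, seq_len=4096))
```

Under this accounting all three shapes have the same parameter count, yet the GQA shape needs a quarter of the KV cache of the multi-head one because the parameters saved on K and V projections were moved into a wider MLP. The sketch says nothing about accuracy; it only shows that a fixed parameter budget leaves the inference-time footprint wide open.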
The deeper reason architecture wins is that compute is only as useful as the structure that spends it. A non-reasoning model still can't catch a reasoning model regardless of how much inference budget it is handed, because training instills a protocol that makes each extra token productive; the gap comes from deployment mechanism and training structure, not raw capability (see "Can non-reasoning models catch up with more compute?"). So "more compute" is not a free lever: without the right learned structure, the tokens are wasted. That's also why test-time compute can substitute for parameter scaling on hard prompts in the first place: pretraining and inference compute aren't independent resources, they trade against each other through the model's structure (see "Can inference compute replace scaling up model size?").
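The trade between parameters and test-time tokens can be put in rough numbers. The sketch below assumes the common approximation of about 2 FLOPs per parameter per generated token and ignores attention cost over the context. The function names and the model sizes are hypothetical.

```python
def forward_flops(n_params: int, n_tokens: int) -> int:
    """Approximate inference FLOPs, assuming ~2 FLOPs per parameter per token."""
    return 2 * n_params * n_tokens


def matched_token_budget(large_params: int, large_tokens: int, small_params: int) -> int:
    """Tokens a smaller model can generate for the same FLOPs as the larger one."""
    return forward_flops(large_params, large_tokens) // (2 * small_params)


# Assumed sizes: a 14B-parameter model answering in 256 tokens versus a
# 1B-parameter model given the same inference FLOPs.
small_budget = matched_token_budget(
    large_params=14_000_000_000,
    large_tokens=256,
    small_params=1_000_000_000,
)
print(small_budget)  # 3584 tokens for the small model
```

The arithmetic only sets the budget. Whether the smaller model turns those extra tokens into better answers depends on the learned protocol, which is the point of the paragraph above.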
Where it gets interesting is that some architectures break ceilings that no amount of compute can buy past. Fixed-depth transformers sit inside a complexity class (the AC0/TC0 ceiling) that caps what they can compute per forward pass; a hierarchical recurrent model with just 27M parameters escapes it by running fast detailed computation under slow abstract planning, reaching near-perfect performance on Sudoku and mazes where chain-of-thought fails outright (see "Can recurrent hierarchies achieve reasoning that transformers cannot?"). Energy-based transformers make a related move, treating inference as energy minimization by gradient descent, and gain 29% more from inference compute than the Transformer++ baseline (see "Can energy minimization unlock reasoning without domain-specific training?"). The architecture changes what a unit of compute can even reach.
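The idea of inference as energy minimization can be illustrated with a toy. The sketch below is not the Energy-Based Transformer itself: it assumes a hand-written quadratic energy whose minimum is known in closed form, whereas a real model would learn the energy and differentiate it with respect to the candidate prediction. All names and the target map are hypothetical.

```python
from typing import List, Sequence, Tuple


def target(x: Sequence[float]) -> List[float]:
    """Stand-in for a learned map (assumed for illustration): 2 * x + 1."""
    return [2.0 * xi + 1.0 for xi in x]


def energy(x: Sequence[float], y: Sequence[float]) -> float:
    """Toy energy: low when the candidate prediction y is compatible with x."""
    return 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, target(x)))


def energy_grad(x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Gradient of the toy energy with respect to the prediction y."""
    return [yi - ti for yi, ti in zip(y, target(x))]


def infer(x: Sequence[float], steps: int, lr: float = 0.5) -> Tuple[List[float], List[float]]:
    """Refine a prediction by gradient descent on the energy.

    Returns the final prediction and the energy after each step, so the
    effect of spending more inference steps is visible.
    """
    y = [0.0] * len(x)          # start from an uninformed guess
    trace = [energy(x, y)]
    for _ in range(steps):
        grad = energy_grad(x, y)
        y = [yi - lr * gi for yi, gi in zip(y, grad)]
        trace.append(energy(x, y))
    return y, trace


prediction, energies = infer([1.0, -2.0], steps=20)
print(prediction, energies[0], energies[-1])
```

Each extra step is an extra unit of inference compute, and the energy trace shows what that unit buys. In this toy the answer is also available in one shot, so the example only demonstrates the mechanism, not the reported gains.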
The efficiency story isn't only about the network graph, though. It's also about *when* and *how* compute is spent, which is itself an architectural choice. Allocating inference budget adaptively (less on easy prompts, more on hard ones) beats a bigger model spending uniformly (see "Can we allocate inference compute based on prompt difficulty?"), and a model can be trained to route itself between extended reasoning and direct answers rather than always paying the full cost (see "Can models learn when to think versus respond quickly?"). Even the reasoning trace can be slimmed: likelihood-preserving pruning shows that models treat tokens as unequally important, dropping grammar and meta-discourse first while keeping the symbolic computation, and students trained on the pruned chains beat ones trained on chains compressed by a frontier model (see "Which tokens in reasoning chains actually matter most?").
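Adaptive allocation can be sketched as splitting a fixed sample budget in proportion to estimated difficulty. This is a minimal illustration, not the method from the cited note: it assumes difficulty scores are already available from some estimator, and the function name and the proportional rule with a per-prompt floor are hypothetical choices.

```python
from typing import List, Sequence


def allocate_budget(difficulties: Sequence[float], total_samples: int, floor: int = 1) -> List[int]:
    """Split total_samples across prompts in proportion to difficulty.

    Every prompt gets at least `floor` samples; the remainder is shared
    proportionally, with leftover samples going to the largest fractional shares.
    """
    n = len(difficulties)
    if total_samples < floor * n:
        raise ValueError("budget too small to give every prompt the floor")
    spare = total_samples - floor * n
    total_difficulty = sum(difficulties)
    if total_difficulty == 0:
        shares = [spare / n] * n                     # no signal: fall back to uniform
    else:
        shares = [spare * d / total_difficulty for d in difficulties]
    alloc = [floor + int(s) for s in shares]
    leftover = total_samples - sum(alloc)
    by_fraction = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in by_fraction[:leftover]:
        alloc[i] += 1
    return alloc


# Assumed difficulty scores for an easy, a medium, and a hard prompt.
plan = allocate_budget([1, 2, 7], total_samples=16)
print(plan)  # [2, 4, 10]
```

A uniform policy would give each of the three prompts about five samples. The adaptive plan spends the same total but concentrates it where a single sample is least likely to be enough.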
The thread running through all of these: compute is a quantity, architecture is a structure, and structure decides the exchange rate. A frozen backbone with a small auxiliary model that generates soft thoughts keeps its pre-trained knowledge while gaining continuous reasoning (see "Can continuous reasoning avoid forgetting in instruction-tuned models?"), and small models trained on the *right* signal (preference pairs that surface negative examples) match much larger ones on structured tasks (see "Can small models match large models on function calling?"). In every case the win comes not from spending more, but from spending into a shape that converts each token of compute into more useful work.
Sources (11 notes)
Augmenting scaling laws with hidden size, MLP-to-attention ratio, and GQA configuration enables architecture optimization for inference. Optimized models achieved up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets.
MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.
The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.
Energy-Based Transformers assign energy values to input-prediction pairs and use gradient descent minimization for inference, yielding 35% higher training scaling rates and 29% more inference-compute gains than Transformer++, while generalizing better on out-of-distribution data without domain-specific scaffolding.
Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.
Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.