Can architecture choices improve inference efficiency without sacrificing accuracy?
Standard scaling laws optimize training efficiency but ignore inference cost. This note explores whether architectural variables such as hidden size and attention configuration can unlock inference gains without sacrificing accuracy under a fixed training budget.
Standard scaling laws (Chinchilla) optimize the trade-off between model parameters and training data for a fixed training compute budget. They say nothing about inference cost. But as LLMs move from research to deployment, inference cost dominates — and architecture choices affect inference efficiency in ways that parameter count alone does not predict.
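To make "says nothing about inference cost" concrete, here is a minimal sketch of the Chinchilla parametric loss. The coefficients are the fitted values commonly quoted from the Chinchilla paper, not numbers from this note, and the 6ND FLOPs rule is the usual approximation; treat both as illustrative assumptions. The function names are hypothetical.

```python
def chinchilla_loss(n_params: float, n_tokens: float,
                    E: float = 1.69, A: float = 406.4, B: float = 410.7,
                    alpha: float = 0.34, beta: float = 0.28) -> float:
    """Parametric loss L(N, D) = E + A / N**alpha + B / D**beta.

    Coefficients are the commonly quoted Chinchilla fits (an assumption here).
    Note the inputs: parameter count and token count only. Hidden size,
    MLP-to-attention ratio, and GQA configuration do not appear.
    """
    return E + A / n_params**alpha + B / n_tokens**beta


def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute, using the standard C ~ 6 * N * D rule."""
    return 6.0 * n_params * n_tokens


if __name__ == "__main__":
    # Two hypothetical models with identical N and D get identical predictions,
    # however their parameters are split between attention and MLP layers.
    n, d = 1e9, 2e10
    print(f"predicted loss: {chinchilla_loss(n, d):.3f}")
    print(f"training FLOPs: {train_flops(n, d):.2e}")
```

Because the formula sees only N and D, any two architectures with the same totals are indistinguishable to it, which is the gap the conditional law addresses.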
The conditional scaling law augments Chinchilla by conditioning on three architectural variables: hidden size, the ratio of MLP parameters to attention parameters, and grouped-query attention (GQA) configuration. These variables affect inference throughput independently of their effect on accuracy. A model with the same parameter count and training budget can have dramatically different inference costs depending on how those parameters are allocated between MLP and attention layers.
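The allocation point can be sketched with simple parameter accounting. The example below assumes a LLaMA-style transformer block (bias-free projections, gated three-matrix MLP) and ignores embeddings and norms; the class name and both configurations are hypothetical, chosen so the totals match. It does not reproduce the paper's law, only the bookkeeping behind it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockConfig:
    """One transformer block; a LLaMA-style layout is assumed."""
    hidden_size: int
    n_heads: int
    n_kv_heads: int   # equals n_heads for MHA, smaller for GQA
    head_dim: int
    ffn_size: int

    def attention_params(self) -> int:
        q = self.hidden_size * self.n_heads * self.head_dim
        kv = 2 * self.hidden_size * self.n_kv_heads * self.head_dim
        o = self.n_heads * self.head_dim * self.hidden_size
        return q + kv + o

    def mlp_params(self) -> int:
        # Gated MLP: gate, up, and down projections.
        return 3 * self.hidden_size * self.ffn_size

    def total_params(self) -> int:
        return self.attention_params() + self.mlp_params()

    def mlp_to_attention_ratio(self) -> float:
        return self.mlp_params() / self.attention_params()

    def kv_cache_bytes_per_token(self, bytes_per_value: int = 2) -> int:
        # Keys and values cached per token, per layer (2 bytes ~ fp16/bf16).
        return 2 * self.n_kv_heads * self.head_dim * bytes_per_value


# Hypothetical configurations with the same per-block parameter count.
mha = BlockConfig(hidden_size=2048, n_heads=32, n_kv_heads=32, head_dim=64, ffn_size=5632)
gqa = BlockConfig(hidden_size=2048, n_heads=32, n_kv_heads=8, head_dim=64, ffn_size=6656)

if __name__ == "__main__":
    for name, cfg in (("MHA", mha), ("GQA", gqa)):
        print(name, cfg.total_params(),
              round(cfg.mlp_to_attention_ratio(), 2),
              cfg.kv_cache_bytes_per_token())
```

Both blocks hold the same number of parameters, yet the GQA variant caches a quarter of the key/value bytes per token. KV-cache size is one driver of decode-time memory traffic, so equal parameter counts need not imply equal inference cost.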
The law was validated empirically across 200+ models (80M to 3B parameters, 8B to 100B training tokens): optimized architectures achieve up to 2.1% higher accuracy AND 42% greater inference throughput than LLaMA-3.2 under the same training budget. The "and" is the key finding: accuracy and inference efficiency are not zero-sum when architecture is treated as a free variable. Suboptimal architectures sacrifice both at once.
This adds a third optimization lever to the inference compute landscape. The note "Can inference compute replace scaling up model size?" establishes the training-inference compute trade-off, and "Can we allocate inference compute based on prompt difficulty?" establishes adaptive allocation. Architecture optimization sits upstream of both: it determines the baseline efficiency at which every unit of inference compute converts to performance. A 42% throughput improvement means the same inference budget produces 42% more reasoning attempts, parallel samples, or search steps.
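The budget arithmetic can be sketched as follows. The baseline throughput, time budget, and tokens per attempt are made-up illustrative values; only the 1.42x factor comes from the note, and the function name is hypothetical.

```python
def attempts_per_budget(tokens_per_second: float,
                        budget_seconds: float,
                        tokens_per_attempt: int) -> int:
    """Whole reasoning attempts that fit in a fixed wall-clock budget."""
    return int(tokens_per_second * budget_seconds // tokens_per_attempt)


if __name__ == "__main__":
    baseline_tps = 1000.0                # hypothetical baseline throughput
    optimized_tps = baseline_tps * 1.42  # 42% higher throughput, per the note
    budget_s, tokens_each = 60.0, 500    # hypothetical budget and attempt length

    base = attempts_per_budget(baseline_tps, budget_s, tokens_each)
    opt = attempts_per_budget(optimized_tps, budget_s, tokens_each)
    print(f"baseline attempts:  {base}")
    print(f"optimized attempts: {opt}")
```

Rounding down to whole attempts means the realized gain can fall slightly under 42% for short budgets; the ratio approaches 1.42 as the budget grows.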
For reasoning systems that scale inference compute extensively, the architectural multiplier compounds: a model with 42% higher throughput gets 42% more exploration from the same inference budget. This matters disproportionately for approaches like the one in "Why does parallel reasoning outperform single chain thinking?", where more parallel attempts directly improve accuracy.
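As a toy illustration of why extra parallel attempts help, the sketch below computes the chance that at least one of k attempts succeeds. It assumes attempts are independent with a fixed per-attempt success rate, which real samples from one model generally are not; the success rate and attempt counts are hypothetical.

```python
def coverage(p_success: float, n_attempts: int) -> float:
    """P(at least one success) under an independence assumption."""
    return 1.0 - (1.0 - p_success) ** n_attempts


if __name__ == "__main__":
    p = 0.05                               # hypothetical per-attempt success rate
    k_base = 20                            # attempts affordable at baseline throughput
    k_opt = int(k_base * 1.42)             # attempts at 42% higher throughput
    print(f"attempts: {k_base} -> {k_opt}")
    print(f"coverage: {coverage(p, k_base):.3f} -> {coverage(p, k_opt):.3f}")
```

Under this simplified model the benefit is largest when the per-attempt success rate is low and flattens as coverage nears 1, so the size of the gain depends on the task.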
Inquiring lines that use this note as a source (45)
This note is a source for these synthesized inquiries.
- Do integrated and decoupled architectures trade off intervention accuracy for efficiency differently?
- Can likelihood choice matter more than architectural depth for CF?
- How does inference compute substitution affect the training parameter scaling trade-off?
- Does parallel thinking benefit disproportionately from higher inference throughput architectures?
- Does diffusion's control advantage come from speed gains or from architectural differences?
- Why do scaling laws fail to predict optimal architectures at small parameter counts?
- Can architecture changes and early stopping combine to close the diffusion inference gap?
- Can scaling predictions become reliable if improvements are continuous not sudden?
- Does the optimal model size depend on what capabilities you actually need?
- Why does depth outperform width for sub-billion parameter models?
- How do conditional scaling laws incorporate hardware into architecture choices?
- Does trading model size for inference steps improve overall efficiency scaling?
- Do autonomous architecture discoveries follow predictable scaling laws like human research?
- How does the Ladder of Scales approach reduce search costs across model sizes?
- Why do scaling laws show capability saturation at specific thresholds?
- Can depth scaling and breadth scaling unlock independent capability axes?
- How much inference efficiency do we gain by eliminating self-correction passes?
- How much do structural inductive biases matter compared to training data volume?
- Can compute-optimal scaling work without co-optimizing the prompt itself?
- What scaling laws govern autonomous architecture discovery in AI systems?
- What limits exist on retrieval budget during inference?
- What deployment tradeoffs emerge between single-pass and multi-pass inference adaptation?
- Can inference budgets be allocated differently based on prompt difficulty?
- Do small models show different parameter efficiency patterns than large models?
- Why does recomputing weights cost less than moving them on phones?
- Can attention mechanisms improve on Wide & Deep's static feature crosses?
- What makes a small surgical wide component sufficient with a capable deep model?
- What other internal model decisions beyond attention could be optimized directly?
- What are the scaling law differences between vision and language learning?
- Can sleep-time compute reduce latency demands during model inference?
- What inference-time scaling benefits emerge from reasoning before each prediction?
- Why do vision and language have different optimal scaling curves?
- Can inference budgets be allocated adaptively based on prompt difficulty?
- Why should scaling laws be understood as properties of data distribution rather than training in general?
- Can test-time compute scaling substitute for larger model parameters?
- What architectural variables most improve inference efficiency today?
- Why does the right structural prior matter more than raw model capacity?
- Can data pruning and equal contribution be reconciled in optimal learning?
- Why do optimal learning dynamics improve scaling law coefficients specifically?
- Can spiking sparsity replace weight quantization as a primary efficiency lever?
- Why does architecture matter more than training compute for inference efficiency?
- Do scaling laws change when weight precision becomes a design variable?
- Can attention linearity achieve similar efficiency gains as weight quantization?
- Why do hybrid attention architectures outperform pure linear attention models?
- What are the concrete efficiency gains of linear-attention state-space models?
Related concepts in this collection (5)
- Can inference compute replace scaling up model size?
Explores whether smaller models given more thinking time during inference can match larger models. Matters because it reshapes deployment economics and compute allocation strategies.
adds a third lever: architecture selection affects the conversion rate between inference compute and performance
- Can we allocate inference compute based on prompt difficulty?
Does adjusting how much compute each prompt receives—rather than using a fixed budget—improve model performance? Could smarter allocation let smaller models compete with larger ones?
architecture optimization is upstream: it determines baseline efficiency of every allocation decision
- Why does parallel reasoning outperform single chain thinking?
Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
42% throughput improvement means 42% more parallel attempts per budget, compounding the parallel advantage
- Can byte-level models match tokenized performance with better efficiency?
Tokenized models use fixed vocabularies and allocate equal compute per token, but what if we dynamically group bytes based on prediction difficulty instead? Could this approach achieve competitive performance while using fewer FLOPs?
parallel: the Byte Latent Transformer (BLT) optimizes compute allocation at the sub-token level; the conditional scaling law optimizes at the architecture level; both improve efficiency without increasing total compute
- Do pretraining and fine-tuning scale independently in language models?
Can we decouple how model scale affects different training stages to independently improve factuality versus helpfulness? This matters for understanding whether these capabilities compete or can be optimized separately.
shared decomposition methodology: emulated fine-tuning (EFT) decouples pretraining scale from fine-tuning scale, revealing independent effects (factuality vs helpfulness); conditional scaling laws decouple architecture from training compute, revealing independent efficiency gains; both show that treating model quality as a single dimension misses optimizable axes
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs
- SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
- Inference-Aware Prompt Optimization for Aligning Black-Box Large Language Models
- Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention
- The Unreasonable Ineffectiveness of the Deeper Layers
- Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought
- Scaling Laws for Neural Language Models
- A Survey on LLM Inference-Time Self-Improvement
Original note title
conditional scaling laws that incorporate architectural variables predict inference efficiency independently of training compute