INQUIRING LINE

Can architectural changes alone achieve compute-optimal per-prompt scaling?

This explores whether redesigning the model itself — its depth, attention ratios, internal structure — can deliver the win of spending the right amount of compute on each prompt, or whether that goal depends on things architecture can't touch.


The corpus suggests architecture buys you real efficiency but never the whole prize: the 'per-prompt' part of the question is fundamentally about how you *deploy* compute, not how you *build* the model.

Start with what compute-optimal per-prompt scaling actually means. The core finding is that prompts differ wildly in difficulty, and that reallocating a fixed compute budget (less for easy prompts, more for hard ones) beats spending it uniformly, and can even beat a larger model run under a uniform budget (Can we allocate inference compute based on prompt difficulty?). Inference compute can substitute for parameter scaling on hard prompts specifically, which means pretraining and inference are not separate resources you can optimize in isolation (Can inference compute replace scaling up model size?). That framing already tells you architecture alone can't get there: the adaptivity is a runtime decision about *this* prompt, not a fixed property baked into weights.
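To make the reallocation idea concrete, here is a minimal sketch under strong assumptions that are not from the cited notes: each prompt has a known single-sample success probability, samples are independent, and a perfect verifier picks out any correct sample (best-of-n). The function names and the numbers are hypothetical. Under those assumptions, handing out samples one at a time to the prompt with the largest marginal gain maximizes the expected number of solved prompts for a fixed total budget.

```python
import heapq

def solve_prob(p, n):
    """Chance that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p) ** n

def allocate_adaptive(success_probs, total_budget):
    """Give samples one at a time to the prompt with the largest marginal gain."""
    counts = [0] * len(success_probs)
    # heapq is a min-heap, so store negated gains to pop the largest gain first.
    heap = [(-solve_prob(p, 1), i) for i, p in enumerate(success_probs)]
    heapq.heapify(heap)
    for _ in range(total_budget):
        _, i = heapq.heappop(heap)
        counts[i] += 1
        p, n = success_probs[i], counts[i]
        next_gain = solve_prob(p, n + 1) - solve_prob(p, n)
        heapq.heappush(heap, (-next_gain, i))
    return counts

def expected_solved(success_probs, counts):
    """Expected number of prompts solved under a given allocation."""
    return sum(solve_prob(p, n) for p, n in zip(success_probs, counts))

# Hypothetical difficulty estimates: two easy prompts, two hard ones.
probs = [0.95, 0.90, 0.30, 0.05]
budget = 16

uniform = [budget // len(probs)] * len(probs)
adaptive = allocate_adaptive(probs, budget)

print("uniform :", uniform, round(expected_solved(probs, uniform), 3))
print("adaptive:", adaptive, round(expected_solved(probs, adaptive), 3))
```

In this toy setup the easy prompts end up with fewer samples than the hard ones, and the same total budget solves more prompts in expectation. In practice the difficulty estimate is itself a prediction, which is exactly the runtime decision the paragraph above describes.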

Where architecture genuinely helps is the per-token cost and throughput side. Folding architectural variables (hidden size, MLP-to-attention ratio, GQA configuration) into scaling laws let researchers tune for inference, reaching up to 42% greater throughput *and* up to 2.1% higher accuracy than LLaMA-3.2 under the same training budget (Can architecture choices improve inference efficiency without sacrificing accuracy?). At small scale (125M–350M parameters), deep-and-thin designs contradict the old width-balanced recipes and yield 2.7–4.3% accuracy gains by composing concepts through layers (Does depth matter more than width for tiny language models?). So architecture moves the efficiency frontier, but it makes *every* prompt cheaper, not the right prompts cheaper. It's orthogonal to the adaptive-allocation question, not a substitute for it.
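A small arithmetic sketch of why a throughput gain is orthogonal to per-prompt allocation. It assumes, as a simplification not stated in the source, that cost per token is inversely proportional to throughput; the token counts are hypothetical. The 42% figure is the one quoted above.

```python
THROUGHPUT_GAIN = 0.42  # the "42% greater throughput" figure quoted in the text

def per_prompt_cost(tokens_per_prompt, cost_per_token):
    """Cost of each prompt given how many tokens it consumes."""
    return [t * cost_per_token for t in tokens_per_prompt]

def spend_shares(costs):
    """Fraction of the total spend that goes to each prompt."""
    total = sum(costs)
    return [c / total for c in costs]

# Hypothetical token usage: two easy prompts and one hard one.
tokens = [200, 400, 6000]

base_cost = 1.0  # arbitrary unit cost per token
# Assumption: cost per token scales as 1 / throughput.
faster_cost = base_cost / (1.0 + THROUGHPUT_GAIN)

before = per_prompt_cost(tokens, base_cost)
after = per_prompt_cost(tokens, faster_cost)

print("total cost ratio:", sum(after) / sum(before))
print("shares before   :", spend_shares(before))
print("shares after    :", spend_shares(after))
```

The total bill drops by the same factor for every prompt, while each prompt's share of the spend is unchanged. Deciding which prompts deserve more tokens is a separate choice that the architecture change does not make.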

The sharper limit comes from a finding that should give architecture-only optimism pause: non-reasoning models can't catch up with reasoning models no matter how much inference compute you throw at them, because the training regime installs a protocol that makes extra tokens *productive* (Can non-reasoning models catch up with more compute?). Compute is only as good as the model's learned ability to use it. Even structural tricks that look architectural, such as recursive subtask trees with KV-cache pruning (Can recursive subtask trees overcome context window limits?) or treating the long-context bottleneck as compute-to-consolidate rather than memory (Is long-context bottleneck really about memory or compute?), work because they restructure *how reasoning unfolds at test time*, not because they're static design choices. And the same lesson appears from the prompt side: optimizing a prompt without knowing the inference strategy systematically misfires, while jointly optimizing prompt and inference strategy yields up to 50% gains (Does prompt optimization without inference strategy fail?).
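The prompt-side point can be illustrated with a toy simulation. Everything here is hypothetical: two made-up prompts induce made-up answer distributions on one task, and the two strategies are single sampling and majority voting over 15 samples. This is a sketch of the failure mode, not the method of the cited work. Prompt "A" is more often right on a single sample but concentrates its errors on one wrong answer; prompt "B" is slightly less accurate per sample but spreads its errors out.

```python
import random
from collections import Counter

# Hypothetical answer distributions induced by two prompts on the same task.
PROMPTS = {
    "A": {"correct": 0.45, "wrong_1": 0.55},
    "B": {"correct": 0.40, "wrong_1": 0.20, "wrong_2": 0.20, "wrong_3": 0.20},
}

def single_sample_accuracy(dist):
    """With one sample, accuracy is just the probability of the correct answer."""
    return dist["correct"]

def majority_vote_accuracy(dist, n_samples=15, trials=2000, seed=0):
    """Monte Carlo estimate of how often the most common answer is correct."""
    rng = random.Random(seed)
    answers, weights = zip(*dist.items())
    wins = 0
    for _ in range(trials):
        votes = Counter(rng.choices(answers, weights=weights, k=n_samples))
        if votes.most_common(1)[0][0] == "correct":
            wins += 1
    return wins / trials

STRATEGIES = {
    "single_sample": single_sample_accuracy,
    "majority_vote": majority_vote_accuracy,
}

# Score every (prompt, strategy) pair, then compare two ways of choosing.
scores = {
    (p, s): fn(dist)
    for p, dist in PROMPTS.items()
    for s, fn in STRATEGIES.items()
}
best_prompt_in_isolation = max(PROMPTS, key=lambda p: scores[(p, "single_sample")])
best_joint = max(scores, key=scores.get)

print("picked without knowing the strategy:", best_prompt_in_isolation)
print("picked jointly:", best_joint)
```

Choosing the prompt by single-sample accuracy selects "A", yet under majority voting "A" gets worse because its dominant wrong answer wins the vote, while "B" improves. Searching over the pair finds the combination that tuning the prompt alone would miss.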

So the honest answer is no, and the more interesting reframing is *why*. Per-prompt optimality is a property of the whole pipeline: training instills the protocol, architecture sets the cost-per-token floor, and a runtime controller decides how much to spend on each input. The work pointing toward unification treats agents as optimizable computational graphs where prompts, structure, and coordination get tuned together (Can we automatically optimize both prompts and agent coordination?). The thing you didn't know you wanted to know: the field is quietly converging on the idea that 'architecture,' 'training,' and 'inference strategy' are three knobs on one optimization problem, and pulling only the architecture knob leaves most of the compute-optimal gain on the table.
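The computational-graph view can be sketched in a few lines. This is an illustrative toy, not the cited system: the node functions are stubs standing in for model calls, and the names `run_graph`, `draft`, `critique`, and `revise` are hypothetical. Nodes are operations, edges say whose outputs feed whom, and the standard library's `graphlib` supplies the execution order.

```python
from graphlib import TopologicalSorter

def run_graph(nodes, edges, task):
    """Run each node after its predecessors.

    nodes: name -> callable(task, inputs), where inputs is a list of
           predecessor outputs in the order given by edges.
    edges: name -> list of predecessor names.
    """
    outputs = {}
    order = TopologicalSorter({n: edges.get(n, []) for n in nodes}).static_order()
    for name in order:
        inputs = [outputs[p] for p in edges.get(name, [])]
        outputs[name] = nodes[name](task, inputs)
    return outputs

# Stub operations; in a real agent each would wrap a prompted model call.
def draft(task, inputs):
    return f"draft({task})"

def critique(task, inputs):
    return f"critique({inputs[0]})"

def revise(task, inputs):
    return f"revise({inputs[0]}, {inputs[1]})"

nodes = {"draft": draft, "critique": critique, "revise": revise}

# A reflect-then-revise pattern expressed purely as edges.
edges = {"critique": ["draft"], "revise": ["draft", "critique"]}

result = run_graph(nodes, edges, "2+2")
print(result["revise"])
```

Because the structure lives in the `edges` dictionary and the behavior lives in the node callables, an optimizer could in principle search over both: rewriting what a node does (its prompt) or rewiring which outputs it sees.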


Sources (9 notes)

Can we allocate inference compute based on prompt difficulty?

Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.

Can inference compute replace scaling up model size?

Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.

Can architecture choices improve inference efficiency without sacrificing accuracy?

Augmenting scaling laws with hidden size, MLP-to-attention ratio, and GQA configuration enables architecture optimization for inference. Optimized models achieved up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Can recursive subtask trees overcome context window limits?

The Thread Inference Model demonstrates that reasoning structured as recursive subtask trees with rule-based KV cache pruning sustains accurate reasoning beyond context limits, even when manipulating 90% of the cache. This enables single models to replace multi-agent systems by handling full recursive reasoning internally.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Does prompt optimization without inference strategy fail?

Prompts optimized without knowledge of the inference strategy (best-of-N, majority voting) systematically underperform. Joint optimization of both prompt and inference strategy yields up to 50% improvement across reasoning and generation tasks.

Can we automatically optimize both prompts and agent coordination?

Language agents represented as computational graphs—where nodes are operations and edges define information flow—reveal that CoT, ToT, and Reflexion are formally equivalent structures. This unified view enables automatic optimization of both node prompts and edge connectivity without manual redesign.
