Can compute-optimal scaling work without co-optimizing the prompt itself?
This explores whether the gains from compute-optimal inference scaling (spending more tokens on hard prompts, fewer on easy ones) hold up when the prompt is treated as fixed — or whether the prompt and the inference strategy have to be tuned together.
The corpus's sharpest answer is that compute-optimal scaling mostly can't deliver its gains with a frozen prompt, at least not optimally. The whole premise of compute-optimal scaling is that effectiveness varies dramatically by prompt difficulty, so the same total budget goes further when easy prompts get less and hard ones get more (see "Can we allocate inference compute based on prompt difficulty?" and "How should we allocate compute budget at inference time?"). Snell et al. pushed this far enough to show inference compute can substitute for raw parameter scaling on hard prompts (see "Can inference compute replace scaling up model size?"). But all of these results measure difficulty *through* a prompt, so the prompt isn't a neutral container for the budget; it's part of what determines how much budget is needed.
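To make the allocation idea concrete, here is a minimal sketch of splitting a fixed sample budget in proportion to estimated difficulty. It is an illustration under stated assumptions, not the method from any of the cited notes: the function name `allocate_budget`, the `min_samples` floor, and the proportional rule are all hypothetical, and the difficulty scores are assumed to come from some estimator run on a specific prompt.

```python
def allocate_budget(difficulties, total_samples, min_samples=1):
    """Split a fixed sample budget across prompts in proportion to difficulty.

    difficulties: non-negative scores, one per prompt (assumed given; how to
    estimate them is outside this sketch).
    """
    n = len(difficulties)
    if total_samples < n * min_samples:
        raise ValueError("budget too small to give every prompt min_samples")

    # Every prompt gets the floor; the remainder is shared by difficulty.
    spare = total_samples - n * min_samples
    weight_sum = sum(difficulties)
    if weight_sum == 0:
        shares = [spare / n] * n  # no signal: fall back to a uniform split
    else:
        shares = [spare * d / weight_sum for d in difficulties]

    alloc = [min_samples + int(s) for s in shares]

    # Hand out samples lost to rounding, largest fractional remainder first.
    leftovers = total_samples - sum(alloc)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftovers]:
        alloc[i] += 1
    return alloc


print(allocate_budget([0.1, 0.1, 0.8], total_samples=12))  # [2, 2, 8]
```

Note that the difficulty scores are inputs here. If they are measured through a particular prompt, changing the prompt changes the scores and therefore the allocation, which is the dependency the paragraph above describes.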
The most direct hit on your question is the finding that prompts optimized without knowledge of the inference strategy systematically underperform. When a prompt is tuned in isolation and then handed to best-of-N or majority voting, the two pull against each other; jointly optimizing prompt *and* inference strategy yields up to 50% improvement (see "Does prompt optimization without inference strategy fail?"). That's the inverse of your question stated as a result: decoupling the two is exactly the failure mode. A prompt that's great for a single greedy pass can be the wrong prompt once you're sampling twenty trajectories and aggregating them.
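A toy calculation shows how the best prompt can flip once the inference strategy changes. This is a sketch with invented numbers, not data from the cited work: the two prompt profiles (`steady`, `spiky`) are hypothetical, samples are assumed independent, and each question is assumed to have only two candidate answers, so a majority vote is correct exactly when more than half of an odd number of samples are correct.

```python
from math import comb


def majority_vote_accuracy(p_correct, n_samples):
    """P(strict majority of n i.i.d. samples is correct); two answers, n odd."""
    need = n_samples // 2 + 1
    return sum(
        comb(n_samples, k) * p_correct**k * (1 - p_correct) ** (n_samples - k)
        for k in range(need, n_samples + 1)
    )


def expected_accuracy(profile, n_samples):
    """profile: list of (fraction_of_questions, per_sample_p_correct) pairs."""
    return sum(frac * majority_vote_accuracy(p, n_samples) for frac, p in profile)


# Hypothetical per-sample accuracy profiles (invented for illustration).
PROMPTS = {
    "steady": [(1.0, 0.60)],              # mildly right on every question
    "spiky": [(0.7, 0.95), (0.3, 0.10)],  # very right on most, wrong on the rest
}


def best_prompt(n_samples):
    """Pick the prompt with the highest expected accuracy at this sample count."""
    return max(PROMPTS, key=lambda name: expected_accuracy(PROMPTS[name], n_samples))


print(best_prompt(1))   # spiky
print(best_prompt(25))  # steady
```

With a single pass the spiky prompt wins on average accuracy, but voting amplifies the steady prompt's small per-question edge while it cannot rescue questions the spiky prompt gets wrong most of the time. Choosing the prompt at `n_samples=1` and then deploying at `n_samples=25` would pick the wrong one in this toy setting.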
What's interesting is *why* they're entangled rather than just *that* they are. One line of work argues the prompt is effectively a program: a single finite transformer can compute any computable function given the right prompt (see "Can a single transformer become universally programmable through prompts?"). If the prompt is the program and the inference strategy is how many times and in what pattern you run it, then 'scale compute but freeze the prompt' is like optimizing a runtime while forbidding any change to the source. Another finding shows the right prompt isn't even stable across models: rephrasing and background-knowledge prompts boost cheap models, while step-by-step reasoning *reduces* accuracy in high-performance ones (see "Do prompt techniques work the same across all LLM tiers?"). So a fixed prompt isn't a fixed lever; its value shifts with the very compute regime you're trying to scale.
The reframe worth taking away: the field is increasingly treating prompt, inference structure, and architecture as one joint optimization surface rather than separate dials. Language agents can be expressed as computational graphs where node prompts and the edges connecting them are optimized on the same footing, revealing CoT, ToT, and Reflexion as variations of one structure (see "Can we automatically optimize both prompts and agent coordination?"). Scaling laws have been extended to fold in architectural variables for inference efficiency (see "Can architecture choices improve inference efficiency without sacrificing accuracy?"). And there's a hard ceiling worth knowing about: training regime can dominate everything else. Non-reasoning models don't catch up to reasoning models no matter how much inference budget you throw at them, because the training instilled a protocol that makes extra tokens productive (see "Can non-reasoning models catch up with more compute?"). So 'can compute-optimal scaling work alone?' generalizes into a more useful question: compute is one of several co-dependent resources (prompt, inference shape, architecture, training), and freezing any one of them caps what scaling the others can buy.
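The graph view can be sketched in a few lines. This is a simplified illustration, not the cited framework's API: `run_graph`, `echo_model`, and the node names are hypothetical, the model call is a stub, and the critique loop is unrolled into an acyclic graph so it can be ordered with the standard library's `graphlib`.

```python
from graphlib import TopologicalSorter


def run_graph(node_prompts, edges, call_model, task):
    """Run each node once in dependency order; a node sees its predecessors' outputs."""
    preds = {name: set() for name in node_prompts}
    for src, dst in edges:
        preds[dst].add(src)

    outputs = {}
    for name in TopologicalSorter(preds).static_order():
        context = [outputs[p] for p in sorted(preds[name])]
        outputs[name] = call_model(node_prompts[name], task, context)
    return outputs


def echo_model(prompt, task, context):
    """Stand-in for a real model call; just records what the node was given."""
    return f"{prompt}<{task}>[{' | '.join(context)}]"


# A chain-of-thought-like graph: reason, then answer.
CHAIN_PROMPTS = {"think": "Reason step by step", "answer": "Give the final answer"}
CHAIN_EDGES = [("think", "answer")]

# Same idea with one extra node and edge: a single unrolled critique step.
CRITIQUE_PROMPTS = {**CHAIN_PROMPTS, "critique": "Find flaws in the reasoning"}
CRITIQUE_EDGES = [("think", "critique"), ("critique", "answer")]

print(list(run_graph(CHAIN_PROMPTS, CHAIN_EDGES, echo_model, "2+2")))
print(list(run_graph(CRITIQUE_PROMPTS, CRITIQUE_EDGES, echo_model, "2+2")))
```

In this representation both the prompt strings and the edge list are plain data, so a search procedure could vary either one. That is the sense in which prompts and inference structure sit on the same optimization surface.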
Sources 9 notes
Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.
Research shows that dynamically adjusting inference compute per prompt—rather than using fixed budgets—improves performance and efficiency. Uniform spending wastes resources on easy problems while underserving hard ones.
Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.
Prompts optimized without knowledge of the inference strategy (best-of-N, majority voting) systematically underperform. Joint optimization of both prompt and inference strategy yields up to 50% improvement across reasoning and generation tasks.
Research proves a single finite-size transformer exists that can compute any computable function given the right prompt, achieving complexity bounds nearly matching unbounded models. However, standard training rarely produces models that learn to implement arbitrary programs this way.
A 23-prompt benchmark across 12 LLMs shows rephrasing and background-knowledge prompts boost cheap models, while step-by-step reasoning reduces accuracy in high-performance models. Task structure, not generic best practices, determines which prompts help.
Language agents represented as computational graphs—where nodes are operations and edges define information flow—reveal that CoT, ToT, and Reflexion are formally equivalent structures. This unified view enables automatic optimization of both node prompts and edge connectivity without manual redesign.
Augmenting scaling laws with hidden size, MLP-to-attention ratio, and GQA configuration enables architecture optimization for inference. Optimized models achieved up to 2.1% higher accuracy and 42% greater throughput than LLaMA-3.2 under identical training budgets.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.