How much does inference budget improve self-generated search performance?
This explores whether giving a model more compute at inference time — more search iterations, more reasoning tokens — actually buys better results when the model is running its own searches, and where that payoff stops.
The corpus answer is a qualified yes, with sharp limits. The cleanest result is that search behaves like reasoning: agentic deep research shows a genuine test-time scaling law, where adding search iterations improves answer quality along a monotonic but diminishing curve that matches the curve for reasoning tokens (Does search budget scale like reasoning tokens for answer quality?). So budget does help, and it opens a new knob: you can trade reasoning compute against search compute to hit a quality target. But 'diminishing returns' is the load-bearing phrase. The first units of budget buy a lot; the last ones buy almost nothing.
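To make "monotonic but diminishing" concrete, here is a minimal sketch. The saturating exponential form and the parameters `q_max` and `tau` are assumptions chosen for illustration; the source notes describe the shape of the curve, not a formula for it.

```python
import math


def quality(budget_units: float, q_max: float = 1.0, tau: float = 4.0) -> float:
    """Hypothetical saturating curve: rises with budget, flattens toward q_max.

    budget_units could stand for search iterations or blocks of reasoning
    tokens. The functional form is an assumption, not a fitted result.
    """
    return q_max * (1.0 - math.exp(-budget_units / tau))


def marginal_gains(max_units: int) -> list[float]:
    """Quality gained by each additional unit of budget, from unit 1 to max_units."""
    return [quality(n) - quality(n - 1) for n in range(1, max_units + 1)]


gains = marginal_gains(12)
for n, g in enumerate(gains, start=1):
    print(f"unit {n:2d}: +{g:.3f}")
```

Under this assumed curve every extra unit still helps, but each one helps less than the one before it, which is the pattern the note attributes to both search iterations and reasoning tokens.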
The more useful finding is that *how* you spend the budget matters more than *how much* you spend. Pouring uniform compute across every query wastes it: easy prompts get more than they need while hard ones starve. Allocating the same total budget adaptively, by prompt difficulty, beats fixed budgets and can beat a larger model run under a uniform budget (Can we allocate inference compute based on prompt difficulty?; How should we allocate compute budget at inference time?). For multi-turn self-generated search there is an even sharper version of this: unrestricted reasoning *within* a single turn eats the context window that later retrieval rounds need, so capping reasoning per turn, not just overall, preserves search quality across iterations (Does limiting reasoning per turn improve multi-turn search quality?). In other words, more budget naively applied can degrade long-horizon search; budget shaped to the task improves it.
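Both ideas in this paragraph come down to simple budget arithmetic, sketched below. Everything here is hypothetical: the difficulty scores are assumed to come from some estimator the notes do not specify, proportional allocation is one plausible rule rather than the method used in the cited work, and the token counts are invented for illustration.

```python
def allocate_by_difficulty(difficulties, total_budget):
    """Split a fixed total budget across prompts in proportion to difficulty.

    `difficulties` are non-negative scores from an assumed estimator.
    Falls back to a uniform split when every score is zero.
    """
    total = sum(difficulties)
    if total <= 0:
        share = total_budget / len(difficulties)
        return [share] * len(difficulties)
    return [total_budget * d / total for d in difficulties]


def context_left_for_retrieval(context_window, turns, reasoning_per_turn,
                               per_turn_cap=None):
    """Tokens still available for retrieved evidence after `turns` rounds.

    Without a cap, each turn spends `reasoning_per_turn` tokens on reasoning;
    with a cap, each turn spends at most `per_turn_cap`.
    """
    if per_turn_cap is None:
        spent_per_turn = reasoning_per_turn
    else:
        spent_per_turn = min(reasoning_per_turn, per_turn_cap)
    return max(0, context_window - turns * spent_per_turn)


# Same total budget, shifted from the easy prompt toward the hard one.
budgets = allocate_by_difficulty([1, 3], total_budget=100)

# Eight turns in an assumed 32,000-token window.
uncapped = context_left_for_retrieval(32_000, turns=8, reasoning_per_turn=4_000)
capped = context_left_for_retrieval(32_000, turns=8, reasoning_per_turn=4_000,
                                    per_turn_cap=1_500)
print(budgets, uncapped, capped)
```

In this toy setup the uncapped agent has spent its entire window on reasoning by the eighth turn, while the capped agent still has room for new evidence. That is the context-erosion effect in miniature.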
The limit worth knowing: inference budget can't substitute for training. Non-reasoning models never catch up to reasoning models no matter how much inference compute you throw at them, because training instills a protocol that makes the extra tokens productive in the first place (Can non-reasoning models catch up with more compute?). Budget amplifies a capability the model already has; it doesn't manufacture one. This reframes the whole question: the ceiling on self-generated search isn't set at inference time, it's set by what the model was taught to do with the tokens.
That points to the more dramatic returns, which come from improving the search *method* rather than the search *budget*. A bilevel system that reads its own inner-loop code and writes new search mechanisms at runtime found combinatorial-optimization and bandit strategies that delivered a 5x improvement on GPT pretraining, far past anything budget scaling alone offers (Can an AI system improve its own search methods automatically?). And training search agents on harder synthetic data (knowledge-graph walks with blurred entities) let a 32B model outperform larger models on the BrowseComp browsing benchmark (Can knowledge graphs generate training data for search agents?). The thread across all of these: inference budget is a real lever with a real ceiling, and the biggest wins live in training and method design, where you change the shape of the curve rather than just sliding further along it.
Sources (7 notes)
Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.
Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.
Research shows that dynamically adjusting inference compute per prompt—rather than using fixed budgets—improves performance and efficiency. Uniform spending wastes resources on easy problems while underserving hard ones.
Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
An outer loop successfully read inner loop code, identified bottlenecks, and generated new Python mechanisms at runtime, discovering combinatorial optimization and bandit methods that broke the inner loop's deterministic patterns and improved performance on GPT pretraining by 5x.
KG-based random walks with selective entity obscuring create verifiable, multi-hop questions that train deep search agents effectively. DeepDive-32B trained on this data achieves 14.8% on BrowseComp, outperforming larger models through end-to-end multi-turn RL.