How does test-time scaling relate to token budget in agentic deep research?
This explores whether 'thinking harder at inference time' (test-time scaling) and 'how many tokens an agent is allowed to spend' are really the same lever — and the corpus says they're closely linked but not identical.
The corpus's answer is that, in deep research agents, test-time scaling and token budget are almost two names for one underlying axis, with an important caveat. The core insight is that searching is just another way of spending inference compute. Where we used to think of test-time scaling as 'let the model reason for more tokens,' deep research agents reveal a parallel curve: let the agent run more search steps. Multiple notes converge on this: search budget follows the *same* scaling shape as reasoning tokens, with monotonic gains that flatten into diminishing returns (Do search steps follow the same scaling rules as reasoning tokens?, Does search budget scale like reasoning tokens for answer quality?, How does search scale like reasoning in agent systems?). The practical upshot is a new dial: you can trade reasoning budget against search budget to hit a quality target, treating both as fungible inference compute.
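A minimal sketch of that dial, under loudly stated assumptions: the saturating exponential form, the scale constants, the equal weighting of the two terms, and the tokens-per-search conversion are all hypothetical choices made for illustration, and none of them come from the notes. The only property carried over from the corpus is the shape, where gains are monotonic and then flatten.

```python
import math


def quality(reasoning_tokens: float, search_steps: float,
            r_scale: float = 4000.0, s_scale: float = 8.0) -> float:
    """Toy quality score in [0, 1).

    ASSUMPTION: both axes follow 1 - exp(-x / scale), a convenient
    monotonic curve with diminishing returns. The scales and the
    50/50 weighting are made up for this sketch.
    """
    r_gain = 1.0 - math.exp(-reasoning_tokens / r_scale)
    s_gain = 1.0 - math.exp(-search_steps / s_scale)
    return 0.5 * r_gain + 0.5 * s_gain


def best_split(total_tokens: int, tokens_per_search: int = 1500,
               grid: int = 20):
    """Grid-search how to divide one token budget between the two axes.

    ASSUMPTION: each search step costs a fixed `tokens_per_search`,
    which is what makes the two budgets fungible in this toy model.
    Returns (quality, reasoning_tokens, search_steps).
    """
    best = None
    for i in range(grid + 1):
        search_tokens = total_tokens * i / grid
        reasoning_tokens = total_tokens - search_tokens
        steps = search_tokens / tokens_per_search
        q = quality(reasoning_tokens, steps)
        if best is None or q > best[0]:
            best = (q, reasoning_tokens, steps)
    return best


if __name__ == "__main__":
    q, reasoning, steps = best_split(20_000)
    print(f"quality={q:.3f} reasoning_tokens={reasoning:.0f} search_steps={steps:.1f}")
```

Under these made-up constants a mixed split does at least as well as spending the whole budget on either axis alone, because both extremes are points on the same grid. Where the best split actually falls would depend on measured curves, not on this sketch.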
The link gets even more literal at the multi-agent level, where performance turns out to be mostly a function of raw token spending. Anthropic's own evaluations found that roughly 80% of the variance in multi-agent research quality comes from how many tokens the system burns — not from clever coordination between agents (Does token spending drive multi-agent research performance?, How does test-time scaling work at the agent level?). So in this regime, 'test-time scaling' and 'token budget' nearly collapse into the same measurement. That's the strong version of the relationship: scaling is spending.
But the corpus also pushes back on treating budget as a blunt instrument. Spending more isn't the same as spending well. Adaptive allocation — giving hard prompts more compute and easy ones less — beats uniform budgets, because flat spending wastes tokens on easy problems and starves hard ones (How should we allocate compute budget at inference time?). And there's a ceiling that no budget can buy past: non-reasoning models don't catch up to reasoning models no matter how much inference compute you throw at them, because the training regime is what makes extra tokens *productive* in the first place (Can non-reasoning models catch up with more compute?). This is the useful taxonomic split — internal test-time scaling (training a model to reason on its own) builds the capability, while external scaling (search, verification, more passes at inference) extracts performance from a capability that already exists (How do internal and external test-time scaling compare?). Token budget is the lever on the external side; it can't manufacture what the internal side never installed.
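The uniform-versus-adaptive contrast can be sketched with a deliberately crude threshold model. The assumptions here are hypothetical: each prompt is treated as "solved" once it receives some minimum number of tokens, and that minimum is assumed to be known in advance (in practice it would come from some difficulty estimator, which this sketch does not implement). The token figures are invented for illustration.

```python
from typing import List


def solved_uniform(required: List[int], total_budget: int) -> int:
    """Count prompts solved when every prompt gets the same share.

    ASSUMPTION: a prompt is solved iff its share covers its need.
    """
    per_prompt = total_budget // len(required)
    return sum(1 for need in required if need <= per_prompt)


def solved_adaptive(required: List[int], total_budget: int) -> int:
    """Count prompts solved when each prompt gets exactly what it needs.

    Funds prompts cheapest-first until the budget runs out.
    ASSUMPTION: the per-prompt need is known ahead of time.
    """
    remaining = total_budget
    solved = 0
    for need in sorted(required):
        if need > remaining:
            break
        remaining -= need
        solved += 1
    return solved


if __name__ == "__main__":
    # Hypothetical token needs: three easy prompts, two hard ones.
    needs = [200, 300, 500, 2000, 3000]
    budget = 6000
    print("uniform :", solved_uniform(needs, budget))   # 1200 tokens each
    print("adaptive:", solved_adaptive(needs, budget))
```

With these numbers the uniform share of 1200 tokens overpays the three easy prompts and falls short on both hard ones, while the adaptive split covers all five from the same total. The sketch says nothing about the ceiling described above: no allocation helps if the model cannot make extra tokens productive.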
The thing you didn't know you wanted to know: the budget framing itself may be the wrong denominator for persistent agents. A 115-day case study found that ~83% of tokens were cache reads, not fresh generation — so when context persists and gets reused, the meaningful cost unit stops being 'tokens spent' and becomes 'artifacts completed' (Do persistent agents really cost less per token?). Test-time scaling laws tell you that more compute buys more quality along a predictable curve; the economics of long-lived agents tell you that *how* you account for that compute — per token vs. per finished result — changes which point on the curve is actually rational to chase.
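The change of denominator is easy to see in a small calculation. Only the 82.9% cache-read share comes from the case study note; the prices, the total token count, and the artifact count below are hypothetical placeholders, and lumping all non-cached tokens into one "fresh" price is a simplification.

```python
def blended_cost(total_tokens: float, cache_read_fraction: float,
                 fresh_price: float, cache_price: float) -> float:
    """Total spend when cached reads are billed at a different rate.

    ASSUMPTION: one price for all fresh tokens and one for cache reads.
    """
    cached = total_tokens * cache_read_fraction
    fresh = total_tokens - cached
    return fresh * fresh_price + cached * cache_price


def cost_per_artifact(total_cost: float, artifacts_completed: int) -> float:
    """Spend divided by finished results instead of by tokens."""
    return total_cost / artifacts_completed


if __name__ == "__main__":
    CACHE_READ_FRACTION = 0.829   # from the 115-day case study note
    TOTAL_TOKENS = 1_000_000      # hypothetical
    FRESH_PRICE = 1.0             # hypothetical cost units per token
    CACHE_PRICE = 0.1             # hypothetical discounted cache-read rate
    ARTIFACTS = 50                # hypothetical count of finished results

    naive = TOTAL_TOKENS * FRESH_PRICE
    blended = blended_cost(TOTAL_TOKENS, CACHE_READ_FRACTION,
                           FRESH_PRICE, CACHE_PRICE)
    print(f"naive per-token total : {naive:.0f}")
    print(f"blended total         : {blended:.0f}")
    print(f"cost per artifact     : {cost_per_artifact(blended, ARTIFACTS):.0f}")
```

Counting every token at the fresh rate overstates spend whenever cache reads are cheaper, and dividing by artifacts rather than tokens is what lets two agents with very different token counts be compared on finished work.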
Sources (9 notes)
Deep research agents improve with more search steps in a pattern mirroring the reasoning-token relationship, with both exhibiting diminishing returns. This reveals a new inference-compute axis beyond model capability alone.
Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.
Test-time scaling laws generalize from reasoning to retrieval: search steps follow identical scaling curves to reasoning tokens, making deep research a test-time scaling problem. This insight reframes search as a compute axis comparable to chain-of-thought reasoning.
Anthropic's internal evals show token spending alone accounts for 80% of performance variance in multi-agent research systems. Model capability upgrades deliver larger gains than doubling token budget, suggesting efficiency matters as much as quantity.
Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.
Research shows that dynamically adjusting inference compute per prompt—rather than using fixed budgets—improves performance and efficiency. Uniform spending wastes resources on easy problems while underserving hard ones.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
Research shows test-time scaling methods split into internal (training models for autonomous reasoning) and external (inference-time search and verification). They complement rather than compete; internal builds capability while external extracts performance from existing capability.
A 115-day case study found 82.9% of tokens were cache reads. When context persists and is reused, the meaningful cost denominator becomes completed artifacts, not individual tokens.