INQUIRING LINE

How does test-time scaling relate to token budget in agentic deep research?

This explores whether 'thinking harder at inference time' (test-time scaling) and 'how many tokens an agent is allowed to spend' are really the same lever — and the corpus says they're closely linked but not identical.


The surprising answer the corpus gives is that test-time scaling and token budget are almost two names for one underlying axis, with an important caveat. The core insight is that searching is just another way of spending inference compute. Where we used to think of test-time scaling as 'let the model reason for more tokens,' deep research agents reveal a parallel curve: let the agent run more search steps. Multiple notes converge on this: search budget follows the *same* scaling shape as reasoning tokens, with monotonic gains that flatten into diminishing returns (Do search steps follow the same scaling rules as reasoning tokens?, Does search budget scale like reasoning tokens for answer quality?, How does search scale like reasoning in agent systems?). The practical upshot is a new dial: you can trade reasoning budget against search budget to hit a quality target, treating both as fungible inference compute.
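
To make the 'new dial' concrete, here is a minimal sketch of trading reasoning tokens against search steps under one shared budget. Everything numeric in it is an assumption for illustration: the exponential saturating curve, the 0.6/0.4 weights, the scale constants, and the 500-token cost per search step are hypothetical and do not come from the notes, which only describe the curve's shape (monotonic, then flattening).

```python
import math

# Hypothetical curve parameters, chosen only to produce a saturating shape.
REASONING_SCALE = 4000.0   # tokens at which reasoning gains start to flatten
SEARCH_SCALE = 8.0         # search steps at which search gains start to flatten

def quality(reasoning_tokens: float, search_steps: float) -> float:
    """Toy quality score in [0, 1): monotonic in both inputs, with diminishing returns."""
    reasoning_gain = 0.6 * (1.0 - math.exp(-reasoning_tokens / REASONING_SCALE))
    search_gain = 0.4 * (1.0 - math.exp(-search_steps / SEARCH_SCALE))
    return reasoning_gain + search_gain

def best_split(total_tokens: int, tokens_per_search: int = 500):
    """Try every whole number of search steps that fits and keep the best split."""
    best = None
    for search_steps in range(total_tokens // tokens_per_search + 1):
        reasoning_tokens = total_tokens - search_steps * tokens_per_search
        score = quality(reasoning_tokens, search_steps)
        if best is None or score > best[2]:
            best = (reasoning_tokens, search_steps, score)
    return best

if __name__ == "__main__":
    for budget in (4000, 8000, 16000):
        reasoning, searches, score = best_split(budget)
        print(budget, reasoning, searches, round(score, 3))
```

Under these assumed curves the best split is an interior one (some reasoning, some search), and doubling the budget buys less than the first tranche did. A real system would have to fit the two curves from measurements rather than assume them.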

The link gets even more literal at the multi-agent level, where performance turns out to be mostly a function of raw token spending. Anthropic's own evaluations found that roughly 80% of the variance in multi-agent research quality comes from how many tokens the system burns — not from clever coordination between agents (Does token spending drive multi-agent research performance?, How does test-time scaling work at the agent level?). So in this regime, 'test-time scaling' and 'token budget' nearly collapse into the same measurement. That's the strong version of the relationship: scaling is spending.

But the corpus also pushes back on treating budget as a blunt instrument. Spending more isn't the same as spending well. Adaptive allocation — giving hard prompts more compute and easy ones less — beats uniform budgets, because flat spending wastes tokens on easy problems and starves hard ones (How should we allocate compute budget at inference time?). And there's a ceiling that no budget can buy past: non-reasoning models don't catch up to reasoning models no matter how much inference compute you throw at them, because the training regime is what makes extra tokens *productive* in the first place (Can non-reasoning models catch up with more compute?). This is the useful taxonomic split — internal test-time scaling (training a model to reason on its own) builds the capability, while external scaling (search, verification, more passes at inference) extracts performance from a capability that already exists (How do internal and external test-time scaling compare?). Token budget is the lever on the external side; it can't manufacture what the internal side never installed.
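
The contrast between uniform and adaptive allocation can be sketched in a few lines. This is an illustration only: the difficulty scores, the per-prompt floor of 256 tokens, and the proportional weighting rule are assumptions, not a method taken from the notes, and the sketch shows how budgets are divided rather than measuring any quality gain.

```python
def allocate_uniform(difficulties, total_budget):
    """Give every prompt the same share, regardless of difficulty."""
    share = total_budget // len(difficulties)
    return [share] * len(difficulties)

def allocate_adaptive(difficulties, total_budget, floor=256):
    """Give every prompt a small floor, then split the rest in proportion to difficulty.

    `difficulties` are hypothetical scores in [0, 1], e.g. from a cheap classifier.
    """
    reserved = floor * len(difficulties)
    if reserved > total_budget:
        raise ValueError("budget too small for the per-prompt floor")
    weight_sum = sum(difficulties)
    if weight_sum == 0:
        # No difficulty signal at all: fall back to the flat split.
        return allocate_uniform(difficulties, total_budget)
    spare = total_budget - reserved
    return [floor + int(spare * d / weight_sum) for d in difficulties]

if __name__ == "__main__":
    scores = [0.1, 0.2, 0.9]  # easy, easy, hard
    print(allocate_uniform(scores, 6000))
    print(allocate_adaptive(scores, 6000))
```

With the same total, the adaptive split moves tokens away from the two easy prompts and toward the hard one, which is the behavior the allocation note describes. How difficulty is estimated is left open here and is the hard part in practice.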

The thing you didn't know you wanted to know: the budget framing itself may be the wrong denominator for persistent agents. A 115-day case study found that ~83% of tokens were cache reads, not fresh generation — so when context persists and gets reused, the meaningful cost unit stops being 'tokens spent' and becomes 'artifacts completed' (Do persistent agents really cost less per token?). Test-time scaling laws tell you that more compute buys more quality along a predictable curve; the economics of long-lived agents tell you that *how* you account for that compute — per token vs. per finished result — changes which point on the curve is actually rational to chase.
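
The shift in denominator can be shown with a small accounting sketch. The prices, the token counts, and the number of completed artifacts below are hypothetical; only the roughly 83% cache-read share echoes the case study. The assumption that cache reads are billed at a fraction of the fresh-token price is also an illustrative one.

```python
from dataclasses import dataclass

@dataclass
class UsageLog:
    cache_read_tokens: int     # tokens served from reused context
    fresh_tokens: int          # newly processed or generated tokens
    artifacts_completed: int   # finished results (reports, patches, ...)

def total_cost(log: UsageLog, fresh_price: float, cache_read_price: float) -> float:
    """Total spend when cache reads and fresh tokens are priced differently."""
    return log.fresh_tokens * fresh_price + log.cache_read_tokens * cache_read_price

def cost_per_token(log: UsageLog, fresh_price: float, cache_read_price: float) -> float:
    """Blended price across all tokens, cached or not."""
    tokens = log.fresh_tokens + log.cache_read_tokens
    return total_cost(log, fresh_price, cache_read_price) / tokens

def cost_per_artifact(log: UsageLog, fresh_price: float, cache_read_price: float) -> float:
    """Spend divided by finished results, the denominator the case study argues for."""
    return total_cost(log, fresh_price, cache_read_price) / log.artifacts_completed

if __name__ == "__main__":
    # Hypothetical log with an 82.9% cache-read share; prices are in arbitrary units.
    log = UsageLog(cache_read_tokens=829, fresh_tokens=171, artifacts_completed=10)
    print(cost_per_token(log, fresh_price=1.0, cache_read_price=0.1))
    print(cost_per_artifact(log, fresh_price=1.0, cache_read_price=0.1))
```

Two agents with the same raw token count can land on very different per-artifact costs depending on cache share and on how many results they actually finish, which is why the per-token figure alone says little about a persistent agent.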


Sources (9 notes)

Do search steps follow the same scaling rules as reasoning tokens?

Deep research agents improve with more search steps in a pattern mirroring the reasoning-token relationship, with both exhibiting diminishing returns. This reveals a new inference-compute axis beyond model capability alone.

Does search budget scale like reasoning tokens for answer quality?

Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.

How does search scale like reasoning in agent systems?

Test-time scaling laws generalize from reasoning to retrieval: search steps follow identical scaling curves to reasoning tokens, making deep research a test-time scaling problem. This insight reframes search as a compute axis comparable to chain-of-thought reasoning.

Does token spending drive multi-agent research performance?

Anthropic's internal evals show token spending alone accounts for 80% of performance variance in multi-agent research systems. Model capability upgrades deliver larger gains than doubling token budget, suggesting efficiency matters as much as quantity.

How does test-time scaling work at the agent level?

Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.

How should we allocate compute budget at inference time?

Research shows that dynamically adjusting inference compute per prompt—rather than using fixed budgets—improves performance and efficiency. Uniform spending wastes resources on easy problems while underserving hard ones.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

How do internal and external test-time scaling compare?

Research shows test-time scaling methods split into internal (training models for autonomous reasoning) and external (inference-time search and verification). They complement rather than compete; internal builds capability while external extracts performance from existing capability.

Do persistent agents really cost less per token?

A 115-day case study found 82.9% of tokens were cache reads. When context persists and is reused, the meaningful cost denominator becomes completed artifacts, not individual tokens.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about test-time scaling and token budgets in agentic deep research systems. The question remains: are test-time scaling and token budget fungible levers on the same axis, or do they decompose into distinct modes?

What a curated library found — and when (dated claims, not current truth): Findings span Feb 2025–May 2026.
• Search budget and reasoning tokens follow the same scaling-law shape with diminishing returns; both are fungible forms of inference compute (2025–2026).
• ~80% of multi-agent research quality variance comes from raw token spend, not agent coordination (2026).
• Adaptive token allocation (harder prompts get more budget) beats uniform spending; flat budgets waste compute on easy tasks (2025).
• Non-reasoning models cannot match reasoning-capable models regardless of inference budget—training regime, not test-time compute, installs productivity (2025).
• In persistent agents with long-lived context, ~83% of "tokens" are cache reads; cost unit shifts from per-token to per-artifact (2026).

Anchor papers (verify; mind their dates):
• arXiv:2503.24235 (Mar 2025) – Survey on Test-Time Scaling
• arXiv:2506.18959 (Jun 2025) – Agentic Deep Research and Search Incentives
• arXiv:2604.02460 (Apr 2026) – Single-Agent vs. Multi-Agent Under Equal Thinking Budget
• arXiv:2605.26870 (May 2026) – Persistent Agents Case Study

Your task:
(1) RE-TEST THE FUNGIBILITY THESIS. For each constraint above, judge whether newer training methods (e.g., RL on search, post-training on reasoning), inference tooling (speculative decoding, adaptive compute), or architectural shifts (memory-augmented reasoning, hierarchical planning) have since relaxed or dissolved the boundary between search budget and reasoning tokens. Separate the durable question ("How do we optimally allocate finite compute?") from perishable limits ("search and reasoning are separate levers"). Cite what moved the regime.
(2) Surface the strongest DISAGREEMENT or SUPERSEDING work from the last 6 months. Does any recent paper contradict the 80% token-variance claim, adaptive allocation wins, or the cache-read economics of persistent agents?
(3) Propose 2 research questions that assume the regime may have shifted: one on whether training-time decisions (e.g., instruct-tuning for search) now dominate test-time allocation, and one on whether context caching and retrieval have inverted the cost structure so that token-budget framing is obsolete.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
