INQUIRING LINE

Does test-time compute scaling work for agentic deep research tasks?

This explores whether throwing more inference-time compute at a research agent — letting it search and reason longer rather than retraining or enlarging the model — actually buys better answers, and where the gains taper off.


The corpus answers yes, with a twist: what scales is not just reasoning but *search itself*. Several notes converge on the finding that an agent's search budget follows the same scaling curve as its reasoning tokens: more search steps improve answer quality along the same monotonic-then-diminishing path that more chain-of-thought does (Does search budget scale like reasoning tokens for answer quality?; Do search steps follow the same scaling rules as reasoning tokens?). The reframing is the payoff: search stops being plumbing and becomes a *compute axis* you can dial, trading reasoning budget against retrieval budget to optimize the result (How does search scale like reasoning in agent systems?).
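To make the "compute axis" idea concrete, here is a minimal sketch of trading reasoning tokens against search steps under one fixed budget. Everything in it is an assumption for illustration: the saturating curve `toy_quality`, its constants, and the `tokens_per_search` cost are hypothetical and not fitted to any result in the notes. It only shows the shape of the argument, namely that when both axes have diminishing returns, the best split is somewhere in the interior.

```python
import math


def toy_quality(reasoning_tokens: int, search_steps: int) -> float:
    """Hypothetical answer-quality curve (NOT fitted to any data).

    Each axis rises monotonically and then flattens, mimicking the
    monotonic-then-diminishing pattern described in the notes.
    """
    reasoning_gain = 1.0 - math.exp(-reasoning_tokens / 4000.0)
    search_gain = 1.0 - math.exp(-search_steps / 8.0)
    return reasoning_gain * search_gain


def best_split(total_budget: int, tokens_per_search: int = 500):
    """Exhaustively try every search-step count that fits in the budget.

    Returns (quality, search_steps, reasoning_tokens) for the best split.
    tokens_per_search is an assumed, illustrative cost of one search call.
    """
    best = (0.0, 0, total_budget)
    for steps in range(total_budget // tokens_per_search + 1):
        reasoning = total_budget - steps * tokens_per_search
        quality = toy_quality(reasoning, steps)
        if quality > best[0]:
            best = (quality, steps, reasoning)
    return best


if __name__ == "__main__":
    quality, steps, reasoning = best_split(16000)
    print(f"search steps={steps}, reasoning tokens={reasoning}, quality={quality:.3f}")
```

With this toy curve, spending the whole budget on either axis scores zero, so the optimizer always lands on a mixed allocation. A real agent would need an empirically measured curve in place of `toy_quality`.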

But 'just spend more tokens' comes with sharp caveats. At the multi-agent level, roughly 80% of performance variance turns out to be a function of token spend rather than any clever coordination between agents, which is both freeing, since you know what knob to turn, and sobering, since much of the apparent intelligence is just budget (How does test-time scaling work at the agent level?). And spending only helps if the model was trained to use it: non-reasoning models don't catch up to reasoning models no matter how much inference compute you pour in, because training instills a protocol that makes the extra tokens productive in the first place (Can non-reasoning models catch up with more compute?). So test-time scaling extracts capability the model already has; it doesn't manufacture it. That's the internal-vs-external split worth knowing: internal scaling (training for autonomous reasoning) builds the capability, external scaling (inference-time search and verification) cashes it out, and the two complement rather than substitute (How do internal and external test-time scaling compare?).

The deeper trade is that inference compute and model size are not separate resources. On hard prompts, a smaller model given more thinking time matches a larger one: compute spent at inference can stand in for compute spent on parameters (Can inference compute replace scaling up model size?). The catch is that uniform spending is wasteful: dumping a fixed budget on every query overpays for easy questions and starves the hard ones, so adaptive per-prompt allocation beats flat budgets (How should we allocate compute budget at inference time?). For a research agent fielding a mix of trivial and gnarly sub-questions, that's the real lever: not 'more,' but 'more where it counts.'
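The contrast between flat and adaptive budgets can be sketched in a few lines. This is an illustrative allocation rule, not the method from any cited note: it assumes each sub-question already carries a difficulty estimate in [0, 1] (how that estimate is obtained is left open), and the function names and the `floor` parameter are hypothetical.

```python
def uniform_allocation(difficulties, total_budget):
    """Flat budget: every query gets the same share, whatever its difficulty."""
    share = total_budget // len(difficulties)
    return [share] * len(difficulties)


def adaptive_allocation(difficulties, total_budget, floor=100):
    """Give each query a small floor, then split the rest by difficulty.

    difficulties: assumed per-query difficulty estimates in [0, 1].
    floor: assumed minimum token budget so easy queries are never starved.
    """
    n = len(difficulties)
    reserved = floor * n
    if reserved > total_budget:
        raise ValueError("total_budget too small for the per-query floor")
    spare = total_budget - reserved
    weight_sum = sum(difficulties)
    if weight_sum == 0:
        # No difficulty signal at all: fall back to an even split.
        return [floor + spare // n] * n
    return [floor + int(spare * d / weight_sum) for d in difficulties]


if __name__ == "__main__":
    difficulties = [0.1, 0.1, 0.8]  # two trivial sub-questions, one hard one
    print("uniform: ", uniform_allocation(difficulties, 10_000))
    print("adaptive:", adaptive_allocation(difficulties, 10_000))
```

The proportional rule is only one possible design. Its quality depends entirely on the difficulty estimates, which is the allocation-accuracy question the research prompt further down raises.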

What the reader might not expect: scaling doesn't have to mean *deeper*. You can scale *wider* by sampling parallel trajectories instead of one long serial chain, sidestepping the latency tax of depth (Can reasoning systems scale wider instead of only deeper?). And there's a striking inversion at the extreme: decompose a task into enough tiny verified subtasks with voting at each step, and small non-reasoning models execute million-step jobs error-free, no expensive reasoning model required (Can extreme task decomposition enable reliable execution at million-step scale?). Two routes to reliability that don't look like 'bigger budget' at all.
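Why per-step voting matters at long horizons can be shown with simple arithmetic. The sketch below uses plain majority voting over independent samples as a stand-in; it is not claimed to be the voting scheme the cited work uses, and the independence assumption is exactly what correlated errors would break. The function names are hypothetical.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that more than half of n independent samples are correct.

    p: assumed per-sample accuracy on one subtask (binary right/wrong).
    n: number of votes; must be odd so ties cannot occur.
    """
    if n % 2 == 0:
        raise ValueError("use an odd n so there are no ties")
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


def chain_success(step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independence."""
    return step_accuracy**steps


if __name__ == "__main__":
    for votes in (1, 3, 11):
        step = majority_vote_accuracy(0.9, votes)
        print(f"votes={votes}: step accuracy={step:.6f}, "
              f"100-step success={chain_success(step, 100):.4f}")
```

With a per-sample accuracy of 0.9, three votes lift step accuracy to 0.972. Because errors compound multiplicatively over a chain, even small per-step gains like this change whether a long job can finish cleanly.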

So test-time scaling genuinely works for agentic deep research, but the honest version is layered: search scales like reasoning, most multi-agent gains are just spend, the spend only pays off on models trained to reason, and the smartest agents allocate adaptively, sometimes going wide or decomposing rather than simply spending more. If you want the structural picture of how all this fits together, the internal/external taxonomy is the doorway (How do internal and external test-time scaling compare?).


Sources: 10 notes

Does search budget scale like reasoning tokens for answer quality?

Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.

Do search steps follow the same scaling rules as reasoning tokens?

Deep research agents improve with more search steps in a pattern mirroring the reasoning-token relationship, with both exhibiting diminishing returns. This reveals a new inference-compute axis beyond model capability alone.

How does search scale like reasoning in agent systems?

Test-time scaling laws generalize from reasoning to retrieval: search steps follow identical scaling curves to reasoning tokens, making deep research a test-time scaling problem. This insight reframes search as a compute axis comparable to chain-of-thought reasoning.

How does test-time scaling work at the agent level?

Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

How do internal and external test-time scaling compare?

Research shows test-time scaling methods split into internal (training models for autonomous reasoning) and external (inference-time search and verification). They complement rather than compete; internal builds capability while external extracts performance from existing capability.

Can inference compute replace scaling up model size?

Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.

How should we allocate compute budget at inference time?

Research shows that dynamically adjusting inference compute per prompt—rather than using fixed budgets—improves performance and efficiency. Uniform spending wastes resources on easy problems while underserving hard ones.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Can extreme task decomposition enable reliable execution at million-step scale?

MAKER solves million-step tasks with zero errors by decomposing into minimal subtasks, applying voting at each step, and flagging correlated errors. Surprisingly, small non-reasoning models suffice when decomposition is extreme enough, inverting the standard approach to hard problems.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about test-time compute scaling in agentic deep research. The question: does allocating more inference-time compute (search budget, reasoning tokens, parallel sampling) to a research agent genuinely improve answer quality, or is the apparent gain mostly an artifact of model training, task decomposition, or evaluation setup?

What a curated library found — and when (2025–2026, dated claims not current truth):
• Search budget follows the same monotonic-then-diminishing scaling curve as reasoning tokens; both are compute axes you can trade off (2025–06, agentic deep research).
• ~80% of multi-agent performance variance is token spend, not coordination — revealing that much apparent agent intelligence is budget-driven (2025–06).
• Non-reasoning models cannot catch up to reasoning-trained models no matter inference budget, because training instills the protocol to use extra tokens productively (2025–02, 2025–04).
• On hard prompts, smaller models + more thinking time match larger ones; inference compute can substitute for parameter scaling (2025–02, 2025–06).
• Adaptive per-prompt allocation beats flat budgets; wide sampling (parallel trajectories) can outperform deep serial chains, sidestepping latency (2025–06, 2025–07).
• Extreme task decomposition into microagents with voting enables small non-reasoning models to execute million-step tasks error-free (2025–11).

Anchor papers (verify; mind their dates):
• arXiv:2506.18959 (2025–06): From Web Search towards Agentic Deep Research — establishes search-as-compute framing.
• arXiv:2506.04210 (2025–06): Does Thinking More always Help? — surfaces conditions where extra compute hurts or saturates.
• arXiv:2604.02460 (2026–04): Single-Agent LLMs Outperform Multi-Agent Systems — recent, directly challenges multi-agent scaling narrative.
• arXiv:2511.09030 (2025–11): Solving a Million-Step LLM Task with Zero Errors — demonstrates decomposition as alternative to reasoning scaling.

Your task:
(1) RE-TEST THE CORE TENSION: The library claims test-time scaling works *if* models are trained for reasoning, yet also claims decomposition + voting sidesteps reasoning models entirely. Reconcile: does this mean scaling is genuinely regime-dependent (training matters more than budget), or is the library conflating two separate phenomena (internal reasoning scaling vs. external task decomposition)? For each claim, judge whether newer work (Jan 2026–present) has shown training-agnostic scaling methods, or whether the training prerequisite remains inescapable.
(2) Surface the sharpest CONTRADICTION in the library: arXiv:2604.02460 (2026–04) reports single-agent outperforming multi-agent under equal thinking budget — directly undermining the ~80% token-spend claim. Dig into whether this reversal is method-specific (agent architecture), domain-specific (reasoning vs. retrieval), or signals a deeper shift in how multi-agent systems should be designed.
(3) Propose two new research questions that assume the regime *has* moved:
   — If adaptive per-prompt allocation is now standard, what new bottleneck emerges (e.g., allocation oracle accuracy, or hidden compute waste in switching costs)?
   — If decomposition can bypass reasoning models, what is the minimal training signal needed to enable zero-error million-step execution, and does it still count as test-time scaling?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
