SYNTHESIS NOTE

Does search budget scale like reasoning tokens for answer quality?

Explores whether the test-time scaling law that applies to reasoning tokens also governs search-based retrieval in agentic systems. Understanding this relationship could reshape how we allocate inference compute between thinking and searching.

Synthesis note · 2026-02-21 · sourced from Deep Research

The test-time scaling framework, in which more inference compute yields better answers up to a threshold, has been documented for reasoning token budgets in chain-of-thought models. The Agentic Deep Research finding extends this to search: more search steps and more retrieval rounds yield better answers. The relationship follows the same shape as the reasoning-token curve.

This matters because it multiplies the design space for inference-time compute. Before, the question was "how many tokens to think?" Now there are two axes: reasoning budget per query and search budget per query. They are not independent — longer chains may require more retrieval to validate intermediate steps, and more retrieval may require more reasoning to synthesize. The optimal allocation problem gets harder.
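The two-axis allocation problem can be made concrete with a toy model. The sketch below is illustrative only: the quality function, its interaction term, the cost constants, and the grid are all assumptions invented for this example, not measurements from the sources behind this note. It shows how a coupled quality surface turns "how many tokens to think?" into a joint search over both budgets.

```python
from itertools import product

TOKEN_COST = 1.0      # assumed cost units per reasoning token
SEARCH_COST = 500.0   # assumed cost units per search call


def toy_quality(reason_tokens, search_calls):
    """Hypothetical quality model: each axis saturates, and the two interact."""
    r = reason_tokens / (reason_tokens + 2000.0)  # diminishing returns in reasoning
    s = search_calls / (search_calls + 4.0)       # diminishing returns in search
    return 0.3 * r + 0.3 * s + 0.4 * r * s        # cross term couples the axes


def best_allocation(budget, token_grid=range(0, 16001, 500), search_grid=range(0, 33)):
    """Brute-force the grid for the feasible (tokens, searches) pair with top quality."""
    feasible = [
        (t, k)
        for t, k in product(token_grid, search_grid)
        if t * TOKEN_COST + k * SEARCH_COST <= budget
    ]
    return max(feasible, key=lambda alloc: toy_quality(*alloc))


if __name__ == "__main__":
    for budget in (4000, 8000, 16000):
        tokens, searches = best_allocation(budget)
        print(budget, tokens, searches, round(toy_quality(tokens, searches), 3))
```

Because of the cross term, neither axis can be tuned in isolation in this toy model, which is the sense in which the allocation problem gets harder.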

The practical implication is that "deep research quality" is not a fixed property of a model — it is a function of the search budget you give it. A mid-sized model with a large search budget can outperform a large model with a restricted one. This shifts cost optimization from training compute to inference architecture, specifically the retrieval loop.

The finding also reframes what "thinking harder" means for agents. For single-turn reasoning models, thinking harder means more tokens per response. For search agents, thinking harder means more search-retrieve-synthesize iterations. The question raised in "How should we balance parallel versus sequential compute at test time?" applies here too: whether to parallelize retrieval across multiple query variants (parallel) or chain queries iteratively (sequential) is the same structural trade-off operating at the retrieval level.
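A minimal sketch of the two retrieval strategies under the same call budget follows. The toy index, the query strings, and the `toy_next_query` rule are hypothetical; a real agent would use a retriever and a model-generated follow-up query. The example is deliberately a two-hop question, so it shows a case where chaining helps; for independent sub-questions the parallel form would do just as well and could run concurrently.

```python
from typing import Callable, List, Optional

Search = Callable[[str], List[str]]

# Hypothetical two-hop corpus: the second fact is only findable once the first is known.
TOY_INDEX = {
    "capital of country A": ["The capital of country A is city B."],
    "mayor of city B": ["The mayor of city B is person C."],
}


def toy_search(query: str) -> List[str]:
    return TOY_INDEX.get(query, [])


def parallel_retrieve(search: Search, query_variants: List[str], budget: int) -> List[str]:
    """Breadth: spend the budget on independent variants (could be issued concurrently)."""
    docs: List[str] = []
    for query in query_variants[:budget]:
        docs.extend(search(query))
    return docs


def sequential_retrieve(
    search: Search,
    first_query: str,
    next_query: Callable[[str, List[str]], Optional[str]],
    budget: int,
) -> List[str]:
    """Depth: each query is derived from what the previous one returned."""
    docs: List[str] = []
    query: Optional[str] = first_query
    for _ in range(budget):
        if query is None:
            break
        hits = search(query)
        docs.extend(hits)
        query = next_query(query, hits)
    return docs


def toy_next_query(query: str, hits: List[str]) -> Optional[str]:
    """Stand-in for a model that reads the hits and writes the follow-up query."""
    if query == "capital of country A" and any("city B" in h for h in hits):
        return "mayor of city B"
    return None


variants = ["capital of country A", "country A capital city", "mayor of country A capital"]
parallel_docs = parallel_retrieve(toy_search, variants, budget=3)
sequential_docs = sequential_retrieve(toy_search, "capital of country A", toy_next_query, budget=3)
```

With these toy inputs the parallel variants never name city B, so only the sequential chain reaches the second-hop fact.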

CoRAG (Chain-of-Retrieval Augmented Generation) extends this from agentic search behavior to explicitly trained retrieval models. Rejection sampling generates the intermediate retrieval chains used for training; at test time, compute is controlled through the decoding strategy (greedy, best-of-N, or tree search). The same monotonic scaling relationship holds: more retrieval budget yields better answers on multi-hop QA. The test-time scaling law is therefore not specific to reasoning tokens or agentic search; it appears to be a general property of any iterative process with quality-sensitive intermediate steps. See "Can retrieval be extended into multi-step chains like reasoning?".
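To illustrate how a decoding strategy becomes a compute dial, here is a generic best-of-N sketch over retrieval chains. It is not CoRAG's implementation: the chain sampler, the 50% hit rate, and the scoring rule are hypothetical placeholders. Greedy decoding corresponds roughly to drawing a single chain, and the retrieval budget grows as N times the chain length.

```python
import random
from typing import Callable, List, Tuple

Chain = List[Tuple[str, str]]  # (sub_query, sub_answer) pairs

NO_INFO = "no relevant information"


def best_of_n(
    sample_chain: Callable[[random.Random], Chain],
    score_chain: Callable[[Chain], float],
    n: int,
    seed: int = 0,
) -> Chain:
    """Sample n candidate retrieval chains and keep the highest-scoring one."""
    rng = random.Random(seed)
    chains = [sample_chain(rng) for _ in range(n)]
    return max(chains, key=score_chain)


def toy_sample_chain(rng: random.Random, max_steps: int = 3) -> Chain:
    """Placeholder sampler: each retrieval step finds something half of the time."""
    chain: Chain = []
    for step in range(max_steps):
        found = rng.random() < 0.5
        chain.append((f"sub-query {step}", "fact" if found else NO_INFO))
    return chain


def toy_score(chain: Chain) -> float:
    """Placeholder scorer: count the steps that retrieved something useful."""
    return float(sum(answer != NO_INFO for _, answer in chain))


single = best_of_n(toy_sample_chain, toy_score, n=1)
wide = best_of_n(toy_sample_chain, toy_score, n=8)
```

Since both calls share a seed, the n=8 pool contains the n=1 chain, so the wider sample can only match or beat it under this scorer.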

Search-R1 and R1-Searcher demonstrate RL-based approaches that teach LLMs to autonomously invoke search during reasoning. Search-R1 (2025) uses retrieved token masking for stable RL training and a simple outcome-based reward, achieving a 24% improvement (Qwen2.5-7B) over RAG baselines. The model learns multi-turn search with <search>/<information> token pairs. R1-Searcher (2025) introduces a two-stage approach: first a retrieve-reward incentivizes the model to conduct retrieval operations correctly, then an answer-reward encourages effective use of the retrieved knowledge. Both demonstrate that RL training enables test-time scaling of tool calls: models learn to invoke search more frequently and more effectively as task difficulty increases, which supports the search-budget scaling law.
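The multi-turn pattern can be sketched as a rollout loop with an explicit search budget, plus a helper that marks retrieved spans as non-trainable. This is a simplified illustration, not code from either paper: the scripted `toy_generate` model, `toy_search`, and the span-level mask are assumptions. In particular, "retrieved token masking" is read here as excluding text inside <information> tags from the training loss.

```python
import re
from typing import Callable, List, Tuple

SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)
INFO_RE = re.compile(r"(<information>.*?</information>)", re.DOTALL)


def rollout(
    generate: Callable[[str], str],
    search: Callable[[str], str],
    prompt: str,
    max_searches: int,
) -> Tuple[str, int]:
    """Alternate generation and retrieval until the model stops searching
    or the search budget is spent. Returns (transcript, searches_used)."""
    context, calls = prompt, 0
    while True:
        segment = generate(context)
        context += segment
        match = SEARCH_RE.search(segment)
        if match is None or calls >= max_searches:
            # Either the model answered, or its last query goes unanswered.
            return context, calls
        calls += 1
        context += f"<information>{search(match.group(1).strip())}</information>"


def loss_mask(transcript: str) -> List[Tuple[str, bool]]:
    """Split a transcript into (text, trainable) spans; retrieved text is not trainable."""
    return [
        (part, not part.startswith("<information>"))
        for part in INFO_RE.split(transcript)
        if part
    ]


def toy_generate(context: str) -> str:
    """Scripted stand-in for a policy model: search once, then answer."""
    if "<information>" not in context:
        return "I need a fact. <search>capital of country A</search>"
    return " The answer is city B."


def toy_search(query: str) -> str:
    return "The capital of country A is city B."


transcript, used = rollout(toy_generate, toy_search, "Q: capital of country A? ", max_searches=2)
spans = loss_mask(transcript)
```

Raising `max_searches` is the knob that corresponds to the search budget discussed in this note; the mask keeps the policy from being trained to reproduce retrieved text.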

Inquiring lines that read this note 65

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

What capability tradeoffs emerge when scaling model reasoning abilities?

Can models learn when to invoke search during reasoning tasks?

How should inference compute be adaptively allocated based on prompt difficulty?

How should retrieval systems optimize for multi-step reasoning during inference?

How should iterative research systems allocate reasoning per search step?

Do autonomous architecture discoveries follow predictable scaling laws?

Can the scaling law for discovery extend beyond architectures to agentic systems?

Can next-token prediction alone produce genuine language understanding?

What makes some tokens carry disproportionate information about answers?

What actually drives chain-of-thought reasoning improvements in language models?

How does the three-component definition apply to test-time scaling laws?

Can inference-time compute substitute for scaling up model parameters?

When do additional thinking tokens stop improving reasoning performance?

Does parallel reasoning outperform sequential thinking under fixed compute budgets?

How can LLM user simulators model realistic goal-driven conversation?

When does simulated search outperform real search for agent training?

Can debate mechanisms prevent silent agreement on wrong answers in multi-agent reasoning?

What role does search capacity play in making debate more accurate?

How do we evaluate AI systems when user perception misleads actual performance?

How do knowledge graphs enable efficient multi-hop reasoning over alternatives?

How do knowledge graphs scale as training data for open-ended search tasks?

What drives capability and cost efficiency in agent systems?

How do training data properties shape reasoning capability development?

How do timing and search internalization interact during reasoning post-training?

How should agents balance memory condensation to optimize context efficiency?

Should artifact-level benchmarks replace token counts for agent evaluation?

Can single-axis benchmarks accurately predict agent deployment success?

Can high benchmark scores mislead deployment decisions for search agents?

How does policy entropy collapse constrain reasoning-focused reinforcement learning?

Does policy entropy collapse prevent inference-time search from finding solutions?

When do multi-agent approaches outperform single model extended thinking?

How should experiment budgets be allocated across parallel hypothesis-testing teams?

Why do agents confidently report success despite actually failing tasks?

What other agent behaviors besides citations reveal reasoning quality?

Do harness improvements transfer across model scales or memorize shortcuts?

Do gains from harness-based agents transfer across different search benchmarks?

How does latent reasoning compare to verbalized chain-of-thought?

Can latent reasoning scale test-time compute without verbal tokens?

Can ensemble evaluation methods reduce bias more than single judges?

How does score granularity connect to verification as a scaling axis?

Related concepts in this collection 3

Can we allocate inference compute based on prompt difficulty?
Does adjusting how much compute each prompt receives—rather than using a fixed budget—improve model performance? Could smarter allocation let smaller models compete with larger ones?
extends: search budget is now a second compute axis alongside reasoning tokens; adaptive allocation must account for both

How should we balance parallel versus sequential compute at test time?
Test-time compute can prioritize breadth (trying many approaches) or depth (refining one approach). Which strategy works better, and does the answer depend on the problem?
applies: parallel retrieval (multiple query variants) vs sequential retrieval (chained iterations) is the same structural trade-off

How do internal and external test-time scaling compare?
Explores whether test-time scaling approaches fundamentally differ in where compute is spent: during training (internal) versus at inference (external). Understanding this split clarifies the trade-offs in deployment strategy and reasoning capability.
extends: search-based DR is the clearest case of external TTS; this finding quantifies its scaling behavior

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

Original note title

agentic deep research exhibits a test-time scaling law where search budget determines answer quality creating a new inference-compute axis
