SYNTHESIS NOTE

Do search steps follow the same scaling rules as reasoning tokens?

Exploring whether the overthinking curve observed in reasoning models also appears in deep research agents. This matters because it could reveal universal scaling laws governing all inference-time compute.

Synthesis note · 2026-02-21 · sourced from Deep Research

Writing angle — Medium/LinkedIn post.

Hook: The overthinking papers showed that more reasoning tokens help, until they don't. Now the same curve is showing up in a completely different place: search. Deep research agents improve with more search budget, following the same rise-then-degrade relationship. Scaling laws aren't just for training anymore. They're for every inference loop.

The claim: Test-time scaling generalizes from single-query reasoning to multi-step retrieval. The "search budget law" (Agentic Deep Research paper) shows that answer quality scales with search steps in a way that mirrors the relationship between reasoning quality and thinking tokens.
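One way to make the claim testable is to sweep the search budget, record answer quality at each setting, and check where the curve peaks and whether it falls afterwards. The sketch below assumes such a sweep already exists as a budget-to-score mapping. The function names and the numbers in `toy_sweep` are invented for illustration and are not results from the Agentic Deep Research paper or any other source.

```python
from typing import Mapping


def peak_budget(quality_by_budget: Mapping[int, float]) -> int:
    """Return the smallest search budget that reaches the best measured quality."""
    best = max(quality_by_budget.values())
    return min(b for b, q in quality_by_budget.items() if q == best)


def degrades_after_peak(quality_by_budget: Mapping[int, float]) -> bool:
    """True if any budget larger than the peak budget scores below the peak."""
    peak = peak_budget(quality_by_budget)
    best = quality_by_budget[peak]
    return any(q < best for b, q in quality_by_budget.items() if b > peak)


# Placeholder numbers made up for this example; NOT measurements from any paper.
toy_sweep = {1: 0.40, 2: 0.55, 4: 0.63, 8: 0.66, 16: 0.61}

print(peak_budget(toy_sweep))          # 8
print(degrades_after_peak(toy_sweep))  # True
```

Running the same two checks on a reasoning-token sweep and on a search-step sweep would show whether both curves share the rise-then-degrade shape.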

Why it matters:

It means inference-compute optimization now has two levers: reasoning budget and search budget. The old question was "how many tokens should we spend thinking?" The new question is "how many retrieval rounds should we run, and how much reasoning should each round get?"
It raises the same ceiling question: if reasoning has an overthinking threshold, does search? ASearcher's turn-limit finding suggests yes: unrestricted per-turn reasoning in iterative search loops degrades the quality of the overall search, which means a search version of overthinking exists too.
It reframes deep research (DR) quality as an infrastructure decision as much as a model decision. A weaker model with a larger search budget can match a stronger model with a smaller one.
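The two levers can be pictured as two separate knobs on an agent loop. The following is a minimal sketch, not any paper's implementation: `Budget`, `Trace`, `run_research`, and the stub callables are all hypothetical names, and counting whitespace-separated words is only a crude stand-in for real token counting.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Budget:
    max_search_rounds: int     # lever 1: how many retrieval rounds to run
    max_reasoning_tokens: int  # lever 2: reasoning allowed per round


@dataclass
class Trace:
    rounds_used: int = 0
    notes: List[str] = field(default_factory=list)


def run_research(
    question: str,
    budget: Budget,
    reason: Callable[[str, List[str]], str],  # hypothetical: produces a query
    search: Callable[[str], str],             # hypothetical: returns evidence
    is_done: Callable[[List[str]], bool],     # hypothetical: stop condition
) -> Trace:
    """Run search rounds until the stop condition or the round budget is hit."""
    trace = Trace()
    while trace.rounds_used < budget.max_search_rounds and not is_done(trace.notes):
        thought = reason(question, trace.notes)
        # Enforce the per-round cap by truncating to the first N words.
        query = " ".join(thought.split()[: budget.max_reasoning_tokens])
        trace.notes.append(search(query))
        trace.rounds_used += 1
    return trace


# Toy stubs so the sketch runs; a real agent would call a model and a retriever.
def toy_reason(question: str, notes: List[str]) -> str:
    return f"look up {question} step {len(notes) + 1} and also consider many tangents"


def toy_search(query: str) -> str:
    return f"result for: {query}"


def toy_is_done(notes: List[str]) -> bool:
    return len(notes) >= 3


trace = run_research("search budget law", Budget(5, 4), toy_reason, toy_search, toy_is_done)
print(trace.rounds_used)  # 3
print(trace.notes[0])     # result for: look up search budget
```

Keeping the two limits as separate fields makes it possible to sweep them independently, which is what comparing total search budget against per-turn allocation would require.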

The synthesis: Two notes, "Does search budget scale like reasoning tokens for answer quality?" and "Does limiting reasoning per turn improve multi-turn search quality?", together make the full argument: search has its own test-time scaling (TTS) curve, that curve follows a similar shape, and it has its own overthinking variant.

Inquiring lines that read this note 48

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

When do additional thinking tokens stop improving reasoning performance?

How should iterative research systems allocate reasoning per search step?

Do autonomous architecture discoveries follow predictable scaling laws?

Why do self-improving systems struggle without clear external performance metrics?

Can bilevel autoresearch discover new search mechanisms for the inner research loop?

Why do agents confidently report success despite actually failing tasks?

How does example difficulty affect learning efficiency in language models?

Do task-specific heuristics improve gradually or appear suddenly at scale?

Can inference-time compute substitute for scaling up model parameters?

Can model confidence signals reliably improve reasoning quality and calibration?

When does architectural design matter more than raw model capacity?

How can identical external performance mask different internal representations?

Why do scaling laws show capability saturation at specific thresholds?

How do we evaluate AI systems when user perception misleads actual performance?

Why do reasoning models fail at systematic problem-solving and search?

Why do reasoning chains degenerate into undirected exploration at scale?

How does reasoning graph topology affect breakthrough insights and generalization?

What distinguishes systematic search from wandering exploration in reasoning?

Which computational strategies best support reasoning in language models?

How many particles and iterations does optimal expert discovery require?

What limits mechanistic interpretability's ability to characterize models?

Which hyperparameter theories best explain universal behaviors across neural networks?

How do training data properties shape reasoning capability development?

How do timing and search internalization interact during reasoning post-training?

Why do continual learning scenarios trigger catastrophic forgetting and interference?

How does policy entropy collapse constrain reasoning-focused reinforcement learning?

Does policy entropy collapse prevent inference-time search from finding solutions?

When do multi-agent approaches outperform single model extended thinking?

How does multi-agent reasoning scale compared to single-model approaches?

Related concepts in this collection 3

Does search budget scale like reasoning tokens for answer quality? Explores whether the test-time scaling law that applies to reasoning tokens also governs search-based retrieval in agentic systems. Understanding this relationship could reshape how we allocate inference compute between thinking and searching.
grounds this angle
Does more thinking time actually improve LLM reasoning? The intuition that extended thinking helps LLMs reason better seems obvious, but what does the empirical data actually show when we test it directly?
extends: search faces the same assumption; the search budget law makes it empirically testable in the retrieval domain
Does limiting reasoning per turn improve multi-turn search quality? When language models engage in iterative search cycles, does capping reasoning at each turn—rather than just total compute—help preserve context for subsequent retrievals and improve overall search effectiveness?
provides the nuance: budget matters but so does per-turn allocation

Original note title

the search budget law — why deep research agents follow the same scaling rules as reasoning models
