Do search steps follow the same scaling rules as reasoning tokens?
Exploring whether the overthinking curve observed in reasoning models also appears in deep research agents. This matters because it could reveal universal scaling laws governing all inference-time compute.
Writing angle — Medium/LinkedIn post.
Hook: The overthinking papers showed that more reasoning tokens help, until they don't. Now the same curve is showing up in a completely different place: search. Deep research agents improve with more search budget, following the same improve-then-degrade relationship. Scaling laws aren't just for training anymore. They're for every inference loop.
The claim: Test-time scaling generalizes from single-query reasoning to multi-step retrieval. The "search budget law" (from the paper From Web Search towards Agentic Deep Research) shows that answer quality scales with the number of search steps in a way that mirrors the relationship between reasoning quality and thinking tokens.
Why it matters:
- It means inference-compute optimization now has two levers: reasoning budget and search budget. The old question was "how many tokens should we spend thinking?" The new question is "how many retrieval rounds should we run, and how much reasoning per round?"
- It raises the same ceiling question: if reasoning has an overthinking threshold, does search? ASearcher's turn-limit finding suggests yes: unrestricted per-turn reasoning in iterative search loops degrades overall search quality, which means a search version of overthinking exists too.
- It reframes deep research (DR) quality as an infrastructure decision as much as a model decision. A weaker model with a larger search budget can match a stronger model with a smaller budget.
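The two levers in the list above can be sketched as a loop with two independent budgets. This is a minimal Python sketch under stated assumptions, not the method of ASearcher or of the Agentic Deep Research paper: `Budget`, `Trace`, `run_research`, and the `reason` and `search` callables are hypothetical names introduced here for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Budget:
    max_rounds: int        # search budget: retrieval rounds allowed
    tokens_per_round: int  # reasoning budget: cap on thinking per round


@dataclass
class Trace:
    notes: List[str] = field(default_factory=list)
    rounds_used: int = 0
    tokens_used: int = 0


# Hypothetical signatures (assumptions, not a real library API):
#   reason(question, notes, token_cap) -> (next_query or None, tokens_spent)
#   search(query) -> retrieved text
ReasonFn = Callable[[str, List[str], int], Tuple[Optional[str], int]]
SearchFn = Callable[[str], str]


def run_research(question: str, budget: Budget,
                 reason: ReasonFn, search: SearchFn) -> Trace:
    """Run at most `max_rounds` search rounds, capping reasoning in each."""
    trace = Trace()
    for _ in range(budget.max_rounds):
        query, spent = reason(question, trace.notes, budget.tokens_per_round)
        # Account for at most the per-round cap, even if the stub reports more.
        trace.tokens_used += min(spent, budget.tokens_per_round)
        if query is None:  # the reasoner decided it has enough evidence
            break
        trace.notes.append(search(query))
        trace.rounds_used += 1
    return trace
```

Keeping the two limits as separate fields makes it possible to sweep one while holding the other fixed, which is the comparison the "two levers" framing asks for.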
The synthesis: Two notes, "Does search budget scale like reasoning tokens for answer quality?" and "Does limiting reasoning per turn improve multi-turn search quality?", together make the full argument: search has its own test-time scaling (TTS) curve, that curve has a similar shape, and it has its own overthinking variant.
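One way to check the "similar shape" claim on your own measurements is to look for the budget where quality peaks and the first larger budget where it falls. The helper below is a hedged sketch: `peak_budget` and its `tolerance` parameter are hypothetical names, and the function assumes you already have (budget, score) pairs from your own evaluation runs; it does not reproduce any result from the cited papers.

```python
from typing import Optional, Sequence, Tuple


def peak_budget(points: Sequence[Tuple[int, float]],
                tolerance: float = 0.0) -> Tuple[int, Optional[int]]:
    """Return (best_budget, first_degrading_budget or None).

    points: (budget, score) pairs, e.g. search rounds vs. answer quality.
    tolerance: how far below the peak a score must fall to count as degraded.
    """
    if not points:
        raise ValueError("need at least one (budget, score) point")
    ordered = sorted(points)  # ascending by budget
    # On ties, max() keeps the first maximum, i.e. the smallest budget.
    best_budget, best_score = max(ordered, key=lambda p: p[1])
    degrading = next(
        (b for b, s in ordered
         if b > best_budget and s < best_score - tolerance),
        None,
    )
    return best_budget, degrading
```

If the second element is `None`, the measured curve is still monotonic over the tested range, so no overthinking threshold has been observed yet. Running the same helper on reasoning-token sweeps and on search-round sweeps gives a like-for-like comparison of the two curves.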
Inquiring lines that use this note as a source (44)
This note is a source for these synthesized inquiries.
- How do thinking tokens exhibit diminishing returns beyond a critical threshold?
- Why does retrieval chain training unlock scaling laws in QA?
- Can the scaling law for discovery extend beyond architectures to agentic systems?
- Can bilevel autoresearch discover new search mechanisms for the inner research loop?
- What distinguishes strategic fabrication from accidental hallucination in research agents?
- How do real search queries reveal what counts as a deep research question?
- Do task-specific heuristics improve gradually or appear suddenly at scale?
- Does inference-time compute scaling require explicit reasoning traces or verifiable rewards?
- Can multi-agent reasoning systems scale beyond current architectures?
- What role does confidence play in balancing overthinking versus underthinking?
- Does test-time compute scaling work for agentic deep research tasks?
- Does trading model size for inference steps improve overall efficiency scaling?
- How does overthinking in early turns degrade later retrieval rounds?
- Do autonomous architecture discoveries follow predictable scaling laws like human research?
- How does the Ladder of Scales approach reduce search costs across model sizes?
- How does test-time scaling relate to token budget in agentic deep research?
- Why do scaling laws show capability saturation at specific thresholds?
- Can extended deliberation in agents become counterproductive like human overthinking?
- How does speed of AI search prevent real-time supervision and evaluation?
- What scaling laws govern autonomous architecture discovery in AI systems?
- Why do reasoning chains degenerate into undirected exploration at scale?
- Why does overthinking degrade performance at extreme recursion depths?
- What distinguishes systematic search from wandering exploration in reasoning?
- How many particles and iterations does optimal expert discovery require?
- What makes search budget matter for research task performance?
- Do search agents face their own overthinking threshold like reasoning models do?
- What is the optimal balance between search rounds and reasoning depth per round?
- Why does more inference compute amplify wandering rather than solving it?
- Why do per-turn thinking budgets matter alongside iterative retrieval depth?
- Can runtime confidence signals detect when reasoning has crossed the overthinking threshold?
- Why do different model training approaches produce different overthinking thresholds?
- Does brute force experimentation substitute for research intuition and taste?
- Why do deep research agents outperform retrieval augmented generation systems?
- Can test-time scaling work through retrieval rather than reasoning?
- Which hyperparameter theories best explain universal behaviors across neural networks?
- How do timing and search internalization interact during reasoning post-training?
- What inference-time scaling benefits emerge from reasoning before each prediction?
- Why should scaling laws be understood as properties of data distribution rather than training in general?
- Does policy entropy collapse prevent inference-time search from finding solutions?
- How do search and reasoning workflows improve forecasting performance over base models?
- What power-law scaling patterns emerge when consistency models are trained at scale?
- Why do optimal learning dynamics improve scaling law coefficients specifically?
- What other agent behaviors besides citations reveal reasoning quality?
- Do scaling laws change when weight precision becomes a design variable?
Related concepts in this collection (3)
- Does search budget scale like reasoning tokens for answer quality?
  Explores whether the test-time scaling law that applies to reasoning tokens also governs search-based retrieval in agentic systems. Understanding this relationship could reshape how we allocate inference compute between thinking and searching.
  grounds this angle
- Does more thinking time actually improve LLM reasoning?
  The intuition that extended thinking helps LLMs reason better seems obvious, but what does the empirical data actually show when we test it directly?
  extends: search faces the same assumption; the search budget law makes it empirically testable in the retrieval domain
- Does limiting reasoning per turn improve multi-turn search quality?
  When language models engage in iterative search cycles, does capping reasoning at each turn, rather than just total compute, help preserve context for subsequent retrievals and improve overall search effectiveness?
  provides the nuance: budget matters but so does per-turn allocation
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents
- Reasoning Models Can Be Effective Without Thinking
- ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
- Large Language Models Think Too Fast To Explore Effectively
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- Does Thinking More always Help? Understanding Test-Time Scaling in Reasoning Models
- Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens
- Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs
Original note title
the search budget law — why deep research agents follow the same scaling rules as reasoning models