INQUIRING LINE

Can models learn when to invoke search during reasoning tasks?

This explores whether a model can be trained to decide *for itself* when reasoning alone is enough versus when it should reach out and search — treating search as a learned action rather than a fixed pipeline step.


This explores whether a model can learn to decide when to invoke search mid-reasoning, rather than always searching or never searching. The corpus doesn't have a paper aimed squarely at "learning the search trigger," but it has the surrounding machinery, and read laterally, those pieces sketch a clear answer: yes, and the closest blueprint is models that already learn *when to think*. The sharpest analog is decoupled-RL routing, where a single model learns to switch between extended reasoning and quick direct answers without being told which problems are hard (see "Can models learn when to think versus respond quickly?"). Swap "think vs. answer" for "search vs. keep reasoning" and you have the same control problem: a self-calibrated gate over an expensive action.

What makes that gate worth learning is that search behaves like reasoning, economically. Agentic deep-research systems show that search budget follows the same test-time scaling curve as reasoning tokens: more search iterations help, with diminishing returns. That makes search and reasoning two axes of inference compute that the model can trade against each other (see "Does search budget scale like reasoning tokens for answer quality?"). Once two actions sit on comparable cost-benefit curves, "when to spend on which" becomes a learnable allocation policy, not a hardcoded rule.
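The trade-off described above can be made concrete with a toy allocation problem. This is a minimal sketch, not anything taken from the cited notes: the log-shaped `quality` curve, the token-equivalent `search_cost`, and every constant are assumptions chosen only to show that two diminishing-returns axes produce a mixed optimum.

```python
import math


def quality(reason_tokens: int, search_calls: int) -> float:
    """Hypothetical answer-quality curve: log-shaped gains on both axes."""
    return 0.5 * math.log1p(reason_tokens / 500) + 0.5 * math.log1p(search_calls)


def best_split(budget: int, search_cost: int = 1000):
    """Try every split of a fixed budget (in token-equivalents) and keep the best.

    Returns (quality, reason_tokens, search_calls).
    """
    best = None
    for search_calls in range(budget // search_cost + 1):
        reason_tokens = budget - search_calls * search_cost
        q = quality(reason_tokens, search_calls)
        if best is None or q > best[0]:
            best = (q, reason_tokens, search_calls)
    return best


q, reason_tokens, search_calls = best_split(8000)
print(f"reason_tokens={reason_tokens}, search_calls={search_calls}, quality={q:.3f}")
```

Under these assumed curves, neither "all reasoning" nor "all search" wins. A learned policy would have to discover the real curves from outcomes, since they are not known in advance and likely differ by question.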

The catch is that naive reasoning and search interfere. When an agent reasons without limit inside a single search turn, it burns the context window it needs to absorb the next round of retrieved evidence, so the fix is per-turn reasoning budgets, not just an overall time cap (see "Does limiting reasoning per turn improve multi-turn search quality?"). That's a strong hint that *when* to stop reasoning and go search is itself a decision with real consequences, exactly the kind of thing a learned policy should manage rather than leave to chance.
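A rough way to see the interference is to count how many search turns fit in a fixed context window with and without a per-turn cap. The sketch below is an illustration only: `run_turns`, the token counts, and the simple additive accounting are all assumptions, not a description of any cited system.

```python
def run_turns(context_limit, evidence_per_turn, reasoning_per_turn, per_turn_cap=None):
    """Count the search turns that fit before the context window is exhausted.

    reasoning_per_turn lists how many tokens the agent *wants* to spend reasoning
    in each turn; per_turn_cap (if set) truncates each of those requests.
    """
    used = 0
    turns = 0
    for wanted in reasoning_per_turn:
        spent = wanted if per_turn_cap is None else min(wanted, per_turn_cap)
        if used + spent + evidence_per_turn > context_limit:
            break  # no room left to absorb the next batch of retrieved evidence
        used += spent + evidence_per_turn
        turns += 1
    return turns


wanted = [12000] * 4  # hypothetical: the agent would happily reason 12k tokens per turn
uncapped = run_turns(32000, 2000, wanted)
capped = run_turns(32000, 2000, wanted, per_turn_cap=3000)
print(uncapped, capped)
```

With these made-up numbers the uncapped agent runs out of room after two turns, while the capped agent completes all four. A global cap alone would not prevent the first turns from crowding out later evidence.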

There's a deeper reason to expect this is trainable: the underlying capability is probably already latent. Base models appear to contain reasoning ability that minimal post-training merely *elicits* rather than installs (see "Do base models already contain hidden reasoning ability?"), and reasoning generalizes from broad procedural knowledge picked up in pretraining rather than from memorized facts (see "Does procedural knowledge drive reasoning more than factual retrieval?"). If knowing *how* to proceed is procedural and already present, then knowing *when to look something up* is plausibly the same kind of procedural skill waiting to be selected, and reward signals that need no human labels, like the model's own answer confidence, give you a way to train that judgment cheaply (see "Can model confidence work as a reward signal for reasoning?").
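The label-free reward idea can be sketched as ranking sampled traces by how confident the model was in its final answer, then turning that ranking into preference pairs. This is a hedged illustration of the general idea only. The `Trace` type, the geometric-mean scoring rule, and the all-pairs construction are assumptions, and none of it is presented as the RLSF implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class Trace:
    text: str
    answer_logprobs: list[float]  # log-probabilities of the answer-span tokens


def answer_confidence(trace: Trace) -> float:
    """Geometric-mean probability of the answer span (an assumed scoring rule)."""
    return math.exp(sum(trace.answer_logprobs) / len(trace.answer_logprobs))


def preference_pairs(traces: list[Trace]) -> list[tuple[Trace, Trace]]:
    """Return (chosen, rejected) pairs, preferring the more confident trace."""
    ranked = sorted(traces, key=answer_confidence, reverse=True)
    return [
        (ranked[i], ranked[j])
        for i in range(len(ranked))
        for j in range(i + 1, len(ranked))
    ]


samples = [
    Trace("reason only", [-2.0, -2.0]),
    Trace("search then answer", [-0.1, -0.1]),
    Trace("search twice", [-1.0, -1.0]),
]
for chosen, rejected in preference_pairs(samples):
    print(f"{chosen.text!r} preferred over {rejected.text!r}")
```

Pairs like these could feed a standard preference-optimization step. Whether confidence is a trustworthy proxy for "search was needed here" is an empirical question this sketch does not answer.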

The unsettling footnote: the corpus also warns that reasoning traces are partly theater. Corrupted, semantically wrong traces train nearly as well as correct ones, and traces read more as stylistic mimicry than as a faithful window into computation (see "Do reasoning traces need to be semantically correct?" and "Do reasoning traces show how models actually think?"). So a model that *says* "I should search here" may be performing the right-looking gesture rather than acting on a genuine internal estimate of its own ignorance, which is precisely why a learned trigger should be tied to outcome rewards (did the answer improve?) rather than to whether the reasoning narration looks sensible.
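An outcome-tied trigger reward can be sketched in a few lines. The version below is an assumption-laden illustration: `trigger_reward`, `search_advantage`, the flat `search_cost`, and the rollout dictionaries are hypothetical. The one design point it encodes is that the `narration` field is never read, so the signal depends only on whether searching changed the outcome.

```python
from statistics import mean


def trigger_reward(searched: bool, correct: bool, search_cost: float = 0.2) -> float:
    """Outcome-only reward: pay for correctness, charge for searching."""
    return (1.0 if correct else 0.0) - (search_cost if searched else 0.0)


def search_advantage(rollouts: list[dict], search_cost: float = 0.2) -> float:
    """Mean reward with search minus mean reward without, for one question.

    Assumes the rollouts contain at least one episode of each kind.
    """
    with_search = [
        trigger_reward(True, r["correct"], search_cost) for r in rollouts if r["searched"]
    ]
    without = [
        trigger_reward(False, r["correct"], search_cost) for r in rollouts if not r["searched"]
    ]
    return mean(with_search) - mean(without)


hard_question = [
    {"searched": True, "correct": True, "narration": "I should search here."},
    {"searched": True, "correct": True, "narration": "No need to search."},
    {"searched": False, "correct": False, "narration": "I should search here."},
    {"searched": False, "correct": False, "narration": "No need to search."},
]
easy_question = [
    {"searched": True, "correct": True, "narration": "I should search here."},
    {"searched": False, "correct": True, "narration": "I should search here."},
]
print(search_advantage(hard_question), search_advantage(easy_question))
```

In this toy data the advantage is positive where search changes the outcome and negative where it only adds cost, regardless of what the narration claims.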


Sources (8 notes)

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Does search budget scale like reasoning tokens for answer quality?

Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.

Does limiting reasoning per turn improve multi-turn search quality?

Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Do reasoning traces show how models actually think?

LLM reasoning traces function as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher tasked with re-evaluating whether models can learn *when* to invoke search during reasoning—a question a curated library explored via papers on learned reasoning triggers and test-time compute allocation (2024–2026). Treat the findings below as dated claims; your job is to test whether newer models, training methods, or evals have shifted the answer.

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026 and center on four constraints:
• Models can learn *when to think* via decoupled RL routing, treating "search vs. reason" as a learnable allocation policy on comparable cost-benefit curves (~2025, arXiv:2505.13379).
• Search and reasoning share a test-time scaling law; per-turn reasoning budgets (not global caps) are needed to prevent context interference (~2025, arXiv:2506.18959).
• Reasoning ability is latent in base models and elicited by minimal post-training; procedural knowledge from pretraining drives generalization (~2024–2026, arXiv:2411.12580, arXiv:2604.15726).
• Reasoning traces are partly stylistic mimicry; corrupted traces train nearly as well as correct ones, so learned triggers should rely on outcome rewards, not trace semantics (~2025, arXiv:2505.20296).

Anchor papers (verify; mind their dates):
• arXiv:2505.13379 (Thinkless, ~May 2025): decoupled RL learns when to engage extended thinking.
• arXiv:2506.18959 (Agentic Deep Research, ~June 2025): search and reasoning exhibit shared test-time scaling.
• arXiv:2604.15726 (LLM Reasoning Is Latent, ~April 2026): reasoning resides in base models, not chain-of-thought.
• arXiv:2507.02962 (RAG-R1, ~July 2025): multi-query parallelism incentivizes search–reasoning coupling.

Your task:
(1) RE-TEST EACH CONSTRAINT. For "decoupled RL learns when to trigger," does newer work on multi-agent orchestration, memory hierarchies, or adaptive routing show this scales beyond toy domains? For "per-turn budgets prevent interference," have longer context windows or flash-attention variants dissolved that bottleneck? For "latent reasoning," do recent evals still find base-model capability or has scaling changed the picture? For "traces are theater," do outcome-weighted RL methods (vs. supervised fine-tuning) actually solve the spurious-correlation problem, or do they merely hide it? Cite what resolves each; state plainly where the constraint still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—papers that claim models *cannot* learn search triggers, or that hardcoded rules outperform learned policies.
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., "Can a single unified policy learn to allocate compute across search, reasoning, and tool use simultaneously?" or "Do learned search triggers transfer across domains, or are they brittle to distribution shift?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
