Can models learn when to invoke search during reasoning tasks?
This explores whether a model can be trained to decide *for itself* when reasoning alone is enough versus when it should reach out and search — treating search as a learned action rather than a fixed pipeline step.
The corpus doesn't have a paper aimed squarely at "learning the search trigger," but it has the surrounding machinery, and read laterally, those pieces sketch a clear answer: yes, and the closest blueprint is models that already learn *when to think*. The sharpest analog is decoupled-RL routing, where a single model learns to switch between extended reasoning and quick direct answers without being told which problems are hard (Can models learn when to think versus respond quickly?). Swap "think vs. answer" for "search vs. keep reasoning" and you have the same control problem: a self-calibrated gate over an expensive action.
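To make the "decoupled" part concrete, here is a minimal sketch of one way a gate token can be weighted separately from the tokens that follow it. It is an illustration under stated assumptions, not the objective from any cited note: the function names, the `alpha` coefficient, and the plain REINFORCE-style surrogate (no clipping, no KL term) are all hypothetical. The first function shows how a single token-averaged loss shrinks the gate token's share as the response grows; the second gives the gate token its own fixed weight.

```python
def joint_mode_weight(response_len: int) -> float:
    """Share of a token-averaged loss that lands on the single gate token."""
    return 1.0 / (response_len + 1)


def decoupled_loss(mode_logprob, response_logprobs, advantage, alpha=1.0):
    """REINFORCE-style surrogate with the gate token weighted separately.

    mode_logprob: log-probability of the chosen gate token
        (hypothetically "search" vs. "keep reasoning").
    response_logprobs: log-probabilities of the tokens that follow.
    alpha: hypothetical coefficient on the gate term.
    """
    # The gate term does not depend on how long the response is.
    mode_term = -alpha * advantage * mode_logprob
    # Response tokens are averaged on their own, apart from the gate token.
    response_term = -advantage * sum(response_logprobs) / len(response_logprobs)
    return mode_term + response_term
```

With a joint average, a 999-token response leaves the gate token a 1/1000 share of the loss, while a 9-token response leaves it 1/10. In the decoupled form the gate term is the same in both cases.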
What makes that gate worth learning is that search behaves like reasoning, economically. Agentic deep-research systems show that search budget follows the same test-time scaling curve as reasoning tokens (more search iterations help, with diminishing returns), which means search and reasoning are interchangeable axes of inference compute that the model can trade against each other (Does search budget scale like reasoning tokens for answer quality?). Once two actions sit on comparable cost-benefit curves, "when to spend on which" becomes a learnable allocation policy, not a hardcoded rule.
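As a toy illustration of that allocation problem, the sketch below spends a fixed budget one unit at a time on whichever action currently has the larger marginal gain. Everything in it is an assumption made for illustration: the logarithmic gain curve stands in for "diminishing returns," one budget unit buys either one chunk of reasoning or one search call, and the weights are made-up knobs rather than measured values. A learned policy would replace the hand-written comparison.

```python
import math


def marginal_gain(weight, units_spent):
    """Gain from one more unit under a hypothetical log-shaped return curve."""
    return weight * (math.log1p(units_spent + 1) - math.log1p(units_spent))


def allocate(budget_units, reason_weight=1.0, search_weight=1.0):
    """Greedily split a budget between reasoning chunks and search calls."""
    reason_chunks, search_calls = 0, 0
    for _ in range(budget_units):
        # Spend the next unit where the marginal gain is strictly larger;
        # ties go to reasoning.
        if marginal_gain(search_weight, search_calls) > marginal_gain(
            reason_weight, reason_chunks
        ):
            search_calls += 1
        else:
            reason_chunks += 1
    return reason_chunks, search_calls
```

With equal weights the budget splits evenly, because each unit spent on one axis lowers that axis's next marginal gain. Setting either weight to zero sends the whole budget to the other axis.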
The catch is that naive reasoning and search interfere. When an agent reasons without limit inside a single search turn, it burns the context window it needs to absorb the next round of retrieved evidence, so the fix is per-turn reasoning budgets, not just an overall time cap (Does limiting reasoning per turn improve multi-turn search quality?). That's a strong hint that *when* to stop reasoning and go search is itself a decision with real consequences, exactly the kind of thing a learned policy should manage rather than leave to chance.
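The context-erosion effect can be sketched with a small loop. This is a simplified model, not an implementation of any system in the cited note: `reason` and `search` are hypothetical callables supplied by the caller, tokens are approximated by whitespace-separated words, and a turn is dropped entirely if its reasoning plus evidence would overflow the context.

```python
def truncate(text, max_tokens):
    """Keep at most max_tokens whitespace-separated tokens."""
    return " ".join(text.split()[:max_tokens])


def run_turns(question, reason, search, context_limit, per_turn_cap, max_turns):
    """Alternate reasoning and search, capping reasoning within each turn.

    reason(context) -> str and search(query) -> str are hypothetical stand-ins
    for the model's reasoning step and the retrieval call.
    """
    context = [question]
    used = len(question.split())
    for _ in range(max_turns):
        thought = truncate(reason(context), per_turn_cap)
        evidence = search(thought)
        needed = len(thought.split()) + len(evidence.split())
        if used + needed > context_limit:
            break  # no room left to absorb this turn's evidence
        context += [thought, evidence]
        used += needed
    return context
```

With a 1,200-token context, a reasoner that emits 1,000 tokens per turn, and 50-token search results, an effectively uncapped run absorbs one round of evidence before the context is full. Capping reasoning at 50 tokens per turn lets all four rounds fit.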
There's a deeper reason to expect this is trainable: the underlying capability is probably already latent. Base models appear to contain reasoning ability that minimal post-training merely *elicits* rather than installs (Do base models already contain hidden reasoning ability?), and reasoning generalizes from broad procedural knowledge picked up in pretraining rather than from memorized facts (Does procedural knowledge drive reasoning more than factual retrieval?). If knowing *how* to proceed is procedural and already present, then knowing *when to look something up* is plausibly the same kind of procedural skill waiting to be selected, and reward signals that need no human labels, like the model's own answer confidence, give you a way to train that judgment cheaply (Can model confidence work as a reward signal for reasoning?).
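A label-free signal of this kind can be sketched as a ranking over sampled traces. The code below is a hedged illustration rather than the procedure from the cited note: the trace layout (`id`, `answer_logprobs`), the choice of mean token probability as the confidence measure, and the `min_gap` threshold are all assumptions introduced here.

```python
import math
from itertools import combinations


def answer_confidence(answer_logprobs):
    """Mean per-token probability over the answer span (one possible measure)."""
    return sum(math.exp(lp) for lp in answer_logprobs) / len(answer_logprobs)


def preference_pairs(traces, min_gap=0.05):
    """Build (chosen_id, rejected_id) pairs from confidence alone.

    traces: list of dicts with hypothetical keys "id" and "answer_logprobs".
    min_gap: skip pairs whose confidence difference is too small to trust.
    """
    scored = sorted(
        ((answer_confidence(t["answer_logprobs"]), t["id"]) for t in traces),
        reverse=True,
    )
    pairs = []
    # combinations() preserves input order, so the first element of each
    # pair always has confidence at least as high as the second.
    for (hi_conf, hi_id), (lo_conf, lo_id) in combinations(scored, 2):
        if hi_conf - lo_conf >= min_gap:
            pairs.append((hi_id, lo_id))
    return pairs
```

The resulting pairs could feed any preference-based trainer. The same ranking could in principle be applied to traces that differ in whether and when they searched, though that extension is speculation here.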
The unsettling footnote: the corpus also warns that reasoning traces are partly theater. Corrupted, semantically wrong traces train nearly as well as correct ones, and traces read more as stylistic mimicry than as a faithful window into computation (Do reasoning traces need to be semantically correct?; Do reasoning traces show how models actually think?). So a model that *says* "I should search here" may be performing the right-looking gesture rather than acting on a genuine internal estimate of its own ignorance, which is precisely why a learned trigger should be tied to outcome rewards (did the answer improve?) rather than to whether the reasoning narration looks sensible.
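One way to express "tie the trigger to outcomes" is a reward that looks only at whether the final answer was correct and whether a search was paid for, never at the narration. The sketch below assumes a hypothetical flat `search_cost` and assumes both branches (with and without search) can be run for the same question. Neither assumption comes from the cited notes.

```python
def trigger_reward(searched, correct, search_cost=0.2):
    """Outcome-only reward: correctness minus a hypothetical cost per search."""
    return (1.0 if correct else 0.0) - (search_cost if searched else 0.0)


def advantage_of_search(correct_with_search, correct_without_search, search_cost=0.2):
    """How much better (or worse) searching was than not searching.

    Positive only when search changed the outcome by more than it cost.
    """
    return trigger_reward(True, correct_with_search, search_cost) - trigger_reward(
        False, correct_without_search, search_cost
    )
```

Under this reward, a search that turns a wrong answer into a right one earns a positive advantage, while a search on a question the model would have answered correctly anyway is penalized by its cost, however sensible the stated reason for searching sounded.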
Sources (8 notes)
Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.
Agentic deep research shows that answer quality rises with the number of search iterations but with diminishing returns, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.
Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
LLM reasoning traces function as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.