SYNTHESIS NOTE
Reasoning, Retrieval, and Evaluation · Training, RL, and Test-Time Scaling · Model Architecture and Internals

Why do search agents beat memorized retrieval on hard questions?

Deep research agents trained on live web search outperform models fine-tuned on static knowledge. Does real-world RL's advantage come from smarter reasoning, or from bypassing the limitations of memorized facts?

Synthesis note · 2026-02-21 · sourced from Deep Research

The DeepResearcher paper trains RL agents in live web search environments rather than simulated offline retrieval. The result: these agents outperform models fine-tuned on static knowledge on knowledge-intensive tasks. The mechanism is not that real-world RL produces a smarter reasoner — it is that real-world search bypasses the bottleneck that memorized retrieval creates.

Memorized knowledge has two failure modes that real-time search does not share. First, it is temporally bounded: anything that postdates training is simply absent. Second, it is probabilistically compressed: details that appear infrequently in training data are underrepresented or confabulated. Real-time search has neither constraint. When a query requires a specific fact from a recent paper or a niche domain, the search agent retrieves it rather than reconstructing it from training distribution.

This reframes what "knowledge-intensive" means for evaluation. A task that looks hard because it requires obscure facts is not testing reasoning ability — it is testing retrieval coverage. A model that scores poorly may reason perfectly well but have a knowledge gap. The DeepResearcher finding suggests the better benchmark design is to evaluate reasoning under conditions where retrieval is available, not reasoning alone.
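One way to make this distinction concrete is to score each question twice, once closed-book and once with retrieval available, and then label the outcome. The sketch below is illustrative only: the category names and the `diagnose` and `summarize` helpers are hypothetical and do not come from the DeepResearcher paper.

```python
from collections import Counter
from typing import Iterable, Tuple


def diagnose(closed_book_correct: bool, open_book_correct: bool) -> str:
    """Label one question by how the model fared with and without retrieval."""
    if closed_book_correct and open_book_correct:
        return "solved_either_way"
    if open_book_correct:
        # Wrong from memory, right with search: a coverage gap, not a reasoning gap.
        return "knowledge_gap"
    if closed_book_correct:
        # Right from memory, wrong with search: retrieval introduced an error.
        return "retrieval_hurt"
    # Wrong both ways: could be a reasoning failure or a retrieval failure.
    return "unsolved_with_retrieval"


def summarize(results: Iterable[Tuple[bool, bool]]) -> Counter:
    """Count labels over (closed_book_correct, open_book_correct) pairs."""
    return Counter(diagnose(closed, opened) for closed, opened in results)


# Made-up outcomes for four questions, for illustration only.
example = [(True, True), (False, True), (False, True), (False, False)]
print(summarize(example))
```

A high share of `knowledge_gap` labels would indicate that a benchmark is mostly measuring retrieval coverage. Only the questions that stay unsolved with retrieval available say anything about reasoning, and even those need further inspection to separate reasoning errors from failed searches.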

The implication for deployment: model capability and retrieval access are substitutes, not complements, for factual tasks. Adding search to a mid-sized model may close the gap with a larger model that lacks search. The investment calculus shifts from training compute toward inference infrastructure.

UR2's difficulty-aware curriculum introduces a refinement: retrieval should be triggered selectively by query difficulty, not always. Easy questions can be answered from parametric knowledge; only hard questions warrant retrieval. This means parametric knowledge and external retrieval are not just substitutes at the system level — they are per-instance alternatives that a trained policy can select between. The per-instance switching policy further shifts the investment calculus toward smart retrieval routing rather than maximum retrieval coverage.
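The per-instance choice described above can be sketched as a small router. This is a hand-written stand-in, not UR2's method: UR2 learns the policy through training, whereas the sketch assumes a hypothetical confidence score and a fixed threshold. The names `route`, `RoutedAnswer`, `confidence_threshold`, and the toy answer functions are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class RoutedAnswer:
    answer: str
    used_retrieval: bool


def route(
    question: str,
    answer_parametric: Callable[[str], Tuple[str, float]],
    answer_with_search: Callable[[str], str],
    confidence_threshold: float = 0.7,  # assumed value, not from the source
) -> RoutedAnswer:
    """Answer from parametric knowledge when confident, otherwise retrieve."""
    draft, confidence = answer_parametric(question)
    if confidence >= confidence_threshold:
        return RoutedAnswer(draft, used_retrieval=False)
    return RoutedAnswer(answer_with_search(question), used_retrieval=True)


# Toy stand-ins for a model and a search tool (hypothetical).
def toy_parametric(question: str) -> Tuple[str, float]:
    known = {"capital of France?": ("Paris", 0.95)}
    return known.get(question, ("unknown", 0.1))


def toy_search(question: str) -> str:
    return f"retrieved answer for: {question}"


easy = route("capital of France?", toy_parametric, toy_search)
hard = route("an obscure recent result?", toy_parametric, toy_search)
print(easy.used_retrieval, hard.used_retrieval)
```

The design point is that retrieval cost is paid only on the questions that need it, which is what moves the investment from maximum coverage toward routing quality.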

KG-synthesized training data for deep search agents: DeepDive demonstrates that the training data bottleneck for deep search agents (the scarcity of hard-to-find questions requiring long-horizon reasoning) can be solved by synthesizing questions from knowledge graphs. KG random walks of varying lengths control reasoning depth, while selective blurring of entity attributes ("entity blurring") prevents shortcut solutions. Combined with multi-turn RL, DeepDive-32B achieves 14.8% on BrowseComp, a benchmark of hard-to-find information, a new competitive result among open-source models. The broader principle: KGs are ideal substrates for training data synthesis because they encode relational complexity while providing verifiable ground truth. See "Can knowledge graphs generate training data for search agents?"
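The two controls named above, walk length for reasoning depth and blurring for shortcut prevention, can be illustrated with a toy graph. This is a minimal sketch under stated assumptions, not DeepDive's pipeline: the graph `KG`, the `BLURRED` descriptions, the question template, and the helper names are all hypothetical.

```python
import random
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical toy knowledge graph: entity -> list of (relation, neighbor).
KG: Dict[str, List[Tuple[str, str]]] = {
    "Marie Curie": [("spouse", "Pierre Curie"), ("born_in", "Warsaw")],
    "Pierre Curie": [("born_in", "Paris")],
    "Warsaw": [("capital_of", "Poland")],
    "Paris": [("capital_of", "France")],
    "Poland": [],
    "France": [],
}

# Hypothetical blurred descriptions that hide the entity's name.
BLURRED: Dict[str, str] = {
    "Marie Curie": "a physicist who won Nobel Prizes in two different sciences",
}


def random_walk(kg, start: str, length: int, rng: random.Random) -> List[Triple]:
    """Walk up to `length` hops without revisiting nodes; longer means deeper."""
    path: List[Triple] = []
    current, visited = start, {start}
    for _ in range(length):
        options = [(r, t) for r, t in kg.get(current, []) if t not in visited]
        if not options:
            break  # dead end: return the shorter path
        relation, nxt = rng.choice(options)
        path.append((current, relation, nxt))
        visited.add(nxt)
        current = nxt
    return path


def synthesize(path: List[Triple], blurred: Dict[str, str]) -> Tuple[str, str]:
    """Build a question whose answer is the walk's last entity (ground truth)."""
    start = path[0][0]
    description = blurred.get(start, start)  # blur the start entity if possible
    hops = " -> ".join(relation for _, relation, _ in path)
    question = f"Start from {description}. Follow: {hops}. Which entity do you reach?"
    return question, path[-1][2]


walk = random_walk(KG, "Marie Curie", 2, random.Random(0))
question, answer = synthesize(walk, BLURRED)
print(question, "=>", answer)
```

Because the answer is read directly off the walk, every synthesized question comes with verifiable ground truth, and raising `length` raises the number of hops a solver has to chain.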

Original note title

deep research agents outperform rl-finetuned models on knowledge-intensive tasks because they replace memorized retrieval with real-world search