TOPIC

Deep Research Agents

15 synthesis notes · 53 source papers
Can schema-free graphs objectively evaluate open-ended search?

Can a directed graph with no preset structure capture the complexity of real search outputs while still enabling objective, fine-grained evaluation? This matters because existing evaluation methods trade objectivity for rigidity or richness for subjectivity.

Does search budget scale like reasoning tokens for answer quality?

Explores whether the test-time scaling law that applies to reasoning tokens also governs search-based retrieval in agentic systems. Understanding this relationship could reshape how we allocate inference compute between thinking and searching.

Can agents compress their own memory without losing critical details?

Explores whether agents can autonomously consolidate interaction history into structured memory schemas that reduce token overhead while preserving information needed for long-horizon reasoning and strategic reflection.

What capabilities do AI systems need for autonomous science?

Explores whether current AI benchmarks actually measure what's required for independent scientific research—hypothesis generation, experimental design, data analysis, and self-correction—or if they test only adjacent skills.

Can codified expertise let non-experts match specialist output?

When domain knowledge is captured as explicit rules and principles in an AI agent's scaffolding, can non-experts produce work at expert quality levels without consuming scarce specialist time? This explores whether structured knowledge codification dissolves organizational bottlenecks.

Why do search agents beat memorized retrieval on hard questions?

Deep research agents trained on live web search outperform models fine-tuned on static knowledge. Does real-world RL's advantage come from smarter reasoning, or from bypassing the limitations of memorized facts?

What makes deep research fundamentally different from RAG?

Explores whether current systems using the label 'deep research' actually meet a rigorous three-component definition involving multi-step gathering, cross-source synthesis, and iterative refinement, or if they're performing something narrower.

Can agents discover tools dynamically instead of pre-selecting them?

Explores whether agents can find the tools they need during execution rather than choosing from a fixed set upfront. This matters for long-horizon tasks where the relevant tools cannot be known in advance.

Do hierarchical retrieval architectures outperform flat ones on complex queries?

Explores whether separating query planning from answer synthesis into distinct architectural components improves performance on multi-hop retrieval tasks compared to unified single-pass approaches.

Can models learn better by training on messy exploration paths?

Does including trial-and-error, reflection, and backtracking in training data teach models to reason more robustly than teaching only the polished shortest path to answers?

Does limiting reasoning per turn improve multi-turn search quality?

When language models engage in iterative search cycles, does capping reasoning at each turn—rather than just total compute—help preserve context for subsequent retrievals and improve overall search effectiveness?

Does reinforcement learning squeeze exploration diversity in search agents?

Investigates whether RL training narrows the behavioral diversity of search agents the same way it does in reasoning tasks. Understanding this mechanism could reveal whether entropy collapse is fundamental to RL or domain-specific.

Why do search agents fail users despite strong benchmark scores?

Search evaluation benchmarks show high performance, yet real users remain unsatisfied. What gaps between test conditions and actual search behavior explain this disconnect?

Do search steps follow the same scaling rules as reasoning tokens?

Explores whether the overthinking curve observed in reasoning models also appears in deep research agents. This matters because it could reveal universal scaling laws governing all inference-time compute.

Can simulated APIs and token-level credit assignment train better tool-using agents?

Training agents to use real APIs is expensive and unstable, and sparse rewards make it hard to credit the right tool calls. Can combining LLM simulators with fine-grained advantage attribution solve both problems?

Source papers 53

The arXiv papers behind this sub-topic. Links may take you off-site to arxiv.org.