Deep Research Agents

Can schema-free graphs objectively evaluate open-ended search?

Can a directed graph with no preset structure capture the complexity of real search outputs while still enabling objective, fine-grained evaluation? This matters because existing evaluation methods are either objective but rigid, or rich but subjective.

Does search budget scale like reasoning tokens for answer quality?

Explores whether the test-time scaling law that applies to reasoning tokens also governs search-based retrieval in agentic systems. Understanding this relationship could reshape how we allocate inference compute between thinking and searching.

Can agents compress their own memory without losing critical details?

Explores whether agents can autonomously consolidate interaction history into structured memory schemas that reduce token overhead while preserving information needed for long-horizon reasoning and strategic reflection.

What capabilities do AI systems need for autonomous science?

Explores whether current AI benchmarks actually measure what's required for independent scientific research (hypothesis generation, experimental design, data analysis, and self-correction) or whether they test only adjacent skills.

Can codified expertise let non-experts match specialist output?

When domain knowledge is captured as explicit rules and principles in an AI agent's scaffolding, can non-experts produce work at expert quality levels without consuming scarce specialist time? This explores whether structured knowledge codification dissolves organizational bottlenecks.

Why do search agents beat memorized retrieval on hard questions?

Deep research agents trained on live web search outperform models fine-tuned on static knowledge. Does real-world RL's advantage come from smarter reasoning, or from bypassing the limitations of memorized facts?

What makes deep research fundamentally different from RAG?

Explores whether current systems using the label 'deep research' actually meet a rigorous three-component definition involving multi-step gathering, cross-source synthesis, and iterative refinement, or whether they perform something narrower.

Can agents discover tools dynamically instead of pre-selecting them?

Explores whether agents can find needed tools during execution rather than choosing from a fixed set upfront. This matters for long-horizon tasks where relevant tools cannot be known in advance.

Do hierarchical retrieval architectures outperform flat ones on complex queries?

Explores whether separating query planning from answer synthesis into distinct architectural components improves performance on multi-hop retrieval tasks compared to unified single-pass approaches.

Can models learn better by training on messy exploration paths?

Does including trial-and-error, reflection, and backtracking in training data teach models to reason more robustly than teaching only the polished shortest path to answers?

Does limiting reasoning per turn improve multi-turn search quality?

When language models engage in iterative search cycles, does capping reasoning at each turn, rather than capping only total compute, help preserve context for subsequent retrievals and improve overall search effectiveness?

Does reinforcement learning squeeze exploration diversity in search agents?

Investigates whether RL training narrows the behavioral diversity of search agents the same way it does in reasoning tasks. Understanding this mechanism could reveal whether entropy collapse is fundamental to RL or domain-specific.

Why do search agents fail users despite strong benchmark scores?

Search evaluation benchmarks show high performance, yet real users remain unsatisfied. What gaps between test conditions and actual search behavior explain this disconnect?

Do search steps follow the same scaling rules as reasoning tokens?

Explores whether the overthinking curve observed in reasoning models also appears in deep research agents. This matters because it could reveal universal scaling laws governing all inference-time compute.

Can simulated APIs and token-level credit assignment train better tool-using agents?

Training agents to use real APIs is expensive and unstable, and sparse rewards make it hard to credit the right tool calls. Can combining LLM simulators with fine-grained advantage attribution solve both problems?