Why do search agents fail users despite strong benchmark scores?
Search evaluation benchmarks show high performance, yet real users remain unsatisfied. What gaps between test conditions and actual search behavior explain this disconnect?
There is a persistent gap between how well search agents score and how satisfied real users are, and VibeSearchBench locates its cause in the benchmarks themselves rather than the models. Three artifacts of standard benchmark design make the test unlike real search. First, over-specified queries: task constraints are exhaustively packed into one prompt, leaving the agent nothing to elicit — yet real users cannot fully articulate their needs upfront. Second, single-turn interaction: benchmarks skip the sustained back-and-forth where the hardest and most valuable work happens, namely mining the user's true intent. Third, fixed-schema outputs: results are scored against predetermined items, sets, or tables, but real knowledge relationships are too complex for rigid schemas.
The implication is that high benchmark scores can be an artifact of a test that has pre-solved the parts users actually struggle with. When the query is already complete, the interaction is a single turn, and the output is schema-matched, the agent is doing retrieval, not search; real search is collaborative refinement of vague intent. The counterpoint is that over-specified single-turn benchmarks are cheap, reproducible, and objective: they trade realism for measurability. But that trade is exactly what produces the evaluation-experience gap. The diagnosis therefore warns against trusting search-agent leaderboards as deployment signals and points to what realistic evaluation must restore: vagueness, multi-turn dialogue, and open-ended structure.
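The contrast between an over-specified single-turn task and a vague multi-turn one can be made concrete with a toy sketch. This is an illustration only, not VibeSearchBench's protocol: `SimulatedUser`, `over_specified_prompt`, `run_dialogue`, the slot names, and the coverage score are all hypothetical names and choices introduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SimulatedUser:
    """Hypothetical user: states a vague query, reveals a constraint only when asked."""
    vague_query: str
    hidden_constraints: Dict[str, str]  # slot name -> true preference
    revealed: Dict[str, str] = field(default_factory=dict)

    def answer(self, slot: str) -> Optional[str]:
        # A preference is disclosed only if the agent asks about that slot.
        value = self.hidden_constraints.get(slot)
        if value is not None:
            self.revealed[slot] = value
        return value


def over_specified_prompt(user: SimulatedUser) -> str:
    """Single-turn style: every constraint is packed into the prompt up front."""
    details = ", ".join(f"{k}={v}" for k, v in sorted(user.hidden_constraints.items()))
    return f"{user.vague_query} ({details})"


def run_dialogue(user: SimulatedUser, questions: List[str], max_turns: int) -> float:
    """Multi-turn style: the agent spends turns asking; returns the fraction of
    hidden constraints it managed to elicit (an assumed, illustrative metric)."""
    for slot in questions[:max_turns]:
        user.answer(slot)
    return len(user.revealed) / len(user.hidden_constraints)


user = SimulatedUser(
    vague_query="find me a good laptop",
    hidden_constraints={
        "budget": "under 1000 USD",
        "use": "video editing",
        "weight": "under 1.5 kg",
    },
)

# Over-specified benchmark: nothing is left for the agent to elicit.
prompt = over_specified_prompt(user)

# Vague multi-turn setting: one of three questions misses, one need stays hidden.
coverage = run_dialogue(user, ["budget", "color", "use"], max_turns=3)
print(prompt)
print(round(coverage, 2))
```

In the over-specified case the elicitation step disappears entirely, so a score on that task says nothing about how well an agent asks. The multi-turn variant makes that skill measurable, at the cost of needing a simulated user whose behavior is itself a modeling assumption.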
Inquiring lines that use this note as a source 7
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- How does search budget affect answer quality at test time?
- What makes search budget matter for research task performance?
- Why do current benchmarks fail to match user satisfaction with search results?
- What is the gap between benchmark performance and real workplace task completion?
- What role does vague intent play in realistic search evaluation?
- Can high benchmark scores mislead deployment decisions for search agents?
- Do gains from harness-based agents transfer across different search benchmarks?
Related concepts in this collection 5
- Which clarifying questions actually improve user satisfaction?
  Not all clarification helps equally. This explores whether asking users to rephrase their needs works as well as asking targeted questions about specific information gaps.
  explains why multi-turn intent elicitation, which these benchmarks skip, drives satisfaction
- Does user satisfaction actually measure cognitive understanding?
  Users may report satisfaction while remaining internally confused about their needs. This explores whether traditional satisfaction metrics capture genuine clarity or merely social politeness.
  complicates the satisfaction signal the evaluation-experience gap relies on
- Why do deep research agents fabricate scholarly content?
  Explores whether AI research agents deliberately invent plausible-sounding academic constructs to meet user demands for depth and comprehensiveness, and what drives this behavior.
  fine-grained failure analysis that single-turn success scores also obscure
- Should interactive evaluation be designed as a unified paradigm?
  As AI systems increasingly interact over time with tools and environments, evaluation practice must evolve. Should interactive evaluation be treated as a principled design science with shared protocols, or adopted incrementally as new benchmarks?
  extends: generalizes the diagnosis; the fix for vague intent and multi-turn dialogue is a new evaluation paradigm, not another single-turn benchmark
- Should we evaluate deployed agents as whole environments instead?
  Conventional LLM evaluation focuses on models or individual episodes, but what if the right measurement unit is the entire coupled human-agent system including memory, tools, and protocols observed over time?
  synthesizes: relocates the evaluation target to the human-agent loop, which is exactly what over-specified single-turn schemas erase
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild
- News Source Citing Patterns in AI Search Systems
- UserBench: An Interactive Gym Environment for User-Centric Agents
- Survey on Evaluation of LLM-based Agents
- Interactive Evaluation Requires a Design Science
- Backtracing: Retrieving the Cause of the Query
- On the Theoretical Limitations of Embedding-Based Retrieval
- Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL
Original note title
search agents score well on benchmarks yet users find results unsatisfying because benchmarks use over-specified queries single turns and fixed schemas