SYNTHESIS NOTE

Why do AI agents miss most of what users actually want?

UserBench explores why current models align with user intent only 20% of the time, even when users reveal preferences across multiple turns. The question examines whether agents can learn to actively clarify ambiguous or evolving goals.

Synthesis note · 2026-02-23 · sourced from Design Frameworks

UserBench evaluates agents in multi-turn, preference-driven interactions where simulated users start with underspecified goals and reveal preferences incrementally. The results quantify a gap that existing benchmarks obscure:

Models provide answers that fully align with ALL user intents only 20% of the time on average
Even the most advanced models uncover fewer than 30% of all user preferences through active interaction
Scores drop by over 40% when models must select only one option per dimension (forcing commitment rather than hedging)
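
To make the first two figures concrete, here is a minimal sketch of how a full-alignment rate and a preference-elicitation rate could be computed from logged interactions. It is an illustration under stated assumptions, not UserBench's actual scoring code: the Episode record, its field names, and both function names are hypothetical, and the toy data below is invented purely to exercise the functions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Episode:
    """One hypothetical multi-turn interaction (all field names are assumptions)."""
    true_preferences: set[str]       # everything the simulated user wants
    elicited_preferences: set[str]   # preferences the agent surfaced by asking
    satisfied_preferences: set[str]  # preferences the final answer actually meets


def full_alignment_rate(episodes: list[Episode]) -> float:
    """Fraction of episodes whose final answer satisfies every true preference."""
    if not episodes:
        return 0.0
    hits = sum(1 for e in episodes if e.true_preferences <= e.satisfied_preferences)
    return hits / len(episodes)


def elicitation_rate(episodes: list[Episode]) -> float:
    """Fraction of all true preferences that the agent uncovered by asking."""
    total = sum(len(e.true_preferences) for e in episodes)
    if total == 0:
        return 0.0
    found = sum(len(e.true_preferences & e.elicited_preferences) for e in episodes)
    return found / total


# Toy data, invented for illustration only.
episodes = [
    Episode({"a", "b"}, {"a"}, {"a", "b"}),    # fully aligned; 1 of 2 elicited
    Episode({"c", "d", "e"}, set(), {"c"}),    # not aligned; 0 of 3 elicited
]
print(full_alignment_rate(episodes))  # 0.5
print(elicitation_rate(episodes))     # 0.2
```

The two numbers are deliberately separate: an answer can satisfy a preference the agent never asked about (a lucky guess), so alignment and elicitation have to be tracked independently.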

The framework identifies three core traits of human communication that make this hard:

Underspecification: users initiate requests before fully formulating their goals
Incrementality: intent emerges and evolves across interaction turns
Indirectness: users obscure or soften their true intent for social or strategic reasons
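
The three traits can be sketched as a toy simulated user. This is an assumption-laden illustration of the idea, not the benchmark's simulator: the SimulatedUser class, its methods, and the trip-planning example are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SimulatedUser:
    """Toy user who holds hidden preferences and reveals them only when asked."""
    opening_request: str
    hidden: dict[str, str]  # dimension -> preference not yet revealed
    revealed: dict[str, str] = field(default_factory=dict)

    def start(self) -> str:
        # Underspecification: the opening request states no preferences.
        return self.opening_request

    def answer(self, dimension: str) -> str:
        # Incrementality: a preference surfaces only when its dimension is asked about.
        if dimension in self.hidden:
            pref = self.hidden.pop(dimension)
            self.revealed[dimension] = pref
            # Indirectness: the preference is phrased as a soft hint, not a requirement.
            return f"I guess {pref} would be nice."
        return "No strong feelings on that."


user = SimulatedUser(
    opening_request="Help me plan a trip.",
    hidden={"hotel": "somewhere quiet", "flight": "a morning departure"},
)
print(user.start())          # Help me plan a trip.
print(user.answer("hotel"))  # I guess somewhere quiet would be nice.
print(user.answer("car"))    # No strong feelings on that.
print(user.revealed)         # {'hotel': 'somewhere quiet'}
```

An agent that answers immediately after start() sees none of the hidden preferences, which is the failure mode the note describes.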

These are not edge cases — they are the default condition of human communication. Language is inherently ambiguous (Clark, 1996; Liu et al., 2023), and meaning is co-constructed through interaction.

The disconnect between task completion and user alignment is the critical finding. Standard benchmarks measure whether an agent completes a task — UserBench measures whether the agent completed the right task, from the user's perspective. Current models are task-capable but not user-aligned.

This connects to "Why can't users articulate what they want from AI?": the 20% figure quantifies the double gap. It also connects to "How do users actually form intent when prompting AI systems?": the incrementality trait confirms that treating intent as binary is a design error, not an edge case.

The finding that models elicit fewer than 30% of preferences through active querying connects to "Can models learn to ask clarifying questions instead of guessing?": proactive questioning is trainable (0.15% → 73.98%) but is not standard in current deployments.

Inquiring lines that read this note 20

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

When should tasks involve human-AI partnership versus full automation?

How can humans calibrate appropriate trust in AI systems?

What makes users willing to relinquish control to an agent?

How should conversational agents balance goal-driven initiative with user control?

How can we distinguish genuine user preferences from measurement artifacts?

How can we measure whether a user actually understands their own needs?

How do we evaluate AI systems when user perception misleads actual performance?

How do professional roles and expertise transform with AI-generated content?

Why do AI products default to service roles when users seek different kinds of help?

How do interface design choices shape consciousness attribution?

How does machine agency spectrum explain tool design mismatches with user behavior?

How does AI adoption affect human skill development and labor equality?

Why do 41 percent of AI startups target zones workers actually resist?

Why do benchmark improvements fail to reflect actual reasoning quality?

How do static benchmarks fail to capture human preference alignment?

Related concepts in this collection 6

Why can't users articulate what they want from AI? Explores the cognitive gap between imagining possibilities and expressing them as prompts. Why language interfaces create a harder envisioning task than traditional UI affordances.
the 20% figure quantifies the double gap
How do users actually form intent when prompting AI systems? Users face a 'gulf of envisioning'—they must simultaneously imagine possibilities and express them to language models. This cognitive gap creates breakdowns not from AI incapability but from users struggling to articulate what they truly need.
incrementality confirms intent maturation
Can models learn to ask clarifying questions instead of guessing? Exploring whether large language models can be trained to detect incomplete queries and actively request missing information rather than hallucinating answers or refusing to respond. This matters because conversational agents today remain passive, responding only when prompted.
proactive questioning addresses the preference elicitation gap
Why do language models fail in gradually revealed conversations? Explores why LLMs perform 39% worse when instructions arrive incrementally rather than upfront, and whether they can recover from early mistakes in multi-turn dialogue.
premature assumptions are the mechanism behind the 20%
Why can't advanced AI models take initiative in conversation? Despite extraordinary capability in answering and reasoning, LLMs fundamentally cannot initiate, redirect, or guide exchanges. Understanding this gap—and whether it's fixable—matters for building AI that truly collaborates rather than merely responds.
passivity prevents preference discovery
Why do search agents fail users despite strong benchmark scores? Search evaluation benchmarks show high performance, yet real users remain unsatisfied. What gaps between test conditions and actual search behavior explain this disconnect?
grounds: quantifies the multi-turn intent-elicitation gap these single-turn benchmarks hide

Original note title

agents fully align with all user intents only 20 percent of the time — even best models elicit fewer than 30 percent of preferences through active querying