How do users actually form intent when prompting AI systems?
Users face a 'gulf of envisioning'—they must simultaneously imagine possibilities and express them to language models. This cognitive gap creates breakdowns not from AI incapability but from users struggling to articulate what they truly need.
The STORM framework names a fundamental gap in human-AI interaction: the "gulf of envisioning." Unlike conventional interfaces with predictable affordances, language models require users to simultaneously envision possibilities AND express them. This cognitive difficulty produces communication breakdowns — not because the AI is incapable, but because the user cannot articulate a prompt that captures what they actually need.
More formally, human intent formation is a process of progressive constraint resolution, marked by intervals of fluctuating stability and by distinct structural signaling patterns. Intent is not binary (present or absent, clear or ambiguous). It MATURES through interaction: it starts vague, acquires constraints, stabilizes, sometimes destabilizes when new information arrives, then reconsolidates. Current evaluation methods fail because they (1) treat intent as binary, (2) lack frameworks for temporal coherence, and (3) overlook structural signals within expressions.
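The maturation cycle described above (vague, acquiring constraints, stable, destabilized, reconsolidated) can be sketched as a small state tracker. This is an illustrative sketch only, not STORM's implementation: the `IntentState` class, its labels, and the restaurant example are all hypothetical, and it assumes each turn's constraints have already been extracted into a dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class IntentState:
    """Hypothetical running record of the constraints a user has expressed."""
    constraints: dict = field(default_factory=dict)
    stable_turns: int = 0  # consecutive turns in which nothing changed
    history: list = field(default_factory=list)

    def update(self, observed: dict) -> str:
        """Fold one turn's observed constraints into the state and label the turn."""
        added = [k for k in observed if k not in self.constraints]
        revised = [k for k, v in observed.items()
                   if k in self.constraints and self.constraints[k] != v]
        self.constraints.update(observed)

        if revised:                  # an earlier constraint was overturned
            label = "destabilized"
        elif added:                  # new constraints, none contradicted
            label = "acquiring"
        elif not self.constraints:   # nothing expressed yet
            label = "vague"
        else:                        # nothing new, nothing contradicted
            label = "stable"

        self.stable_turns = self.stable_turns + 1 if label == "stable" else 0
        self.history.append(label)
        return label


# Hypothetical six-turn conversation about choosing a restaurant.
turns = [{}, {"cuisine": "thai"}, {"budget": "low"}, {}, {"cuisine": "italian"}, {}]
state = IntentState()
labels = [state.update(t) for t in turns]
print(labels)
```

The point of the sketch is that the same final constraint set can be reached through very different trajectories, which a binary "intent clear / intent unclear" label would not distinguish.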
STORM models this through asymmetric information dynamics: UserLLM has full access to internal states (preferences, emotions, background) while AgentLLM has only observable dialogue history. This asymmetry mirrors real human-AI interaction — the AI cannot access the user's unstated context, unresolved preferences, or evolving understanding of their own needs.
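The information asymmetry can be made concrete with two views over the same conversation. The sketch below is a minimal illustration under stated assumptions: `UserProfile`, `Dialogue`, `user_view`, and `agent_view` are hypothetical names, and the profile fields simply mirror the three internal states listed above (preferences, emotions, background).

```python
from dataclasses import dataclass, field


@dataclass
class UserProfile:
    """Internal state that only the simulated user can see (hypothetical fields)."""
    preferences: dict
    emotion: str
    background: str


@dataclass
class Dialogue:
    """Observable conversation: a list of (speaker, utterance) pairs."""
    turns: list = field(default_factory=list)


def user_view(profile: UserProfile, dialogue: Dialogue) -> dict:
    """The simulated user conditions on its internal state AND the history."""
    return {"profile": profile, "history": list(dialogue.turns)}


def agent_view(dialogue: Dialogue) -> dict:
    """The agent conditions on the observable history only."""
    return {"history": list(dialogue.turns)}


profile = UserProfile(
    preferences={"budget": "low"},  # never stated aloud in this example
    emotion="anxious",
    background="first-time traveller",
)
dialogue = Dialogue(turns=[("user", "I need somewhere to stay in Lisbon.")])

print(sorted(user_view(profile, dialogue)))
print(sorted(agent_view(dialogue)))
```

Anything the user has not put into `dialogue.turns`, such as the budget preference here, is structurally unavailable to the agent unless it asks.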
The novel Clarify metric measures whether agent responses genuinely improve users' understanding of their own needs — assessed through analysis of simulated user inner thoughts rather than external expressions. This captures an invisible cognitive process: a user may SAY "thanks, that's helpful" while internally remaining confused about what they actually want.
Language models fail when a conversation reveals requirements gradually (see "Why do language models fail in gradually revealed conversations?"). STORM recasts this not as a pure AI failure but as a joint user-AI failure. The user's expressions contain structural signals (stylistic choices, implicit assumptions, cultural markers) that reflect what Wittgenstein called contextual embeddedness within "forms of life." Current systems cannot access these embedded cues.
In practice, this yields three complementary evaluation dimensions: satisfaction derived from inner thoughts (internal contentment), clarification effectiveness (the Clarify metric), and Satisfaction-Seeking Actions (SSA, a composite of both). Together they capture what single-metric evaluation misses.
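The relationship between the three dimensions can be sketched numerically. Everything specific here is an assumption: the note does not say how scores are scaled or how SSA combines its two components, so the sketch assumes per-turn ratings in [0, 1] (already produced by some judge reading the simulated user's inner thoughts) and uses an unweighted mean as a placeholder for the composite. `TurnRating` and `evaluate` are hypothetical names.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TurnRating:
    """Per-turn ratings judged from the simulated user's inner thoughts."""
    satisfaction: float  # assumed scale: 0.0 (discontent) to 1.0 (content)
    clarify: float       # assumed scale: 0.0 (no clearer) to 1.0 (much clearer)


def evaluate(ratings: list) -> dict:
    """Average each dimension over the dialogue, then combine them.

    The unweighted mean used for "ssa" is a placeholder assumption,
    not the combination rule from the STORM paper.
    """
    satisfaction = mean(r.satisfaction for r in ratings)
    clarify = mean(r.clarify for r in ratings)
    return {
        "satisfaction": satisfaction,
        "clarify": clarify,
        "ssa": (satisfaction + clarify) / 2,
    }


# A user who feels content but has learned little about their own needs.
ratings = [TurnRating(satisfaction=1.0, clarify=0.0),
           TurnRating(satisfaction=0.5, clarify=0.5)]
scores = evaluate(ratings)
print(scores)
```

In this example a satisfaction-only evaluation would report 0.75 and hide the low clarification score of 0.25, which is the kind of gap the separate dimensions are meant to expose.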
The original gulf of envisioning paper (Zamfirescu-Pereira et al., 2023) defines three specific misalignment gaps: (1) the capability gap, not knowing what the task should be (what can the LLM even do?); (2) the instruction gap, not knowing how best to instruct the LLM about goals (the difficulty of prompt engineering); and (3) the intentionality gap, not knowing what to expect from the LLM's output in meeting the goal. The paper notes that traditional HCI inadvertently bypassed intention formation because conventional interfaces have fixed command vocabularies: clicking "Bold" doesn't require envisioning what boldness means. LLM interfaces require envisioning at every step. The iterative process resembles a "20-questions" or "Hot or Cold" guessing game, which may be inefficient for longer outputs and can lead to local minima within the solution space. Further, humans fixate on initial examples, which interferes with exploring alternative solutions.
UserBench (2025) quantifies the downstream consequences: models provide answers that fully align with ALL user intents only 20% of the time, and even the best models uncover fewer than 30% of user preferences through active interaction. The three core traits of user communication — underspecification, incrementality, indirectness — are not edge cases but the default condition.
A concrete domain-specific validation comes from LP (linear programming) dialogue research: individuals without specialized mathematical backgrounds "often struggle to formulate the appropriate linear models for their specific problem instances." The proposed solution — a two-agent synthetic dialogue system where one agent simulates the conversational assistant and the other emulates the user — is specifically designed to elicit information the user possesses but cannot organize into a formal structure. This is a clean instance of the gulf of envisioning: the user has the problem knowledge (constraints, objectives) but literally cannot state it as a model without conversational assistance. Mathematical problem formulation thus serves as a particularly transparent example of intent maturation — the user's "intent" (to solve their LP problem) is real but unformulable without guided dialogue.
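To make "formal structure" concrete, a linear program in a common textbook form asks for a decision vector $x$ that optimizes a linear objective subject to linear constraints. The small bakery instance that follows is a hypothetical illustration, not an example from the cited research: a user might know every fact in it ("bread earns 2, cake earns 5, I have 40 kg of flour and 30 oven hours") without being able to write it in this form.

```latex
% General form
\max_{x} \; c^{\top} x
\quad \text{subject to} \quad
A x \le b, \qquad x \ge 0

% Hypothetical instance: x_1 = loaves of bread, x_2 = cakes
\max_{x_1, x_2} \; 2 x_1 + 5 x_2
\quad \text{subject to} \quad
\begin{aligned}
x_1 + 2 x_2 &\le 40 && \text{(flour, kg)} \\
x_1 + x_2   &\le 30 && \text{(oven hours)} \\
x_1, x_2    &\ge 0
\end{aligned}
```

Each line of the instance corresponds to something the user can state in ordinary language when asked a targeted question; the dialogue's job is to elicit those statements and assemble them into $c$, $A$, and $b$.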
Inquiring lines that use this note as a source (12)
- Why can't users and AI articulate shared goals together?
- How does anomalous knowledge state connect to the gulf of envisioning?
- How should designers make invisible AI state legible to users?
- Can users articulate what they want before AI helps them discover it?
- How do users fail to articulate what they actually want?
- Can prompt engineering overcome the gulf between user intent and AI interpretation?
- Can users articulate their intent before exploring what an AI system finds?
- Why do AI models treat user intent as binary rather than evolving?
- What makes evaluation easier than envisioning for users?
- What tasks do users actually want AI to handle versus what can it automate?
- What stops AI from helping users articulate preferences they cannot express?
- How does context engineering bridge human intent and machine understanding?
Related concepts in this collection (8)
- Why do language models fail in gradually revealed conversations?
  Explores why LLMs perform 39% worse when instructions arrive incrementally rather than upfront, and whether they can recover from early mistakes in multi-turn dialogue.
  STORM reframes premature assumptions as failures to track intent maturation
- Why can't advanced AI models take initiative in conversation?
  Despite extraordinary capability in answering and reasoning, LLMs fundamentally cannot initiate, redirect, or guide exchanges. Understanding this gap, and whether it's fixable, matters for building AI that truly collaborates rather than merely responds.
  the gulf of envisioning is the user-side complement to the AI-side passivity problem
- Which clarifying questions actually improve user satisfaction?
  Not all clarification helps equally. This explores whether asking users to rephrase their needs works as well as asking targeted questions about specific information gaps.
  clarification is the bridge across the gulf of envisioning
- Why do language models lose performance in longer conversations?
  Does multi-turn degradation stem from fundamental model limitations, or from misalignment between what users mean and what models assume? Understanding the root cause could guide better solutions.
  intent alignment gap connects directly to intent maturation
- Why do AI agents miss most of what users actually want?
  UserBench explores why current models align with user intent only 20% of the time, even when users reveal preferences across multiple turns. The question examines whether agents can learn to actively clarify ambiguous or evolving goals.
  quantifies the gulf of envisioning's consequences
- Can models identify what information they actually need?
  When a reasoning task is missing a key piece of information, can language models recognize what's absent and ask the right clarifying question? QuestBench tests this capability directly.
  QuestBench reveals models cannot even identify what information is missing (40-50% accuracy), so they cannot help users mature underspecified intent
- Why do reasoning models overthink ill-posed questions?
  Explores why models trained for extended reasoning produce drastically longer, less useful responses to unanswerable questions, and whether this represents a fixable training deficit or inherent limitation.
  when users provide incomplete intent, reasoning models overthink rather than recognizing the gap and asking for clarification
- Why do users drift away from their original information need?
  When users know their knowledge is incomplete but cannot articulate what's missing, do they unintentionally shift topics? And can real-time systems detect this drift?
  ASK is the upstream cognitive cause of the gulf: the user's knowledge state is anomalous in a way that prevents intent articulation, producing the topic drift that the gulf predicts
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- WHEN TO ACT, WHEN TO WAIT: Modeling Structural Trajectories for Intent Triggerability in Task-Oriented Dialogue
- Bridging the gulf of envisioning: Cognitive design challenges in llm interfaces.
- UserBench: An Interactive Gym Environment for User-Centric Agents
- A Comment On "The Illusion of Thinking": Reframing the Reasoning Cliff as an Agentic Gap
- Intent Mismatch Causes LLMs to Get Lost in Multi-Turn Conversation
- Proactive Conversational Agents with Inner Thoughts
- Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse
- Beyond Hallucinations: The Illusion of Understanding in Large Language Models
Original note title
intent formation is a continuous maturation process not a binary state — the gulf of envisioning means users cannot formulate what they want while AI cannot help them evolve their intent