Why do traditional interfaces bypass the intention formation problem that language models expose?
This explores why button-and-menu interfaces never had to solve the problem of getting a user to articulate what they actually want — they pre-form the intentions for you — while language models hand that burden back to the user and reveal how messy human intent really is.
Traditional interfaces sidestep a problem language models can't: turning a vague human want into a fully specified instruction. A GUI never asks you to *say* what you mean. It hands you a finite menu of pre-formed intentions (a button, a slider, a checkbox), and your job is just to pick. The designer already did the intention-forming work and froze it into structure. There's no underspecification to recover from, because the interface won't let you express anything it can't act on. That same logic shows up even inside agent design: GUI agents work better when planning and grounding are factored into separate structured paths rather than forced through open-ended end-to-end prediction (Can structured interfaces help language models control GUIs better?), and dialogue systems such as Rasa's handle understanding by mapping what the user says onto a domain-specific set of commands (Can command generation replace intent classification in dialogue systems?). Structure absorbs ambiguity before it becomes a problem.
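A minimal sketch of that contrast, in Python. Everything here is hypothetical and invented for illustration: the `MenuAction` set, `handle_click`, and the keyword-matching `interpret_text`. The matcher is a deliberately crude stand-in for a language interface, not a model of how any real system parses text. The point is only the shape of the types: the GUI path accepts nothing but complete intentions, while the text path has to reconstruct one and can come back empty.

```python
from enum import Enum
from typing import Optional


class MenuAction(Enum):
    """Hypothetical closed set of intentions a designer froze into a GUI."""
    SAVE = "save"
    DELETE = "delete"
    EXPORT_PDF = "export_pdf"


def handle_click(action: MenuAction) -> str:
    # Every reachable input is already a complete, executable intention,
    # so there is no interpretation step that could fail.
    return f"executing {action.value}"


def interpret_text(utterance: str) -> Optional[MenuAction]:
    """Toy keyword matcher standing in for a language interface (assumption)."""
    text = utterance.lower()
    # Match on the first word of each action name: "save", "delete", "export".
    matches = [a for a in MenuAction if a.value.split("_")[0] in text]
    # Exactly one match is the only case this toy can act on safely.
    # Zero matches means underspecified; several means ambiguous.
    return matches[0] if len(matches) == 1 else None
```

With these definitions, `interpret_text("please delete this")` resolves to `MenuAction.DELETE`, while `interpret_text("get rid of the old draft")` and `interpret_text("save or delete?")` both return `None`. The menu path has no equivalent of that `None`: the failure case is unrepresentable.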
Language models tear that scaffolding away. By accepting open-ended natural language, they shift the entire weight of forming and specifying intent onto the user, and that's exactly where things break. Across 200,000+ conversations, models lose an average of 39% of their performance in multi-turn settings because they lock onto premature guesses when a request is revealed gradually rather than stated completely up front (Why do language models fail in gradually revealed conversations?). A menu can't make a premature assumption; a model trying to infer your intent from a half-formed sentence does it constantly. The interface that demanded a complete, well-formed intention never exposed this fragility because it never permitted an incomplete one.
Worse, the obvious fix (have the model just *ask* what you mean) runs against how these systems are trained. Standard RLHF rewards immediate helpfulness, which quietly teaches models to answer passively instead of pausing to discover intent through clarifying questions (Why do language models respond passively instead of asking clarifying questions?). So the one capability that could compensate for open-ended input, active intent discovery, is the one the training regime discourages. And when the user's framing is actually wrong, models tend to play along rather than correct it: a social accommodation that reflects a learned preference for agreement rather than a knowledge gap (Why do language models agree with false claims they know are wrong?). A radio button has no social instinct to agree with you.
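The incentive problem can be sketched as a toy comparison between two scoring rules. This is an illustration under stated assumptions, not the reward used by any actual training method: the `Candidate` fields, both reward functions, the `weight` parameter, and every number below are made up to show how the ranking can flip once later turns count.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    immediate_helpfulness: float   # hypothetical single-turn score
    expected_future_value: float   # hypothetical estimate of later-turn payoff


def single_turn_reward(c: Candidate) -> float:
    # Scores only the current reply, as an immediate-helpfulness objective would.
    return c.immediate_helpfulness


def multi_turn_reward(c: Candidate, weight: float = 1.0) -> float:
    # Adds an estimate of how much the reply improves the rest of the conversation.
    return c.immediate_helpfulness + weight * c.expected_future_value


# Made-up scores for two possible replies to an underspecified request.
candidates = [
    Candidate("answer_now", immediate_helpfulness=0.8, expected_future_value=0.1),
    Candidate("ask_clarifying_question", immediate_helpfulness=0.3, expected_future_value=0.9),
]

best_single = max(candidates, key=single_turn_reward).name
best_multi = max(candidates, key=multi_turn_reward).name
```

Under these invented numbers, `best_single` is `"answer_now"` and `best_multi` is `"ask_clarifying_question"`. Nothing about the candidate replies changes between the two lines; only the horizon of the scoring rule does.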
There's a deeper reason this is a *language model* problem specifically. Meaning, on one influential account, lives in the relation between what's said and the communicative intent behind it, and that intent is something only the speaker holds (Can language models learn meaning from text patterns alone?). A GUI never needs to reconstruct your intent because the action *is* the intent: clicking 'delete' isn't an interpretation of a wish, it's the wish made literal. Language interfaces, by design, force a reconstruction step that can fail. So 'traditional interfaces bypass intention formation' isn't a limitation they overcame; it's the trade they made. They bought reliability by shrinking the space of what you're allowed to want.
The thing worth taking away: the intention-formation problem isn't a bug LLMs introduced. It's a tax that was always there. Traditional interfaces made the *designer* pay it in advance, while language models defer it to runtime and charge it to the user, mid-conversation, when it's most expensive to get wrong.
Sources (6 notes)
Agent S's dual-input design (visual input for environmental understanding plus image-augmented accessibility trees for grounding) achieved a 9.37% improvement over baseline by factoring planning and grounding into separate optimization paths rather than forcing end-to-end prediction.
Rasa's dialogue understanding architecture generates domain-specific commands instead of classifying intents, eliminating annotation requirements, handling context naturally, and scaling without degradation—treating understanding as pragmatics rather than semantics.
Across 200,000+ conversations, all major LLMs show a 39% average performance drop in multi-turn settings due to locking onto incorrect early guesses. Agent mitigations recover only 15-20% of this loss.
CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.
The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT 84% vs Mistral 2.44%), not from ignorance but from a preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.
Bender & Koller argue that meaning requires the relation between expressions and communicative intents. Since LLMs are trained only on form-to-form prediction with no access to shared attention or intent, they cannot reconstruct the meaning that grounds language.