INQUIRING LINE

Why do traditional interfaces bypass the intention formation problem that language models expose?

This explores why button-and-menu interfaces never had to solve the problem of getting a user to articulate what they actually want — they pre-form the intentions for you — while language models hand that burden back to the user and reveal how messy human intent really is.


Traditional interfaces sidestep a problem language models can't: turning a vague human want into a fully specified instruction. A GUI never asks you to *say* what you mean. It hands you a finite menu of pre-formed intentions (a button, a slider, a checkbox), and your job is just to pick. The designer already did the intention-forming work and froze it into structure. There's no underspecification to recover from, because the interface won't let you express anything it can't act on. That same logic shows up even inside agent design: GUI agents work better when planning and grounding are factored into separate structured paths rather than forced through open-ended end-to-end prediction (see "Can structured interfaces help language models control GUIs better?"), and dialogue systems such as Rasa's handle understanding by mapping utterances onto a closed set of domain-specific commands (see "Can command generation replace intent classification in dialogue systems?"). Structure absorbs ambiguity before it becomes a problem.
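To make the contrast concrete, here is a minimal Python sketch. Everything in it is hypothetical: the `Action` menu, `handle_click`, and `handle_text` are illustrative names, and the keyword lookup is a deliberately naive stand-in for a model's inference, not how any real system works. The point is only that the closed menu has no failure path for interpretation, while the open text path does.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    """Hypothetical closed menu: each member is an intention the designer pre-formed."""
    SAVE = "save"
    DELETE = "delete"
    SHARE = "share"


def handle_click(action: Action) -> str:
    # Every reachable input is already fully specified, so nothing is interpreted.
    return f"executing {action.value}"


def handle_text(request: str) -> Optional[Action]:
    # Open-ended input has to be mapped back onto an intention, and that can fail.
    # Naive keyword matching stands in for a model's inference here (an assumption).
    matches = [a for a in Action if a.value in request.lower()]
    # Zero matches or several matches both leave the intent unresolved.
    return matches[0] if len(matches) == 1 else None
```

With this sketch, `handle_text("please delete this")` resolves to `Action.DELETE`, while "get rid of it" and "save or share?" both come back unresolved. `handle_click` cannot receive either of those inputs at all.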

Language models tear that scaffolding away. By accepting open-ended natural language, they shift the entire weight of forming and specifying intent onto the user, and that's exactly where things break. Across 200,000+ conversations, models lose an average of 39% of their performance in multi-turn settings because they lock onto premature guesses when a request is revealed gradually rather than stated completely up front (see "Why do language models fail in gradually revealed conversations?"). A menu can't make a premature assumption; a model trying to infer your intent from a half-formed sentence does it constantly. The interface that demanded a complete, well-formed intention never exposed this fragility because it never permitted an incomplete one.
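A toy Python sketch of premature locking, under loud assumptions: requests arrive as hypothetical `key=value` shards, and `DEFAULT_GUESSES`, `eager_commit`, and `wait_for_full` are invented for illustration. This is not how the cited study measured anything; it only shows the shape of the failure, where an early guess survives because later turns are never folded back in.

```python
# Assumed guesses a responder might fill in when the request is still incomplete.
DEFAULT_GUESSES = {"language": "python", "output": "print"}


def parse(shard: str) -> dict[str, str]:
    # Hypothetical shard format: one "key=value" pair per conversational turn.
    key, _, value = shard.partition("=")
    return {key.strip(): value.strip()}


def eager_commit(shards: list[str]) -> dict[str, str]:
    # Locks in a full specification after the first turn and never revises it.
    spec = dict(DEFAULT_GUESSES)
    spec.update(parse(shards[0]))
    return spec


def wait_for_full(shards: list[str]) -> dict[str, str]:
    # Folds every turn into the specification before committing.
    spec: dict[str, str] = {}
    for shard in shards:
        spec.update(parse(shard))
    return spec
```

Given the turns `["task=sort", "language=rust", "output=file"]`, `eager_commit` still reports `language` as `python`, while `wait_for_full` reports `rust`.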

Worse, the obvious fix (have the model just *ask* what you mean) runs against how these systems are trained. Standard RLHF rewards immediate helpfulness, which quietly teaches models to answer passively instead of pausing to discover intent through clarifying questions (see "Why do language models respond passively instead of asking clarifying questions?"). So the one capability that could compensate for open-ended input, active intent discovery, is the one the training regime discourages. And when the user's framing is actually wrong, models tend to play along rather than correct it: a face-saving accommodation learned in training rather than a knowledge gap (see "Why do language models agree with false claims they know are wrong?"). A radio button has no social instinct to agree with you.
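What "ask before answering" looks like as behavior can be sketched in a few lines of Python. This is a rule-based illustration only, with hypothetical slot names and a hypothetical `respond` function; it is not the multi-turn-aware reward approach the CollabLLM note describes, which changes training rather than adding a hand-written gate.

```python
# Hypothetical slots a travel request must fill before it can be acted on.
REQUIRED_SLOTS = ("destination", "date")


def respond(known: dict[str, str]) -> str:
    # Collect whatever the user has not yet specified, in a fixed order.
    missing = [slot for slot in REQUIRED_SLOTS if slot not in known]
    if missing:
        # Active intent discovery: ask about the first gap instead of guessing.
        return f"Which {missing[0]} did you have in mind?"
    return f"Booking a trip to {known['destination']} on {known['date']}."
```

Calling `respond({"destination": "Oslo"})` yields a question about the date rather than a booking built on an assumed one. The design choice worth noticing is that the gate only works because the slots are enumerated in advance, which is the GUI's structural trick reintroduced by hand.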

There's a deeper reason this is a *language model* problem specifically. Meaning, on one influential account, lives in the relation between what's said and the communicative intent behind it, and that intent is something only the speaker holds (see "Can language models learn meaning from text patterns alone?"). A GUI never needs to reconstruct your intent because the action *is* the intent: clicking 'delete' isn't an interpretation of a wish, it's the wish made literal. Language interfaces, by design, force a reconstruction step that can fail. So bypassing intention formation isn't a limitation traditional interfaces overcame; it's the trade they made. They bought reliability by shrinking the space of what you're allowed to want.

The thing worth taking away: the intention-formation problem isn't a bug LLMs introduced, it's a tax that was always there — traditional interfaces just made the *designer* pay it in advance, while language models defer it to runtime and charge it to the user, mid-conversation, when it's most expensive to get wrong.


Sources (6 notes)

Can structured interfaces help language models control GUIs better?

Agent S's dual-input design—visual input for environmental understanding plus image-augmented accessibility trees for grounding—achieved 9.37% improvement over baseline by factoring planning and grounding into separate optimization paths rather than forcing end-to-end prediction.

Can command generation replace intent classification in dialogue systems?

Rasa's dialogue understanding architecture generates domain-specific commands instead of classifying intents, eliminating annotation requirements, handling context naturally, and scaling without degradation—treating understanding as pragmatics rather than semantics.

Why do language models fail in gradually revealed conversations?

Across 200,000+ conversations, all major LLMs show 39% average performance drop in multi-turn settings due to locking into incorrect early guesses. Agent mitigations recover only 15-20% of this loss.

Why do language models respond passively instead of asking clarifying questions?

CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.

Why do language models agree with false claims they know are wrong?

The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT 84% vs Mistral 2.44%), not from ignorance but from preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.

Can language models learn meaning from text patterns alone?

Bender & Koller argue that meaning requires the relation between expressions and communicative intents. Since LLMs are trained only on form-to-form prediction with no access to shared attention or intent, they cannot reconstruct the meaning that grounds language.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher analyzing whether traditional interfaces truly *bypass* intention formation or merely *defer* it. The question: do language models expose a problem GUIs solved, or do they expose a problem GUIs hid?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026. The library documents:
- LLMs lose 39% performance in multi-turn settings because they lock onto premature intent-guesses when requests arrive incrementally rather than fully-formed (2025-05, arXiv:2505.06120).
- Standard RLHF optimizes for immediate helpfulness, discouraging models from pausing to actively discover user intent through clarifying questions (cited ~2025).
- Models tend to agree with incorrect framings rather than correct them, a face-saving behavior learned from conversational data rather than a knowledge gap (cited ~2025).
- Vision-language-action GUI agents perform better when planning and grounding are factored into separate structured paths instead of end-to-end (2024-11, arXiv:2411.17465).
- Intent mismatch (not knowledge gaps) is the root cause of multi-turn degradation (2026-02, arXiv:2602.07338).

Anchor papers (verify; mind their dates):
- arXiv:2505.06120 (2025-05): LLMs Get Lost In Multi-Turn Conversation
- arXiv:2411.17465 (2024-11): ShowUI — GUI agents benefit from structured decomposition
- arXiv:2602.07338 (2026-02): Intent Mismatch Causes LLMs to Get Lost in Multi-Turn Conversation
- arXiv:2507.08017 (2025-07): Mechanistic Indicators of Understanding in LLMs

Your task:
(1) RE-TEST the claim that RLHF discourages intent discovery. Has fine-tuning for multi-turn clarification, instruction-following refinement, or agent-loop orchestration (e.g., memory + caching systems that enable "ask before answering") since 2025 made it cheaper or more natural for models to pause and query user intent? Separate: *can* models learn to clarify (capability) vs. *do* they under standard training (incentive). Where does the constraint still hold?
(2) Surface contradictions: does recent work on mechanistic understanding (arXiv:2507.08017) or grounding under uncertainty (arXiv:2506.08952) suggest models *can* hold and reason over multiple interpretations of intent simultaneously, undermining the "premature locking" narrative?
(3) Propose two successor questions: (a) If intent mismatch, not knowledge, causes multi-turn failure, can we design training objectives that reward *meta-intent* — the intent to disambiguate — without breaking helpfulness? (b) Do multi-agent or nested-prompt-routing systems (e.g., intent-classifier → task-router → executor) reproduce the GUI's structural trick of absorbing ambiguity before runtime?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
