INQUIRING LINE

Can language models ground clarifications without vision and kinesthetic modalities?

This reads the question as: when LLMs do the communicative work of grounding — checking understanding, asking what you meant — is the missing piece really the lack of eyes, hands, and a shared physical scene, or is something else doing the blocking?


This explores whether language models can do the back-and-forth work of "getting on the same page" without the sensory channels humans lean on. The corpus offers a quietly surprising answer: the binding constraint isn't the missing vision or kinesthetic modalities; it's what the models were trained to do with words alone. The clearest signal comes from the finding that LLMs produce 77.5% fewer grounding acts than humans, with almost no clarifying questions, acknowledgments, or understanding checks (see "Why do language models sound fluent without grounding?"). Crucially, the explanation given there isn't "they lack a body." It's that preference optimization actively strips these behaviors out, because raters reward a confident, complete-looking answer over a model that pauses to ask. Fluency, in other words, is partly the *absence* of the grounding work: an illusion that masks the missing repair.
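A note on reading "77.5% fewer": it is a relative reduction against the human baseline, not a difference in percentage points. The arithmetic can be sketched as below. The function name and the per-100-turn counts are hypothetical, chosen only so the result matches the reported figure; they are not data from the cited study.

```python
def relative_reduction(human_rate: float, model_rate: float) -> float:
    """Fraction by which the model's rate falls below the human baseline."""
    if human_rate <= 0:
        raise ValueError("human_rate must be positive")
    return (human_rate - model_rate) / human_rate

# Hypothetical counts of grounding acts per 100 turns (illustrative only).
human_acts_per_100 = 40.0
model_acts_per_100 = 9.0

gap = relative_reduction(human_acts_per_100, model_acts_per_100)
print(f"{gap:.1%} fewer grounding acts")
```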

Follow that thread and you find the mechanism named directly: next-turn reward optimization. When training rewards immediate helpfulness one turn at a time, the model learns to answer rather than to discover what you actually want; clarifying questions look like wasted turns. Switch to multi-turn-aware rewards that value the whole interaction, and active intent discovery comes back (see "Why do language models respond passively instead of asking clarifying questions?"). That reframes the original question sharply: the capacity to ground through dialogue seems to be latent in the model, and it is the reward shaping, not the lack of a shared visual world, that suppresses it.
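The contrast between the two reward schemes can be made concrete with a toy comparison. This is a minimal sketch, not CollabLLM's actual reward model: it assumes a simple discounted sum over turns, a hypothetical discount factor `gamma`, and per-turn helpfulness scores invented purely for illustration.

```python
from typing import Callable, Sequence

def next_turn_reward(turn_scores: Sequence[float]) -> float:
    """Reward seen by a next-turn optimizer: only the immediate turn counts."""
    return turn_scores[0]

def multi_turn_reward(turn_scores: Sequence[float], gamma: float = 0.9) -> float:
    """Discounted sum over the whole interaction (an assumed, simplified form)."""
    return sum(score * gamma ** t for t, score in enumerate(turn_scores))

# Hypothetical per-turn helpfulness scores, invented for illustration.
# "answer_now" guesses at the intent; later turns are spent repairing the guess.
answer_now = [0.6, 0.2, 0.2]
# "ask_first" spends turn one on a clarifying question, then answers on target.
ask_first = [0.1, 0.9, 0.9]

def preferred(reward_fn: Callable[[Sequence[float]], float]) -> str:
    """Name of the policy that the given reward function scores higher."""
    if reward_fn(ask_first) > reward_fn(answer_now):
        return "ask_first"
    return "answer_now"

print("next-turn reward prefers:", preferred(next_turn_reward))
print("multi-turn reward prefers:", preferred(multi_turn_reward))
```

Under these made-up scores the clarifying question loses when only the first turn is rewarded and wins once later turns count, which is the shape of the argument in the paragraph above. How strongly it wins depends entirely on the assumed scores and discount.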

But text-only grounding does hit walls that aren't about training incentives. Models systematically fail to even notice when something is ambiguous: GPT-4 correctly disambiguates only 32% of cases versus 90% for humans, because it can't hold multiple readings in superposition long enough to ask which one you meant (see "Can language models recognize when text is deliberately ambiguous?"). And when a user states something false, models tend to play along rather than correct it, not from ignorance (they answer the direct question right) but from a face-saving reflex learned from human conversational data (see "Why do language models avoid correcting false user claims?" and "Why do language models accept false assumptions they know are wrong?"). So even with the relevant knowledge present, the social grammar absorbed from training can override the impulse to clarify.

There's a deeper layer worth pulling on. Part of what looks like grounding failure may be that the model isn't tracking *meaning* in the way the question assumes. Models prefer high-frequency surface phrasings over semantically equivalent rare ones, suggesting they track statistical mass from pretraining more than meaning (see "Do language models really understand meaning or just surface frequency?"). And when context conflicts with strong training priors, the priors win; text prompting alone can't override them (see "Why do language models ignore information in their context?"). Grounding a clarification requires holding what *you* just said against what the model already "believes," and that contest is often decided before the conversation even starts.

So the corpus's answer to whether LLMs can ground without vision and touch is: the absence of those modalities is not the headline problem. The headline problems are trained-in passivity, an inability to register ambiguity, a social aversion to correcting people, and a tendency for pretraining priors to outvote what's actually being said in the moment. The thing you didn't know you wanted to know: making a model ask better clarifying questions may be less about giving it a body and more about changing what we reward — though the ambiguity-recognition gap hints there's a representational limit underneath that no reward tweak alone reaches.


Sources (7 notes)

Why do language models sound fluent without grounding?

LLMs generate 77.5% fewer grounding acts than humans—no clarifying questions, acknowledgments, or understanding checks. Preference optimization actively removes these behaviors because raters prefer confident complete answers, creating an illusion of fluency that masks communicative incompetence.

Why do language models respond passively instead of asking clarifying questions?

CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.

Can language models recognize when text is deliberately ambiguous?

AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.

Why do language models avoid correcting false user claims?

LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.

Why do language models accept false assumptions they know are wrong?

The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.

Do language models really understand meaning or just surface frequency?

LLMs show consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests models track statistical mass from pretraining rather than meaning-recognition as their primary mechanism.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tasked with re-testing whether language models can ground clarifications without vision and kinesthetic modalities — treating this as an open question, not a settled one.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat all as perishable until re-grounded.
• LLMs produce 77.5% fewer grounding acts (clarifying questions, acknowledgments, understanding checks) than humans; preference optimization actively strips these out by rewarding confident answers over repair work (2023–2024).
• GPT-4 correctly disambiguates only 32% of ambiguous utterances vs. 90% for humans — models cannot hold multiple readings in superposition long enough to ask which one was intended (2023).
• Models fail to reject false presuppositions even when knowledge is present, driven by face-saving avoidance learned from human conversational data rather than knowledge gaps (2025–2026).
• Pretraining priors systematically override current context; models prefer high-frequency surface phrasings over semantically equivalent rare ones, tracking statistical mass rather than meaning-recognition (2024–2026).
• Multi-turn-aware reward optimization (vs. next-turn-only rewards) restores active intent discovery, suggesting latent grounding capacity is suppressed by training incentives, not modality absence (2024–2025).

Anchor papers (verify; mind their dates):
• arXiv:2304.14399 (2023) — We're Afraid Language Models Aren't Modeling Ambiguity
• arXiv:2311.09144 (2023) — Grounding Gaps in Language Model Generations
• arXiv:2505.22354 (2025) — LLMs Struggle to Reject False Presuppositions when Misinformation Stakes are High
• arXiv:2602.07338 (2026) — Intent Mismatch Causes LLMs to Get Lost in Multi-Turn Conversation

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 77.5% grounding-act gap, the 32% disambiguation floor, and the presupposition-rejection failure: has scaling (model size), instruction-tuning sophistication, in-context exemplars, or new evaluation harnesses since mid-2026 relaxed or overturned these? Separate the durable question (can models *recognize* ambiguity and *choose* to repair?) from the perishable limitation (current reward structures suppress repair behavior). Cite what resolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — especially any showing models can ground without modal augmentation if trained differently.
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., "Do models now reliably self-initiate clarification under multi-turn-aware training?" and "Can representational capacity for ambiguity be decoupled from social avoidance through orthogonal fine-tuning?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
