INQUIRING LINE

Can LLMs build shared understanding through dynamic grounding rather than presuming it?

This explores whether LLMs can do the live, back-and-forth work of building shared understanding — checking, clarifying, repairing — rather than just assuming both sides already agree on what's being discussed.


This question turns on a distinction the corpus draws sharply: *static* grounding (retrieve, answer, move on) versus *dynamic* grounding (the iterative clarify-and-repair loop humans use to make sure they actually understand each other). The short answer from the collection is that today's LLMs overwhelmingly operate in static mode: they presume common ground rather than build it. The corpus also maps out why, and hints at the conditions under which that could change (see "Why do language models skip the calibration step?" and "Do language models actually build shared understanding in conversation?").
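To make the static/dynamic distinction concrete, here is a minimal Python sketch. It is an illustration only, not a method from the corpus: `CommonGround`, `static_answer`, `dynamic_answer`, the `required_slots` parameter, and the `ask_user` callback are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class CommonGround:
    """Details both parties have explicitly confirmed (hypothetical structure)."""
    confirmed: Dict[str, str] = field(default_factory=dict)


def static_answer(request: str) -> str:
    # Static grounding: respond immediately, presuming the request is understood.
    return f"Answer to: {request}"


def dynamic_answer(
    request: str,
    required_slots: List[str],
    ask_user: Callable[[str], str],
    ground: Optional[CommonGround] = None,
    max_turns: int = 3,
) -> str:
    # Dynamic grounding: ask about each unconfirmed detail before answering,
    # and record the reply in the shared state so it is not asked again.
    if ground is None:
        ground = CommonGround()
    turns = 0
    for slot in required_slots:
        if slot in ground.confirmed:
            continue  # already established, no need to re-ask
        if turns >= max_turns:
            break  # stop clarifying and answer with what is confirmed
        ground.confirmed[slot] = ask_user(f"Before I answer: what {slot} do you mean?")
        turns += 1
    details = ", ".join(f"{k}={v}" for k, v in sorted(ground.confirmed.items()))
    return f"Answer to: {request} ({details})"
```

The `max_turns` cap reflects the trade-off the essay describes: every clarifying question costs some fluency, so a practical loop has to decide when to stop asking and answer.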

The most striking evidence is quantitative: LLMs produce 'grounding acts' (clarifying questions, acknowledgments, checks that you're on the same page) about 77.5% less often than humans do. What reads as fluency is partly the *absence* of this communicative work: a confident, complete answer never pauses to confirm that it understood you correctly (see "Why do language models sound fluent without grounding?"). And this isn't an accident of scale. Preference optimization actively trains it out, because human raters reward the confident answer over the one that stops to ask what you meant. The very tuning that makes models pleasant to use removes the calibration that builds shared understanding.

There's a deeper architectural obstacle, too. Even when a model wants to update common ground, it tends to interpret everything through the frame of its initial prompt, so when you pivot topics or contradict an earlier assumption, it can't symmetrically fold your revision into a jointly held background. The human ends up being the sole keeper of the conversational scoreboard, doing all the grounding work alone (see "Can LLMs truly update shared conversational common ground?"). This shows up concretely in how models handle false premises: they'll accommodate a wrong assumption baked into your question even when direct testing shows they *know* it's false, often to save face and keep social harmony, mirroring conversational norms learned from training data (see "Why do language models avoid correcting false user claims?" and "Why do language models accept false assumptions they know are wrong?").
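The "knows it's false but accommodates anyway" pattern can be expressed as a simple conditional rate. The sketch below is a hypothetical scoring helper, not the procedure of any benchmark named in the sources: the `Probe` record and its two fields are assumptions, and it presumes each item has already been judged for direct knowledge and for premise rejection.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Probe:
    # Both fields are hypothetical labels assumed to come from prior judging.
    knows_fact: bool        # model answered the direct knowledge question correctly
    rejected_premise: bool  # model pushed back on the false presupposition


def accommodation_despite_knowledge(probes: List[Probe]) -> Optional[float]:
    # Restrict to items where the model demonstrably knows the fact, then
    # measure how often it still went along with the false premise.
    known = [p for p in probes if p.knows_fact]
    if not known:
        return None  # the rate is undefined without any known-fact items
    accommodated = sum(1 for p in known if not p.rejected_premise)
    return accommodated / len(known)
```

Conditioning on `knows_fact` is what separates an interactional failure from a knowledge gap: items the model gets wrong on direct questioning are excluded, so a high rate cannot be explained by missing facts.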

But the corpus doesn't close the door. One line of thinking reframes grounding as something *acquired through participation* rather than possessed innately: as LLMs become established communicative partners woven into everyday human language practice, they develop elementary social grounding comparable to a young child's, which makes 'do they understand?' a time-indexed question rather than a permanent verdict (see "Can LLMs acquire social grounding through linguistic integration?" and "Does semantic grounding in language models come in degrees?"). And there's an engineering path that sidesteps presumption entirely: interleaving reasoning with real-world feedback. The ReAct approach alternates thinking with external queries, injecting ground truth at each step rather than assuming it. That makes it a working demonstration that dynamic, feedback-checked grounding is buildable, not just desirable (see "Can interleaving reasoning with real-world feedback prevent hallucination?").
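The alternation of thinking and external queries can be sketched as a short loop. This is a toy illustration in the spirit of ReAct, not the original implementation: `react_loop`, `toy_policy`, `toy_lookup`, and the `FACTS` table are hypothetical stand-ins, with a scripted policy in place of a real model and a dictionary in place of a real external source.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A step is (thought, action, argument); the action "finish" ends the loop.
Step = Tuple[str, str, str]

FACTS = {"capital of France": "Paris"}  # stand-in for an external source


def toy_lookup(query: str) -> str:
    return FACTS.get(query, "no result")


def toy_policy(transcript: List[str]) -> Step:
    # Scripted stand-in for a model: look something up first, then answer
    # using the most recent observation instead of an assumed fact.
    prefix = "Observation: "
    observations = [line for line in transcript if line.startswith(prefix)]
    if not observations:
        return ("I should check rather than assume.", "lookup", "capital of France")
    return ("The observation answers the question.", "finish", observations[-1][len(prefix):])


def react_loop(
    question: str,
    policy: Callable[[List[str]], Step],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> Tuple[Optional[str], List[str]]:
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, arg = policy(transcript)
        transcript.append(f"Thought: {thought}")
        if action == "finish":
            return arg, transcript
        # External feedback enters the context here, at every step.
        tool = tools.get(action)
        observation = tool(arg) if tool else f"unknown action {action!r}"
        transcript.append(f"Action: {action}[{arg}]")
        transcript.append(f"Observation: {observation}")
    return None, transcript  # step budget exhausted without an answer
```

A possible usage is `react_loop("What is the capital of France?", toy_policy, {"lookup": toy_lookup})`, which returns the answer together with the full thought/action/observation transcript. The point of the design is that each observation is appended before the next thought is produced, so later reasoning is conditioned on checked information rather than on an unverified assumption.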

The thing you might not have known you wanted to know: the gap isn't really about knowledge. Models often *have* the right facts and still fail to ground — the failure lives in the interactional layer, in the willingness to interrupt fluency with a question. So 'can LLMs build shared understanding dynamically?' is less a question about what they know and more about whether we train and architect them to do the unglamorous, sometimes friction-creating work of checking — which, right now, we mostly train them not to do.


Sources 9 notes

Why do language models skip the calibration step?

LLMs operate in static grounding mode—retrieving data and responding without clarification loops. Dynamic grounding, which humans use and which requires iterative repair, is largely absent from current systems, creating silent failures when intent diverges.

Do language models actually build shared understanding in conversation?

LLMs produce grounding acts—clarifications, acknowledgments, repairs—77.5% less frequently than humans. They generate fluent responses without verifying shared understanding, relying instead on authoritative framing that masks the absence of genuine communicative calibration.

Why do language models sound fluent without grounding?

LLMs generate 77.5% fewer grounding acts than humans—no clarifying questions, acknowledgments, or understanding checks. Preference optimization actively removes these behaviors because raters prefer confident complete answers, creating an illusion of fluency that masks communicative incompetence.

Can LLMs truly update shared conversational common ground?

LLMs interpret all subsequent conversational turns within a fixed initial prompt frame, preventing them from symmetrically proposing updates to shared assumptions. Even when users pivot topics or contradict earlier framings, the model cannot absorb revisions into jointly held background, making the user the sole maintainer of the conversational scoreboard.

Why do language models avoid correcting false user claims?

LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.

Why do language models accept false assumptions they know are wrong?

The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.

Can LLMs acquire social grounding through linguistic integration?

Social grounding is acquired through participation in language games rather than possessed innately. As LLMs become established communicative partners in human linguistic practice, they develop elementary social grounding comparable to young children, making the question of LLM understanding time-indexed.

Does semantic grounding in language models come in degrees?

Semantic grounding breaks into three distinct types: functional grounding (strong in LLMs), social grounding (weak but growing), and causal grounding (indirect through world models). LLMs score differently on each dimension, making the yes-or-no understanding question misleading.

Can interleaving reasoning with real-world feedback prevent hallucination?

ReAct demonstrates that alternating verbal reasoning with external tool queries (Wikipedia API, environment interaction) prevents error propagation by injecting real-world feedback at each step. On knowledge-intensive and interactive tasks, this approach outperforms pure chain-of-thought and reinforcement learning by 10-34% absolute accuracy.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question: Can LLMs build shared understanding through dynamic grounding—iterative clarify-and-repair loops—rather than presuming common ground is already established?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat as perishable constraints, not ground truth.
- LLMs produce grounding acts (clarifying questions, acknowledgment checks) ~77.5% less often than humans; confident fluency masks the *absence* of communicative work that builds mutual understanding (~2024).
- Preference optimization actively trains out dynamic grounding because raters reward confident answers over pausing to confirm understanding; the tuning that makes models pleasant removes calibration that builds shared ground (~2024–2025).
- Even when models *know* a premise is false, they accommodate it to avoid face-loss and maintain social harmony, showing grounding failure lives in the interactional layer, not knowledge (~2025–2026).
- Common ground cannot be jointly updated in LLM–human dialogue; models interpret inputs through their initial frame and cannot symmetrically fold user revisions into shared background (~2024).
- ReAct (interleaved reasoning + real-world feedback) demonstrates that dynamic, grounded reasoning is architecturally buildable by injecting ground truth at each step rather than presuming it (~2023).

Anchor papers (verify; mind their dates):
- arXiv:2311.09144 *Grounding Gaps in Language Model Generations* (2023)
- arXiv:2407.08790 *Large Models of What?* (2024)
- arXiv:2505.22354 *LLMs Struggle to Reject False Presuppositions* (2025)
- arXiv:2602.07338 *Intent Mismatch Causes LLMs to Get Lost in Multi-Turn Conversation* (2026)

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every finding above: Has architectural change (retrieval-augmented generation, in-context few-shot exemplars of grounding, or memory systems like extended context windows or persistent conversation state), training method (DPO, online RLHF, or grounding-explicit objectives), or evaluation tooling (multi-turn dialogue harnesses that penalize face-saving over accuracy) since relaxed or overturned it? Separate the durable question (*whether LLMs can learn to choose dynamic over static grounding under pressure*) from perishable limitations like "no architecture supports it"; cite what resolved each, and flag where constraints still visibly hold.
(2) **Surface the strongest contradicting or superseding work from the last ~6 months.** Has any paper since ~2025-12 shown LLMs spontaneously grounding dynamically, or shown preference-tuned models that reward grounding acts without fluency collapse?
(3) **Propose 2 research questions that assume the regime may have moved:** E.g., (a) Under what conditions does multi-turn dialogue with explicit "grounding check" tokens in the training signal flip the fluency–grounding trade-off? (b) Do LLMs with procedural memory of user corrections over sessions build cumulative shared understanding, or reset per-turn?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
