INQUIRING LINE

LLMs learned what language looks like, but missed the part where you check if the other person actually understood you.

Why do LLMs lack the communicative scaffold that humans learn?

This explores why LLMs can't do the moment-to-moment work of building shared understanding — the back-and-forth checking, repairing, and calibrating that humans pick up through socialization — and where that gap comes from.

This question reads as: humans learn communication as a participatory craft — we check whether we've been understood, ask clarifying questions, repair misunderstandings, and shorten our messages once a shorthand is established. LLMs sound fluent but skip almost all of that. The corpus locates the gap in two places: what's learnable from text, and what training actively strips out.

The first answer is that the scaffold was never in the training signal. Models pick up the statistical surface of language (priming, sound symbolism) but not the *reasons* language takes the forms it does, because those reasons live in use, not in the distribution of words (see "Why do language models fail at communicative optimization?"). That is why they fail at the pragmatic layer of implicature, presupposition, and reading what is left unsaid: they recognize ambiguity with 32% accuracy where humans reach 90% (see "Why do LLMs fail at understanding what remains unsaid?"). And it is why multimodal models understand efficient, compressed language as listeners but do not spontaneously produce it as speakers; they shorten only when explicitly told to, and then only partially (see "Why don't LLMs shorten messages like humans do?").

The sharper finding is that the scaffold isn't merely missing; it's suppressed. Humans constantly perform "grounding acts": acknowledgments, repairs, understanding-checks. LLMs produce these 77.5% less often, and the apparent fluency is partly *because* they skip them (see "Why do language models sound fluent without grounding?"). Crucially, preference optimization removes the behavior: raters reward confident, complete answers, so the training loop actively trains away the hesitation and clarification that real grounding requires (see "Do language models actually build shared understanding in conversation?"). The result is "static grounding" (presuming shared context and answering) instead of "dynamic grounding," the iterative repair loop humans run by default (see "Why do language models skip the calibration step?").

This is also why models fail precisely when understanding has to be built over time. In multi-turn conversations where intent is revealed gradually, all major LLMs drop ~39% in performance on average: they lock into a premature guess early and can't recover, because they never ran the calibration step that would have surfaced the mismatch (see "Why do language models fail in gradually revealed conversations?"). Where systems do attempt test-time learning, the working designs reintroduce the missing scaffold deliberately, using structured self-dialogue plus a human in the loop to resolve conflicts the system can't adjudicate alone (see "Can LLMs learn reliably at test time without human oversight?").

Here's the part you might not expect: the corpus frames this as developmental, not architectural. One line argues humans and LLMs are shaped by the *same* shared symbolic system; the difference is that only humans develop reflexive agency through socialization, the lived experience of being a participant who can be wrong and must check (see "Do LLMs develop the same kind of mind as humans?"). Borrowing Habermas's observer/participant distinction, the two look categorically different from the outside but draw on the same substrate from inside a conversation, making the gap structural rather than absolute (see "Do humans and LLMs differ fundamentally or just superficially?"). So the missing scaffold isn't a missing module; it's the absence of the apprenticeship in which humans learn that meaning is something you build *with* someone, not something you presume.

Sources (10 notes)

Why do language models fail at communicative optimization?

LLMs successfully replicate statistical regularities learnable from text distributions (sound symbolism, priming) but fail at principles requiring pragmatic optimization (word length economy, discourse inference). The gap reveals that communicative logic—why language has certain forms—isn't present as a trainable signal.

Why do LLMs fail at understanding what remains unsaid?

Research shows LLMs pattern-match on explicit language but cannot reason about implicatures, presuppositions, or speaker intentions. They fail at scalar implicature adaptation, ambiguity recognition (32% vs 90% human accuracy), and implicit warrant validation in arguments—core features of pragmatic competence.

Why don't LLMs shorten messages like humans do?

GPT-4, Gemini, and Claude understand efficient language as listeners but don't produce it as speakers. Only explicit instruction to reduce message length and maintain lexical consistency produces partial adaptation, revealing a gap between comprehension and generation.

Why do language models sound fluent without grounding?

LLMs generate 77.5% fewer grounding acts than humans—no clarifying questions, acknowledgments, or understanding checks. Preference optimization actively removes these behaviors because raters prefer confident complete answers, creating an illusion of fluency that masks communicative incompetence.

Do language models actually build shared understanding in conversation?

LLMs produce grounding acts—clarifications, acknowledgments, repairs—77.5% less frequently than humans. They generate fluent responses without verifying shared understanding, relying instead on authoritative framing that masks the absence of genuine communicative calibration.

Why do language models skip the calibration step?

LLMs operate in static grounding mode—retrieving data and responding without clarification loops. Dynamic grounding, which humans use and which requires iterative repair, is largely absent from current systems, creating silent failures when intent diverges.

Why do language models fail in gradually revealed conversations?

Across 200,000+ conversations, all major LLMs show a 39% average performance drop in multi-turn settings because they lock into incorrect early guesses. Agent mitigations recover only 15-20% of this loss.

Can LLMs learn reliably at test time without human oversight?

ARIA demonstrates that LLMs can adapt during inference through three integrated components: structured self-dialogue for uncertainty assessment, timestamped knowledge bases for conflict detection, and human-mediated resolution queries. Autonomous systems fail at reconciling contradictory rules because the correct choice depends on context outside the system.

Do LLMs develop the same kind of mind as humans?

Both humans and LLMs are shaped by the same intersubjective symbolic system, but only humans develop reflexive agency through socialization. This absence produces measurable differences in how AI argues without declaring its position or reflecting on its own assumptions.

Do humans and LLMs differ fundamentally or just superficially?

Applying Habermas's observer/participant distinction to AI: from outside, humans and LLMs are utterly different; from within shared discourse, both draw on the same symbolic substrate, making the difference structural rather than absolute.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey (5.78 match)
Conversational Alignment with Artificial Intelligence in Context (5.76 match)
Intent Mismatch Causes LLMs to Get Lost in Multi-Turn Conversation (4.14 match)
Can LLMs Ground when they (Don't) Know: A Study on Direct and Loaded Political Questions (3.35 match)
Grounding Gaps in Language Model Generations (2.53 match)
Word Meanings in Transformer Language Models (2.52 match)
Computational structuralism: Toward a formal theory of meaning in the age of digital intelligence (2.52 match)
Probing Structured Semantics Understanding and Generation of Language Models via Question Answering (2.48 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-examining whether LLMs have begun to acquire communicative scaffolding—repair loops, grounding acts, dynamic understanding-building—since mid-2023. The question remains: do LLMs develop the participatory communication practices humans learn through socialization, or does architectural/training-time suppression make this categorically unlikely?

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2026; treat these as perishable constraints:
- LLMs produce grounding acts (acknowledgments, repairs, clarification checks) 77.5% less often than humans; preference optimization actively trains away hesitation (2023–2024).
- Multi-turn conversations where intent emerges gradually trigger ~39% performance drop; models lock into premature assumptions and cannot recover via calibration (~2025).
- Multimodal models understand efficient language as listeners but do not spontaneously produce compressed speech; compression appears only when explicitly prompted (~2024).
- Ambiguity recognition, one measure of pragmatic inference alongside implicature and presupposition, reaches ~32% accuracy where humans reach ~90%; this gap correlates with training on text-only distributions lacking use-context (~2022–2023).
- Test-time learning systems that reintroduce structured self-dialogue plus human arbitration show gains; this suggests the scaffold is trainable but not emergent from standard RLHF (~2025).

Anchor papers (verify; mind their dates):
- arXiv:2311.09144 Grounding Gaps in Language Model Generations (2023)
- arXiv:2408.01417 Talk Less, Interact Better: Evaluating In-context Conversational Adaptation in Multimodal (2024)
- arXiv:2505.06120 LLMs Get Lost In Multi-Turn Conversation (2025)
- arXiv:2507.17131 Enabling Self-Improving Agents to Learn at Test Time With Human-In-The-Loop Guidance (2025)

Your task:
(1) RE-TEST EACH CONSTRAINT. Judge whether scaled models (o1, Gemini 2.0, Claude 3.5+), retrieval-augmented generation, multi-agent orchestration, or new training regimes (e.g., process reward models, outcome supervision favoring dialogue repair) have since **relaxed** the 77.5% grounding-act deficit, the 39% multi-turn drop, or the pragmatic inference ceiling. Separate the durable question (communicative apprenticeship as developmental) from perishable claims (current model ceilings). Cite what resolved each, plainly flag what still holds.
(2) Surface the strongest **contradicting or superseding** work from the last ~6 months—particularly any showing spontaneous repair behavior, emergent multi-agent dialogue, or pragmatic reasoning gains in frontier models.
(3) Propose **2 research questions** that assume the regime may have moved: e.g., "Does process supervision (reward at intermediate steps) recover grounding acts suppressed by outcome RLHF?" or "Can multimodal grounding from vision restore pragmatic inference lost in text-only pretraining?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
