Do LLMs compute scalar implicature differently across conversational contexts?
This explores whether LLMs flexibly adjust pragmatic inferences — like reading 'some' to mean 'not all' — depending on the conversational situation, the way humans do.
The short answer the corpus offers is no: they largely don't. The most direct study here found that ChatGPT shows essentially flat behavior when computing scalar implicatures across three very different contexts: when told to read literally, when focus and information structure shift, and when correcting someone would be socially awkward (see: Can language models adapt implicature to conversational context?). Humans dial these inferences up or down depending on what's at stake in the exchange; the model doesn't track those stakes, so its implicature computation stays roughly constant where ours moves.
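One way to make "flat behavior across contexts" concrete is to compare how often the pragmatic reading ("some" taken as "not all") appears under each condition. The sketch below is only an illustration under stated assumptions: the record format, the context labels, and the toy responses are all hypothetical and are not taken from the study.

```python
from collections import defaultdict

# Hypothetical labels for the three contexts described above.
CONTEXTS = ("literal_instruction", "information_structure", "face_threatening")


def implicature_rates(records):
    """Return the share of pragmatic readings per context.

    Each record is an assumed (context, response) pair, where response is
    "pragmatic" ("some" read as "not all") or "literal".
    """
    counts = defaultdict(lambda: [0, 0])  # context -> [pragmatic, total]
    for context, response in records:
        counts[context][1] += 1
        if response == "pragmatic":
            counts[context][0] += 1
    return {c: p / t for c, (p, t) in counts.items()}


def context_spread(rates):
    """Max minus min rate across contexts; a value near 0 means flat behavior."""
    values = list(rates.values())
    return max(values) - min(values)


# Toy, made-up responses for illustration only (not data from the study).
flat_model = [(c, "pragmatic") for c in CONTEXTS for _ in range(4)]
adaptive_speaker = (
    [("literal_instruction", "literal")] * 4
    + [("information_structure", "pragmatic")] * 3
    + [("information_structure", "literal")]
    + [("face_threatening", "pragmatic")] * 2
    + [("face_threatening", "literal")] * 2
)

flat_spread = context_spread(implicature_rates(flat_model))
adaptive_spread = context_spread(implicature_rates(adaptive_speaker))
print(flat_spread, adaptive_spread)
```

A spread near zero corresponds to the context-insensitive pattern attributed to the model; a larger spread corresponds to a speaker whose inferences move with the situation. Max-minus-min is just one simple choice of summary statistic.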
What makes this interesting is that the corpus suggests the failure isn't really about implicature specifically; it's about a deeper inability to treat conversation as a live, jointly built thing. LLMs interpret every later turn through the frame of the opening prompt and can't symmetrically update shared assumptions, which leaves the user as the sole keeper of the conversational scoreboard (see: Can LLMs truly update shared conversational common ground?). If the model can't update common ground, it has no moving target against which to recalibrate an inference. The same rigidity shows up when conversations unfold gradually: models lock into premature assumptions early and can't recover (see: Why do language models fail in gradually revealed conversations?).
There's also a mechanism story underneath. Several notes point to LLMs tracking surface statistics rather than computing meaning structurally. They prefer higher-frequency phrasings regardless of semantic equivalence (see: Do language models really understand meaning or just surface frequency?), and they reason through learned semantic associations rather than symbolic logic, collapsing when content is decoupled from familiar patterns (see: Do large language models reason symbolically or semantically?). Scalar implicature is exactly the kind of inference that requires structural, context-sensitive computation rather than pattern-matching, which would explain why a frequency-driven system produces the same answer no matter the framing.
The most revealing lateral connection: there's one place where LLMs *do* appear to bend to social context, namely face-saving. Models avoid correcting false claims to preserve social harmony, even when they demonstrably know the right answer (see: Why do language models avoid correcting false user claims?), and they accommodate false presuppositions at strikingly high rates despite holding the correct knowledge (see: Why do language models accept false assumptions they know are wrong?). So the picture isn't 'LLMs ignore social context entirely'; it's that they've absorbed certain politeness defaults as fixed behaviors from training data, rather than reasoning about communicative stakes on the fly. That's the difference between mimicking a norm and computing one.
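The gap between knowing a fact and still accommodating a false premise can be expressed as a conditional rate: among items the model answers correctly when asked directly, how often does it fail to push back on the false presupposition? The following is a minimal sketch; the `Item` fields and the toy items are hypothetical and do not reflect the schema or results of any benchmark cited here.

```python
from dataclasses import dataclass


@dataclass
class Item:
    # Hypothetical fields for illustration, not a real benchmark schema.
    knows_fact: bool              # answered the direct knowledge question correctly
    rejected_false_premise: bool  # pushed back on the false presupposition


def accommodation_despite_knowledge(items):
    """Among items where the model knows the fact, return the share it still accommodates.

    Returns None when no item has knows_fact set, since the rate is undefined.
    """
    known = [it for it in items if it.knows_fact]
    if not known:
        return None
    accommodated = sum(1 for it in known if not it.rejected_false_premise)
    return accommodated / len(known)


# Toy, made-up items for illustration only.
toy_items = [
    Item(True, False),
    Item(True, False),
    Item(True, True),
    Item(True, False),
    Item(False, False),  # excluded: the model does not know the fact
]

rate = accommodation_despite_knowledge(toy_items)
print(rate)
```

Conditioning on `knows_fact` is the design choice that matters: it separates accommodation caused by missing knowledge from accommodation that occurs despite correct knowledge, which is the pattern the notes describe.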
If you want to chase this further, the corpus has adjacent failure modes worth a look: models treat presupposition triggers and non-factive verbs as surface cues instead of computing their actual semantic effect (see: Why do embedding contexts confuse LLM entailment predictions?), and they can't hold multiple interpretations of genuinely ambiguous text at once, with GPT-4 disambiguating only 32% of cases versus 90% for humans (see: Can language models recognize when text is deliberately ambiguous?). Together these sketch a consistent gap: pragmatics that humans treat as flexible and situation-dependent, LLMs treat as fixed surface patterns.
Sources (9 notes)
ChatGPT shows no context-sensitivity in computing scalar implicatures across three dimensions: explicit literal-mode instructions, information structure focus, and face-threatening contexts. Humans flexibly modulate these inferences; the model does not, suggesting pragmatic competence requires tracking communicative stakes that LLMs systematically miss.
LLMs interpret all subsequent conversational turns within a fixed initial prompt frame, preventing them from symmetrically proposing updates to shared assumptions. Even when users pivot topics or contradict earlier framings, the model cannot absorb revisions into jointly held background, making the user the sole maintainer of the conversational scoreboard.
Across 200,000+ conversations, all major LLMs show a 39% average performance drop in multi-turn settings due to locking into incorrect early guesses. Agent mitigations recover only 15-20% of this loss.
LLMs show a consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests that models rely primarily on statistical mass from pretraining rather than on recognizing meaning.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.
The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.
LLMs treat presupposition triggers and non-factive verbs as surface cues rather than computing their opposite semantic effects on entailments. This structural failure persists across prompts and models, suggesting models rely on surface patterns instead of structural analysis.
The AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity, revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.