Can language models adapt implicature to conversational context?
Do large language models flexibly modulate scalar implicatures based on information structure, face-threatening situations, and explicit instructions, as humans do? The answer indicates whether their pragmatic computation is truly context-sensitive or merely literal.
Scalar implicatures are a core pragmatic phenomenon: when someone says "some," it typically implies "not all." This is not semantically entailed but pragmatically inferred based on the maxim of quantity — if all were true, the speaker would have said "all." Human computation of these implicatures is sensitive to communicative context in documented ways.
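The quantity reasoning above can be made concrete with a toy model. The sketch below uses the Rational Speech Acts (RSA) formulation, which is one common formalization and not something this note's source relies on. The two-state world, the uniform prior, and the rationality parameter `alpha` are all illustrative assumptions.

```python
# Toy Rational Speech Acts (RSA) model of the "some" -> "not all" inference.
# Assumptions (not from the source): two world states, two utterances,
# a uniform prior, and a speaker rationality parameter alpha = 1.

STATES = ["some_not_all", "all"]
UTTERANCES = ["some", "all"]

# Literal semantics: "some" is true in both states, "all" only in the "all" state.
MEANING = {
    "some": {"some_not_all": 1.0, "all": 1.0},
    "all": {"some_not_all": 0.0, "all": 1.0},
}
PRIOR = {"some_not_all": 0.5, "all": 0.5}


def normalize(scores):
    """Rescale non-negative scores so they sum to 1."""
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}


def literal_listener(utterance):
    """Interpret the utterance by its truth conditions alone."""
    return normalize({s: MEANING[utterance][s] * PRIOR[s] for s in STATES})


def speaker(state, alpha=1.0):
    """Prefer utterances that lead a literal listener to the true state."""
    return normalize({u: literal_listener(u)[state] ** alpha for u in UTTERANCES})


def pragmatic_listener(utterance, alpha=1.0):
    """Reason about which state would have made the speaker say this."""
    return normalize({s: speaker(s, alpha)[utterance] * PRIOR[s] for s in STATES})


posterior = pragmatic_listener("some")
print(posterior)
```

Under these assumptions the pragmatic listener assigns about 0.75 to the "some but not all" state after hearing "some," up from 0.5 under the literal reading. The shift comes entirely from reasoning that a speaker who knew "all" was true would more likely have said "all."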
Three experiments from Pragmatic Implicature Processing in ChatGPT (Ruytenbeek et al. 2024) tested whether ChatGPT shows human-like context-sensitivity in implicature. ChatGPT failed to match the human pattern in all three:
Generalized conversational implicatures: Humans can inhibit implicature computation when explicitly instructed to interpret utterances literally. ChatGPT showed no such difference: it did not switch between pragmatic and semantic interpretation when instructed.
Information structure sensitivity: For scalar implicatures, humans compute more "some but not all" inferences when the scalar term is in the information focus (the direct answer to an explicit question) than when it is in the background. ChatGPT showed no sensitivity to information structure.
Face context: Human scalar implicature rates differ between face-threatening and face-boosting contexts. If a poem is being evaluated and someone says "some people loved it," the implicature ("not all loved it") is more prominent in face-boosting contexts. ChatGPT showed no differential response to face context.
These are not exotic phenomena. They are the basic flexibility that allows human conversation to be more than literal string exchange. Pragmatic competence requires tracking the communicative context — who is asking, why, what stakes are involved — and modulating interpretation accordingly. ChatGPT's failure is not isolated to edge cases; it extends to routine context-modulation effects that appear in any human conversation.
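Each of the three contrasts above comes down to comparing how often the "some but not all" reading is chosen in one condition versus another. The following is a minimal sketch of that comparison, not the scoring procedure used in the cited experiments. The `Trial` type, the condition labels, and every response in the example data are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    condition: str   # hypothetical label, e.g. "focus" or "background"
    pragmatic: bool  # True if the response endorsed "some but not all"


def implicature_rate(trials, condition):
    """Share of trials in one condition that received the pragmatic reading."""
    relevant = [t for t in trials if t.condition == condition]
    if not relevant:
        raise ValueError(f"no trials for condition {condition!r}")
    return sum(t.pragmatic for t in relevant) / len(relevant)


def context_effect(trials, cond_a, cond_b):
    """Difference in implicature rate between two conditions."""
    return implicature_rate(trials, cond_a) - implicature_rate(trials, cond_b)


# Invented responses for illustration only; these are not experimental data.
context_sensitive = (
    [Trial("focus", True)] * 8 + [Trial("focus", False)] * 2
    + [Trial("background", True)] * 5 + [Trial("background", False)] * 5
)
context_blind = (
    [Trial("focus", True)] * 7 + [Trial("focus", False)] * 3
    + [Trial("background", True)] * 7 + [Trial("background", False)] * 3
)

sensitive_effect = context_effect(context_sensitive, "focus", "background")
blind_effect = context_effect(context_blind, "focus", "background")
print(sensitive_effect, blind_effect)
```

A responder that tracks context yields a clearly nonzero difference, while one that ignores context yields a difference near zero even if its overall implicature rate is high. The same function applies to the instruction and face contrasts by swapping the condition labels.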
A complementary finding in non-literal language: GPT-4o significantly overestimates irony likelihood in emojis compared to human perception (median irony scores significantly higher, W = 918.5, p < .001). When prompted to rate the likelihood of specific emojis being used ironically, GPT-4o considers the same emojis more likely to express irony than humans do — possibly due to disproportionate representation of ironic emoji usage in training data. Demographic information in prompts does not substantially change GPT-4o's irony classification. This parallels the implicature failure: the model cannot calibrate to actual human pragmatic norms for non-literal communication, whether the signal is scalar implicature or visual irony.
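For readers unfamiliar with the W statistic quoted above, here is a small sketch of how a two-sample Wilcoxon rank-sum (Mann-Whitney) statistic can be computed. The note does not say which variant of the Wilcoxon test was used, so the two-sample form is an assumption, and the ratings below are invented placeholders rather than data from the study.

```python
# Pairwise form of the two-sample Wilcoxon rank-sum (Mann-Whitney) statistic.
# Assumption: W counts the (x, y) pairs in which x exceeds y,
# with each tied pair counted as one half.


def rank_sum_w(xs, ys):
    """Count pairs where a value from xs beats a value from ys (ties = 0.5)."""
    w = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                w += 1.0
            elif x == y:
                w += 0.5
    return w


model_ratings = [3, 4, 5]  # hypothetical irony ratings from a model
human_ratings = [1, 2, 3]  # hypothetical irony ratings from humans

w = rank_sum_w(model_ratings, human_ratings)
max_w = len(model_ratings) * len(human_ratings)
print(w, max_w)  # prints: 8.5 9
```

Under this definition, W close to its maximum means nearly every rating in the first group exceeds nearly every rating in the second. A half-integer value can only arise when some ratings are tied across the groups.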
Inquiring lines that use this note as a source 26
This note is a source for these synthesized inquiries.
- Why do published prose training data omit solicitation as a discourse property?
- Can LLMs infer situational context the way humans do pragmatically?
- What are Gricean maxims and why do language models violate them?
- Why does context collapse pose risks in high-stakes conversations?
- How do fixed pragmatic templates prevent models from understanding context?
- Can language models adapt irony detection to specific communicative contexts?
- What makes relational structure sufficient for generating contextually appropriate discourse?
- Why do conversational pivots require explicit re-prompting instead of natural evolution?
- How do politeness strategies depend on semantic ambiguity between literal and intended meaning?
- How does semantic ambiguity differ from structural ambiguity in language?
- What reader assumptions underlie anaphoric versus cataphoric discourse patterns?
- Why do context-sensitive languages transfer better than regular or context-free languages?
- Do LLMs compute scalar implicature differently across conversational contexts?
- Can language models distinguish explicit from implicit discourse relations?
- Do language models calibrate to actual human pragmatic norms?
- Why do explicit discourse connectives work when implicit relations fail?
- Why do language models struggle with context-dependent pragmatic interpretation?
- Why do language models avoid directness when face-saving rather than for civility?
- Why do only context-sensitive formal languages transfer effectively to natural language?
- Can presupposition projection strength vary by context in embeddings?
- Why do non-factive verbs and triggers both fool language models?
- Why do language models treat presupposition triggers as categorical patterns?
- Why do explicit linguistic markers override semantic computation in models?
- What role does discourse structure play in determining at-issueness?
- How do linguistic norms for expressing certainty vary across languages and models?
- Can pragmatic competence emerge from text exposure alone without interactive grounding?
Related concepts in this collection 3
- Why does ChatGPT fail at implicit discourse relations?
  ChatGPT excels when discourse connectives are present but drops to 24% accuracy without them. What does this gap reveal about how LLMs actually process meaning and logical relationships?
  scalar implicatures are implicit inferences; extends this insight
- Why do language models fail at communicative optimization?
  LLMs excel at learning surface statistical patterns from text but struggle with deeper principles of how language achieves efficient communication. What distinguishes these two types of linguistic knowledge?
  implicature computation is a communicative optimization principle not captured by distribution
- Why do speakers need to actively calibrate shared reference?
  Explores whether using the same words guarantees speakers mean the same thing. Investigates how referential grounding differs across people and what collaborative work is needed to establish true understanding.
  context-sensitivity in implicature is part of the calibration that LLMs skip
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Pragmatic Implicature Processing in ChatGPT
- The Goldilocks of Pragmatic Understanding: Fine-Tuning Strategy Matters for Implicature Resolution by LLMs
- Do Large Language Models Understand Conversational Implicature -- A case study with a chinese sitcom
- Conversational Alignment with Artificial Intelligence in Context
- Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds
- Can LLMs Ground when they (Don't) Know: A Study on Direct and Loaded Political Questions
- Meanings are like Onions: a Layered Approach to Metaphor Processing
- Intent Mismatch Causes LLMs to Get Lost in Multi-Turn Conversation
Original note title
llm scalar implicature computation fails to adapt to communicative context