Can LLMs infer situational context the way humans do pragmatically?
This note explores whether LLMs perform pragmatic inference (reading the unstated meaning that depends on situation, speaker intent, and conversational stakes) the way humans do, or whether they only mimic its surface.
This question is really about pragmatics: the human knack for inferring what's *meant* from what's *said*, given who's speaking, to whom, and why. The corpus is unusually direct here, and the verdict leans skeptical: LLMs reproduce the outputs of pragmatic reasoning without tracking the situational variables that drive it. The clearest case is scalar implicature: when you say "some of the students passed," a human infers "not all," but flexibly drops that inference when the context (a literal-minded instruction, a face-threatening situation) calls for it. ChatGPT computes the implicature but shows no sensitivity to those contextual dials (see "Can language models adapt implicature to conversational context?"). The pragmatic machinery is present; the situational steering wheel is not.
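One way to make "sensitivity to contextual dials" concrete is to compare how often the implicature is drawn under each context. The sketch below is illustrative only: the function names, the context labels, and the toy judgments are assumptions made for this example, not code or data from the cited note.

```python
from collections import defaultdict

def implicature_rates(judgments):
    """Map each context to the share of trials where 'some' was read as 'not all'.

    judgments: iterable of (context_label, drew_implicature) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # context -> [implicatures drawn, trials]
    for context, drew in judgments:
        counts[context][0] += int(drew)
        counts[context][1] += 1
    return {c: hits / total for c, (hits, total) in counts.items()}

def context_sensitivity(rates, baseline="neutral"):
    """Largest absolute shift in implicature rate relative to the baseline context."""
    base = rates[baseline]
    return max((abs(r - base) for c, r in rates.items() if c != baseline), default=0.0)

# Hypothetical toy judgments, invented for illustration; NOT experimental results.
toy_flexible = (
    [("neutral", True)] * 4
    + [("literal", True)] + [("literal", False)] * 3
    + [("face_threat", True)] * 2 + [("face_threat", False)] * 2
)
toy_flat = [(c, True) for c in ("neutral", "literal", "face_threat") for _ in range(4)]

print(context_sensitivity(implicature_rates(toy_flexible)))
print(context_sensitivity(implicature_rates(toy_flat)))
```

Under this toy measure, a responder that draws the implicature at the same rate in every context scores zero, however reliably it computes the implicature itself. That is the pattern the paragraph above describes: the inference is present, the modulation is not.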
The same shape recurs across adjacent phenomena. Models systematically fail to recognize that text is *deliberately ambiguous*: GPT-4 correctly disambiguates only 32% of cases, against 90% for humans, because models can't hold two interpretations live at once and pick based on context (see "Can language models recognize when text is deliberately ambiguous?"). They also misread presupposition triggers and non-factive verbs ("he *pretended* to leave" vs. "he *managed* to leave"), treating these context-shifting cues as surface patterns rather than computing how they flip an inference (see "Why do embedding contexts confuse LLM entailment predictions?"). And in open-ended perspective-taking, LLMs default to surface strategies instead of genuinely modeling another mind. Notably, architectures that *force* explicit belief-tracking outperform LLMs alone, suggesting the gap is structural, not just a matter of more training (see "Do large language models genuinely simulate mental states?").
Why this pattern? Several notes point to the same root: these models reason by semantic association over their training distribution, not by manipulating structure. Strip the familiar semantics out of a task and performance collapses even when the rules are handed to the model (see "Do large language models reason symbolically or semantically?"). Entailment judgments lean on whether a conclusion *looks attested* in training data rather than whether the premise actually supports it (see "Do LLMs predict entailment based on what they memorized?"). Pragmatic inference demands exactly the structural, context-conditional computation these failures reveal is missing, which is why "potemkin understanding" shows up: a model can correctly *explain* a pragmatic concept, fail to *apply* it, and even recognize the failure, a triple incoherence no human shows (see "Can LLMs understand concepts they cannot apply?").
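The attestation point can be sketched as a random-premise probe: re-pair each hypothesis with an unrelated premise and check whether the predicted entailment rate drops. This is a minimal sketch under stated assumptions. The predictor functions are hypothetical stand-ins for a model, and the sentence pairs are invented for illustration; none of it is the procedure or data of the cited study.

```python
import random

def entailment_rate(predict, pairs):
    """Share of (premise, hypothesis) pairs that `predict` labels as entailed."""
    return sum(predict(p, h) for p, h in pairs) / len(pairs)

def random_premise_pairs(pairs, seed=0):
    """Re-pair every hypothesis with a premise taken from a different item.

    Needs at least two pairs; premises are assumed to be distinct strings.
    """
    rng = random.Random(seed)
    premises = [p for p, _ in pairs]
    shuffled = []
    for i, (_, hypothesis) in enumerate(pairs):
        others = premises[:i] + premises[i + 1:]
        shuffled.append((rng.choice(others), hypothesis))
    return shuffled

# Invented toy items for illustration; NOT benchmark data.
pairs = [
    ("Ana bought a car.", "Ana owns a car."),
    ("The lake froze.", "The lake is cold."),
    ("Sam won the race.", "Sam competed in the race."),
]
SUPPORTED = set(pairs)            # pairs where the premise supports the hypothesis
ATTESTED = {h for _, h in pairs}  # hypotheses treated as "seen in training"

def premise_sensitive(premise, hypothesis):
    """Hypothetical predictor that actually checks the premise."""
    return (premise, hypothesis) in SUPPORTED

def attestation_only(premise, hypothesis):
    """Hypothetical predictor that ignores the premise entirely."""
    return hypothesis in ATTESTED

shuffled = random_premise_pairs(pairs)
for name, predict in [("premise_sensitive", premise_sensitive),
                      ("attestation_only", attestation_only)]:
    print(name, entailment_rate(predict, pairs), entailment_rate(predict, shuffled))
```

A predictor that tracks the premise should lose its entailment predictions once premises are randomized. One that keeps predicting entailment is responding to the hypothesis alone, which is the signature the paragraph above attributes to attestation-driven judgments.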
Here's the twist worth carrying away. The deficit is selective, not total. LLMs handle *causal* relations well because causal connectives are explicit and frequent in text, while *temporal* ordering, which must be inferred from context, trails behind (see "Why do LLMs handle causal reasoning better than temporal reasoning?"). Pragmatics is the hard case for the same reason temporal reasoning is: the load-bearing signal is *implicit*, exactly what compression-from-text doesn't capture well. And yet, on the modeling side, LLMs fine-tuned on psychology-experiment data predict human decisions better than purpose-built cognitive theories (see "Can language models learn to model human decision making?"). So the picture splits: a model can be a strong *external predictor* of how situated humans behave while remaining a poor *internal performer* of the situated inference itself.
If you want the deepest cut, two notes reframe the whole question. One argues LLMs operationalize Saussure's *langue* (meaning as pure relational structure compressed from text, with no external referents), which would explain why situational grounding is precisely what's absent (see "Can language models learn meaning without engaging the world?"). The other, via Habermas, suggests that from the *observer's* view humans and LLMs differ categorically, but as *participants in shared discourse* they draw on the same symbolic substrate, making the difference structural rather than absolute (see "Do humans and LLMs differ fundamentally or just superficially?"). Read together, they suggest the answer isn't a flat "no" but "not the same way": LLMs infer from the relational shadow language casts, while human pragmatics is anchored in a situation the model never actually occupies.
Sources (11 notes)
ChatGPT shows no context-sensitivity in computing scalar implicatures across three dimensions: explicit literal-mode instructions, information structure focus, and face-threatening contexts. Humans flexibly modulate these inferences; the model does not, suggesting pragmatic competence requires tracking communicative stakes that LLMs systematically miss.
The AmbiEnt benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity, revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.
LLMs treat presupposition triggers and non-factive verbs as surface cues rather than computing their opposite semantic effects on entailments. This structural failure persists across prompts and models, suggesting models rely on surface patterns instead of structural analysis.
ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
McKenna et al. (2023) identified attestation bias: LLMs predict entailment based on whether the hypothesis appears in training data, not whether the premise actually supports it. Random premise experiments show models maintain high entailment predictions when hypotheses are attested, indicating they respond to memorized propositions rather than premise-hypothesis relationships.
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
ChatGPT excels at causal relations but struggles with temporal ordering because causal connectives are explicit and frequent in training data, while temporal order is often implicit and must be inferred contextually.
LLMs fine-tuned on psychology experiment data predict human behavior more accurately than theory-driven models in decision tasks, capture individual differences in their embeddings, and transfer learning across tasks without task-specific design.
Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.
Applies Habermas's observer/participant distinction to AI: from outside, humans and LLMs are utterly different; from within shared discourse, both draw on the same symbolic substrate, making the difference structural rather than absolute.