INQUIRING LINE

Can LLMs infer situational context the way humans do pragmatically?

This line explores whether LLMs perform pragmatic inference (reading the unstated meaning that depends on situation, speaker intent, and conversational stakes) the way humans do, or whether they only mimic its surface.


This question is really about pragmatics: the human knack for inferring what's *meant* from what's *said*, given who's speaking, to whom, and why. The corpus is unusually direct here, and the verdict leans skeptical: LLMs reproduce the outputs of pragmatic reasoning without tracking the situational variables that drive it. The clearest case is scalar implicature. When you say "some of the students passed," a human infers "not all," but flexibly drops that inference when the context (a literal-minded instruction, a face-threatening situation) calls for it. ChatGPT computes the implicature but shows no sensitivity to those contextual dials (see "Can language models adapt implicature to conversational context?"). The pragmatic machinery is present; the situational steering wheel is not.
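One way to make "no sensitivity to contextual dials" concrete is to compare how often the "not all" reading is drawn under a neutral condition with how often it is drawn under a condition meant to suppress it. The sketch below is a hypothetical illustration, not the protocol of the cited study: the `Trial` record, the condition labels, and the toy responses are all assumptions made up for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    condition: str     # hypothetical label, e.g. "neutral" or "literal_instruction"
    implicature: bool  # True if "some" was read as "not all"


def implicature_rates(trials):
    """Share of trials per condition that drew the 'not all' reading."""
    counts = defaultdict(lambda: [0, 0])  # condition -> [implicature count, total]
    for t in trials:
        counts[t.condition][0] += int(t.implicature)
        counts[t.condition][1] += 1
    return {c: hits / total for c, (hits, total) in counts.items()}


def context_sensitivity(trials, baseline="neutral"):
    """Largest absolute shift in implicature rate relative to the baseline."""
    rates = implicature_rates(trials)
    base = rates[baseline]
    return max(
        (abs(r - base) for c, r in rates.items() if c != baseline),
        default=0.0,
    )


# Toy, made-up responses: one responder adapts to context, the other does not.
flexible = (
    [Trial("neutral", True)] * 8 + [Trial("neutral", False)] * 2
    + [Trial("literal_instruction", True)] * 2
    + [Trial("literal_instruction", False)] * 8
)
flat = (
    [Trial("neutral", True)] * 8 + [Trial("neutral", False)] * 2
    + [Trial("literal_instruction", True)] * 8
    + [Trial("literal_instruction", False)] * 2
)

print(round(context_sensitivity(flexible), 2))
print(round(context_sensitivity(flat), 2))
```

On this toy data the flexible responder shifts its implicature rate between conditions while the flat one does not. The flat profile is the pattern the note attributes to ChatGPT: the implicature is computed, but its rate does not move with context.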

The same shape recurs across adjacent phenomena. Models systematically fail to recognize that text is *deliberately ambiguous*: GPT-4 correctly disambiguates only 32% of cases against 90% for humans, apparently because it cannot hold two interpretations live at once and pick between them based on context (see "Can language models recognize when text is deliberately ambiguous?"). They also misread presupposition triggers and non-factive verbs ("he *pretended* to leave" vs. "he *managed* to leave"), treating these context-shifting cues as surface patterns rather than computing how they flip an inference (see "Why do embedding contexts confuse LLM entailment predictions?"). And in open-ended perspective-taking, LLMs default to surface strategies instead of genuinely modeling another mind. Notably, architectures that *force* explicit belief-tracking outperform LLMs alone, suggesting the gap is structural, not just a matter of more training (see "Do large language models genuinely simulate mental states?").

Why this pattern? Several notes point to the same root: these models reason by semantic association over their training distribution, not by manipulating structure. Strip the familiar semantics out of a task and performance collapses even when the rules are handed to the model (see "Do large language models reason symbolically or semantically?"). Entailment judgments lean on whether a conclusion *looks attested* in training data rather than whether the premise actually supports it (see "Do LLMs predict entailment based on what they memorized?"). Pragmatic inference demands exactly the structural, context-conditional computation that these failures show to be missing. That is why "Potemkin understanding" shows up: a model can correctly *explain* a pragmatic concept, fail to *apply* it, and even recognize the failure, a triple incoherence no human shows (see "Can LLMs understand concepts they cannot apply?").
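The logic behind the "looks attested" finding can be sketched as a simple tally over a random-premise test. If predictions tracked the premise, pairing each hypothesis with an unrelated premise should push entailment predictions toward zero whether or not the hypothesis is familiar; a gap between familiar and unfamiliar hypotheses points to attestation bias instead. This is an illustrative sketch under stated assumptions, not code from the cited work: the `Item` fields and the toy predictions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    # Hypothetical flag: hypothesis judged likely to appear in training text.
    hypothesis_attested: bool
    # Model's label when the hypothesis is paired with a RANDOM premise.
    predicted_entailed: bool


def entail_rate(items, attested):
    """Fraction of items in one group that the model labeled 'entailed'."""
    subset = [i for i in items if i.hypothesis_attested == attested]
    if not subset:
        raise ValueError("no items in this group")
    return sum(i.predicted_entailed for i in subset) / len(subset)


def attestation_gap(items):
    """Entailment rate for attested minus unattested hypotheses.

    With random premises, a premise-driven predictor should score near 0.
    """
    return entail_rate(items, True) - entail_rate(items, False)


# Toy, made-up predictions under random premises.
toy_items = (
    [Item(True, True)] * 3 + [Item(True, False)] * 1
    + [Item(False, True)] * 1 + [Item(False, False)] * 3
)

print(attestation_gap(toy_items))
```

A positive gap on real data would mean the model says "entailed" more often simply because the hypothesis is familiar, which is the behavior the note describes.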

Here's the twist worth carrying away. The deficit is selective, not total. LLMs handle *causal* relations well because causal connectives are explicit and frequent in text, while *temporal* ordering, which often must be inferred from context, trails behind (see "Why do LLMs handle causal reasoning better than temporal reasoning?"). Pragmatics is the hard case for the same reason temporal reasoning is: the load-bearing signal is *implicit*, exactly what compression-from-text doesn't capture well. And yet, on the modeling side, LLMs fine-tuned on psychology-experiment data predict human decisions better than purpose-built cognitive theories (see "Can language models learn to model human decision making?"). So the picture splits: a model can be a strong *external predictor* of how situated humans behave while remaining a poor *internal performer* of the situated inference itself.

If you want the deepest cut, two notes reframe the whole question. One argues LLMs operationalize Saussure's *langue* (meaning as pure relational structure compressed from text, with no external referents), which would explain why situational grounding is precisely what's absent (see "Can language models learn meaning without engaging the world?"). The other, via Habermas, suggests that from the *observer's* view humans and LLMs differ categorically, but as *participants in shared discourse* they draw on the same symbolic substrate, making the difference structural rather than absolute (see "Do humans and LLMs differ fundamentally or just superficially?"). Read together, they suggest the answer isn't a flat "no" but "not the same way": LLMs infer from the relational shadow language casts, while human pragmatics is anchored in a situation the model never actually occupies.


Sources (11 notes)

Can language models adapt implicature to conversational context?

ChatGPT shows no context-sensitivity in computing scalar implicatures across three dimensions: explicit literal-mode instructions, information structure focus, and face-threatening contexts. Humans flexibly modulate these inferences; the model does not, suggesting pragmatic competence requires tracking communicative stakes that LLMs systematically miss.

Can language models recognize when text is deliberately ambiguous?

The AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. The failure spans lexical, structural, and scope ambiguity, suggesting that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.

Why do embedding contexts confuse LLM entailment predictions?

LLMs treat presupposition triggers and non-factive verbs as surface cues rather than computing their opposite semantic effects on entailments. This structural failure persists across prompts and models, suggesting models rely on surface patterns instead of structural analysis.

Do large language models genuinely simulate mental states?

ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Do LLMs predict entailment based on what they memorized?

McKenna et al. (2023) identified attestation bias: LLMs predict entailment based on whether the hypothesis appears in training data, not whether the premise actually supports it. Random premise experiments show models maintain high entailment predictions when hypotheses are attested, indicating they respond to memorized propositions rather than premise-hypothesis relationships.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Why do LLMs handle causal reasoning better than temporal reasoning?

ChatGPT excels at causal relations but struggles with temporal ordering because causal connectives are explicit and frequent in training data, while temporal order is often implicit and must be inferred contextually.

Can language models learn to model human decision making?

LLMs fine-tuned on psychology experiment data predict human behavior more accurately than theory-driven models in decision tasks, capture individual differences in their embeddings, and transfer learning across tasks without task-specific design.

Can language models learn meaning without engaging the world?

Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.

Do humans and LLMs differ fundamentally or just superficially?

Applies Habermas's observer/participant distinction to AI: from outside, humans and LLMs are utterly different; from within shared discourse, both draw on the same symbolic substrate, making the difference structural rather than absolute.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a pragmatics researcher re-testing claims about LLM situational reasoning against the latest evidence. The question remains: Can LLMs infer situational context the way humans do pragmatically?

What a curated library found — and when (findings span 2022–2026; treat as dated claims, not current truth):
• Scalar implicature: ChatGPT computes "some" → "not all" but shows zero sensitivity to contextual dials that flip the inference in human speech (2022–2023).
• Ambiguity & presupposition: GPT-4 disambiguates only 32% of cases vs. 90% for humans; models treat presupposition triggers as surface patterns, not context-flipping cues (2023).
• Root cause: LLMs reason by semantic association, not structural manipulation; strip familiar meanings and performance collapses; causal reasoning outpaces temporal (context-dependent) reasoning (2023–2025).
• Potemkin understanding: Models explain pragmatic concepts correctly, fail to apply them, and recognize the failure — triple incoherence absent in humans (2023).
• Theory of mind: Architectures forcing explicit belief-tracking beat LLMs alone; models fine-tuned on psychology data predict human decisions better than purpose-built cognitive theories, yet still perform situated inference poorly (2024–2025).

Anchor papers (verify; mind their dates):
• arXiv:2210.14986 (2022) — Goldilocks of pragmatic understanding
• arXiv:2305.14825 (2023) — In-context semantic vs. symbolic reasoning
• arXiv:2502.08796 (2025) — Systematic review: LLMs in theory of mind
• arXiv:2602.06176 (2026) — Reasoning failures

Your task:
(1) RE-TEST EACH CONSTRAINT. For scalar implicature, ambiguity recognition, presupposition-handling, and theory-of-mind tasks: has post-2025 work (new architectures, training regimes, in-context techniques, multi-agent orchestration, or interpretability tooling) relaxed or overturned the core finding that models lack situational steering? Separate the durable question (pragmatics as context-conditional inference) from the perishable limitation (current models can't do it). Cite what resolved or entrenched each gap.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Has mechanistic interpretability (2025–2026) revealed hidden context-tracking? Have scaling, retrieval augmentation, or explicit planning methods moved the needle on pragmatic tasks?
(3) Propose 2 research questions that assume the regime may have shifted: e.g., "If LLMs now track situational variables implicitly, what objective would reveal it?" or "Does multi-agent discourse (agent modeling agent) bootstrap pragmatic inference absent in single-model tasks?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
