INQUIRING LINE


Humans dial 'some means not all' up or down based on context — AI just fires the same rule every time.

How do fixed pragmatic templates prevent models from understanding context?

This explores why LLMs seem to apply fixed, surface-level pragmatic rules — for things like implicature, presupposition, and unstated background conditions — instead of flexibly reading what a given context actually calls for.

The clearest case study is scalar implicature: the everyday inference that 'some students passed' usually means 'not all.' Humans dial this inference up or down depending on the stakes of the conversation: whether literal precision is demanded, what is in focus, or whether bluntness would be socially costly. The corpus finds that ChatGPT does none of this dialing. It computes the same implicature regardless of communicative context, which suggests it has learned a default template rather than the underlying skill of tracking what a speaker means (see 'Can language models adapt implicature to conversational context?'). The template fires; the context is ignored.
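One way to make 'dialing' measurable is to compare how often the 'not all' reading is drawn in each context. The sketch below is a minimal illustration, not the method used in the cited study: the record format, the context labels, the function names, and the toy data are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record format (an assumption, not from the cited study):
# (context label, True if 'some' was read as 'not all').
Judgement = Tuple[str, bool]


def implicature_rate_by_context(judgements: Iterable[Judgement]) -> Dict[str, float]:
    """Return, per context, the share of trials that drew the 'not all' reading."""
    counts: Dict[str, List[int]] = defaultdict(lambda: [0, 0])  # [readings, trials]
    for context, drew_implicature in judgements:
        counts[context][0] += int(drew_implicature)
        counts[context][1] += 1
    return {context: hits / total for context, (hits, total) in counts.items()}


def context_spread(rates: Dict[str, float]) -> float:
    """Gap between the most and least implicature-prone contexts."""
    return max(rates.values()) - min(rates.values())


# Invented toy data: a template-like responder draws the inference everywhere.
template_like = [("neutral", True), ("literal_mode", True), ("face_threat", True)]
# Invented toy data: a responder that relaxes the inference in literal mode.
context_sensitive = [("neutral", True), ("literal_mode", False), ("face_threat", True)]

template_spread = context_spread(implicature_rate_by_context(template_like))
sensitive_spread = context_spread(implicature_rate_by_context(context_sensitive))
```

A spread of zero is what a fixed template would produce; a responder that modulates the inference shows a nonzero spread across contexts.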

The same shape shows up in how models handle presupposition. Constructions like 'X stopped doing Y' or non-factive verbs ('claimed,' 'believed') are supposed to flip what's entailed, but models read them as surface cues instead of computing their actual semantic effect. They act as systematic 'blinds' that persist across prompts and models (see 'Why do embedding contexts confuse LLM entailment predictions?'). More strikingly, when a question smuggles in a false assumption, models tend to play along and accommodate it, even when a direct factual question proves they know the assumption is wrong (see 'Why do language models accept false assumptions they know are wrong?'). The conversational template ('answer the question as posed') overrides the knowledge the model demonstrably has.
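The 'knows it, but plays along' pattern can be framed as a paired check: one direct factual question and one question that presupposes the opposite. The following is a rough sketch under stated assumptions, not the FLEX Benchmark's procedure. The `Model` type, the keyword-based rejection test, and the canned toy replies are all hypothetical; a real evaluation would need a much more careful judge.

```python
from typing import Callable, Sequence

# Any function mapping a prompt to a reply; stands in for a real model call.
Model = Callable[[str], str]

# Hypothetical rejection cues (an assumption); keyword matching is crude.
REJECTION_MARKERS = ("did not", "didn't", "never", "false premise", "actually")


def knows_but_accommodates(
    model: Model,
    direct_question: str,
    expected_fact: str,
    loaded_question: str,
    rejection_markers: Sequence[str] = REJECTION_MARKERS,
) -> bool:
    """True if the model states the fact when asked directly, yet answers a
    question built on the opposite assumption without pushing back."""
    knows = expected_fact.lower() in model(direct_question).lower()
    loaded_reply = model(loaded_question).lower()
    rejects = any(marker in loaded_reply for marker in rejection_markers)
    return knows and not rejects


# Canned replies for a toy stand-in model (invented for illustration).
canned = {
    "Who wrote Hamlet?": "William Shakespeare wrote Hamlet.",
    "Why did Charles Dickens write Hamlet?": "Dickens wrote Hamlet to explore revenge.",
}


def toy_model(prompt: str) -> str:
    return canned[prompt]


gap_found = knows_but_accommodates(
    toy_model,
    direct_question="Who wrote Hamlet?",
    expected_fact="Shakespeare",
    loaded_question="Why did Charles Dickens write Hamlet?",
)
```

Here the toy model answers the direct question correctly and still accepts the false premise, so the check flags a gap between knowing and applying.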

Lateral to this is the frame problem: the things a context leaves unstated. Models struggle not because they lack world knowledge but because they fail to bring relevant background conditions forward as constraints. When you force them to explicitly enumerate those preconditions, accuracy jumps from 30% to 85% (see 'Do language models fail at identifying unstated preconditions?'). That gap is the tell. The knowledge is there; the default response pattern just doesn't reach for it unless the prompt scaffolds the reach. Context isn't 'understood' so much as it has to be manually unpacked into the surface text.
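The scaffolding described above can be pictured as a wrapper that asks for the unstated conditions before the answer. This is a minimal sketch: the prompt wording, the function names, and the example question are assumptions, since the source does not give the exact prompt used. Only the 30% and 85% figures come from the text.

```python
def scaffold_with_preconditions(question: str, max_items: int = 5) -> str:
    """Wrap a question so that unstated background conditions are listed first.

    The wording is hypothetical; it is not the prompt from the cited work.
    """
    return (
        f"Before answering, list up to {max_items} background conditions that must "
        "hold for the situation below, even if the text does not state them.\n"
        "Then answer the question and check the answer against each condition.\n\n"
        f"Question: {question}"
    )


def percentage_point_gain(before: float, after: float) -> float:
    """Difference between two accuracies, expressed in percentage points."""
    return round((after - before) * 100, 1)


# Example question invented for illustration.
prompt = scaffold_with_preconditions("Can I boil an egg on a camping trip?", max_items=3)

# Accuracy figures as reported in the text: 30% without, 85% with enumeration.
gain = percentage_point_gain(0.30, 0.85)
```

The design point is that the scaffold adds no new knowledge; it only moves the enumeration step into the surface text, which is where the reported gain comes from.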

There's a deeper mechanism underneath all of this. Models fail to integrate context when their training priors are strong enough to dominate, and textual prompting alone can't override those priors; it takes causal intervention in the representations themselves (see 'Why do language models ignore information in their context?'). So a 'fixed template' isn't a stylistic quirk; it's the parametric default winning out over the in-context signal. A related framing calls this context collapse: when a query is underspecified, the model falls back to blended training-data priors rather than the specific situation in front of it (see 'Why do large language models produce generic responses to vague queries?').

What makes this worth knowing: the failure isn't ignorance, it's a disconnect between knowing and applying. The same incoherence appears in 'potemkin understanding,' where a model explains a concept correctly, fails to apply it, and can even recognize the failure, a pattern that points to functionally separated explanation and execution pathways (see 'Can LLMs understand concepts they cannot apply?'). Pragmatic competence requires tracking communicative stakes in real time. A template gives you the average answer for the average context, which is exactly why it looks fluent and still misses the room.

Sources (7 notes)

Can language models adapt implicature to conversational context?

ChatGPT shows no context-sensitivity in computing scalar implicatures across three dimensions: explicit literal-mode instructions, information structure focus, and face-threatening contexts. Humans flexibly modulate these inferences; the model does not, suggesting pragmatic competence requires tracking communicative stakes that LLMs systematically miss.

Why do embedding contexts confuse LLM entailment predictions?

LLMs treat presupposition triggers and non-factive verbs as surface cues rather than computing their opposite semantic effects on entailments. This structural failure persists across prompts and models, suggesting models rely on surface patterns instead of structural analysis.

Why do language models accept false assumptions they know are wrong?

The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.

Do language models fail at identifying unstated preconditions?

LLMs struggle not from lacking world knowledge but from failing to bring background conditions forward as relevant constraints. Prompting that forces explicit enumeration of preconditions raises accuracy from 30% to 85%, revealing the frame problem persists in statistical systems.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.


Why do large language models produce generic responses to vague queries?

Unlike social-media context collapse, which flattens multiple audiences, LLM collapse occurs when users provide insufficient contextual scaffolding and models default to blended training-data priors. This distinction suggests remedies should focus on query verification and user-driven context specification rather than platform controls.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: **Do fixed pragmatic templates prevent models from understanding context, or have newer training, inference, or architectural shifts dissolved this constraint?**

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2026. A library of work on pragmatic reasoning, presupposition, and context collapse identified:
- ChatGPT computes scalar implicature (e.g., 'some' → 'not all') invariantly across contexts, ignoring communicative stakes; the same template fires regardless of whether precision or politeness is demanded (~2022–2023).
- Models treat presupposition triggers and non-factive verbs as opaque surface cues rather than computing their semantic effect; they accommodate false presuppositions even when direct factual queries show they 'know' the assumption is wrong (~2023).
- Frame problem / enumeration failure: when forced to explicitly list unstated preconditions, accuracy jumps from ~30% to ~85%, suggesting knowledge exists but default patterns don't activate it unless scaffolded (~2024).
- Context collapse and prior domination: models fall back to training-data priors rather than integrating in-context signals when queries are underspecified; causal intervention in representations is needed to override parametric defaults (~2024–2025).
- Potemkin understanding: models explain concepts correctly, fail to apply them, and recognize the failure — a functional split between explanation and execution (~2024).

Anchor papers (verify; mind their dates):
- arXiv:2210.14986 (2022-10): The Goldilocks of Pragmatic Understanding
- arXiv:2305.14785 (2023-05): Simple Linguistic Inferences — Blind Spots and Blinds
- arXiv:2404.01869 (2024-04): Beyond Accuracy — Reasoning Behavior Survey
- arXiv:2506.08952 (2025-06): Can LLMs Ground when they (Don't) Know

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For scalar implicature, presupposition, and frame enumeration: has test-time compute scaling (e.g., latent reasoning, chain-of-thought depth), retrieval-augmented generation, sparse autoencoders for representational control, or recent architectural changes (e.g., recursive / recurrent depths, new attention variants) since relaxed or overturned these failures? Separate the durable question (how pragmatic flexibility arises) from the perishable limitation (current models can't dial it). Cite what shifted it, and state plainly where the constraint still holds.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last ~6 months. Look for papers showing that context *does* integrate in presupposition/implicature tasks, or that alignment methods (conversational, multi-agent orchestration, or knowledge-grounding) have restored pragmatic flexibility.
(3) **Propose 2 research questions that ASSUME the regime may have moved.** E.g., if test-time compute or sparse-ae interventions have partly solved this: (a) what is the *cost* of that solution at scale, and (b) what *new* pragmatic failure emerges when you push to higher-order implicature or mixed communicative goals?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
