INQUIRING LINE

Words like 'stopped' or 'realized' sneak hidden assumptions into sentences — and AI models reliably miss them, breaking their logic.

How do embedding contexts like presupposition triggers affect LLM entailment reasoning?

This line of inquiry asks why certain linguistic frames throw off an LLM's ability to judge what logically follows from a sentence, and what that failure reveals about how these models reason at all. The frames in question are presupposition triggers such as *stopped* and *realized*, which quietly carry hidden assumptions, and non-factive verbs such as *believes*, which do not commit the speaker to the truth of the embedded clause. The short version: models treat these words as surface cues rather than computing their real semantic effect, and the same pattern of 'reading the words, not the structure' shows up across the whole corpus.

The central finding is that presupposition triggers and non-factive verbs act as *embedding blinds*: the model sees the trigger word and pattern-matches to a default inference instead of working out the opposite effects these embedding contexts should have on entailment (see: Why do embedding contexts confuse LLM entailment predictions?). That's not a one-off glitch. Models also fail to reject false presuppositions even when they demonstrably know the correct fact. GPT-4 pushes back only 84% of the time and Mistral a startling 2.44%, because a smuggled-in assumption drives more acceptance than true knowledge drives rejection (see: Why do language models accept false assumptions they know are wrong?). And some presuppositions don't live in the trigger word at all. They emerge from the flow of conversation, which requires tracking the question under discussion rather than matching trigger to inference, and that is a kind of reasoning trigger-based pattern matching cannot supply (see: Do language models miss presuppositions that arise from context?).

What makes this more than a niche linguistics problem is *why* it happens. Entailment judgments turn out to lean heavily on whether the hypothesis looks familiar: the 'attestation bias' identified by McKenna et al. (2023) shows models predict entailment based on whether the conclusion appears memorized from training, not on whether the premise supports it (see: Do LLMs predict entailment based on what they memorized?). Strip the familiar semantics out of a reasoning task and performance collapses even when the correct rules are sitting right there in the prompt, because these systems reason through semantic association, not symbolic manipulation (see: Do large language models reason symbolically or semantically?). Presupposition triggers are simply a clean place to catch that surface bias in the act.

The corpus also frames this as one face of a deeper structural blindness. The same models systematically misread embedded clauses and complex nominals, and get predictably worse as syntactic depth increases (see: Why do large language models fail at complex linguistic tasks?). They also fail to bring *unstated* preconditions forward as relevant, a modern echo of the AI 'frame problem' (see: Do language models fail at identifying unstated preconditions?). The hopeful thread: forcing the implicit to become explicit helps a lot. Making models enumerate hidden preconditions lifts accuracy from 30% to 85% (same note), and structured argument prompts that force a model to check its warrants catch failures that plain chain-of-thought lets slide (see: Can structured argument prompts make LLM reasoning more rigorous?).

The thing you didn't know you wanted to know: an LLM failing on the word *stopped* isn't a vocabulary gap — it's the same mechanism that makes it accept a false premise it could refute, and the same one that makes it guess 'entailed' because the conclusion sounds memorized. Presupposition is just where the surface-pattern habit becomes visible enough to measure.

Sources 8 notes

Why do embedding contexts confuse LLM entailment predictions?

LLMs treat presupposition triggers and non-factive verbs as surface cues rather than computing their opposite semantic effects on entailments. This structural failure persists across prompts and models, suggesting models rely on surface patterns instead of structural analysis.
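The contrast between the two kinds of embedding can be made concrete with a small probe set. The sketch below is illustrative only: the verb lists, the `Item` type, and the `surface_cue_baseline` function are assumptions made for this example, not the benchmark behind this note. The gold labels follow the textbook treatment, in which a factive verb presupposes its complement even under negation while a non-factive verb never commits to it.

```python
from dataclasses import dataclass

# (past tense, base form) pairs. These verb lists are illustrative assumptions.
FACTIVE = [("realized", "realize"), ("knew", "know")]
NON_FACTIVE = [("believed", "believe"), ("suspected", "suspect")]


@dataclass(frozen=True)
class Item:
    premise: str
    hypothesis: str
    gold: str  # "entailment" or "neutral"


def build_items(subject: str, clause: str) -> list[Item]:
    """Embed one clause under each verb, in affirmative and negated form."""
    hypothesis = clause[0].upper() + clause[1:] + "."
    items = []
    for verbs, gold in ((FACTIVE, "entailment"), (NON_FACTIVE, "neutral")):
        for past, base in verbs:
            # Factive presuppositions survive negation; non-factive verbs
            # leave the truth of the clause open in both polarities.
            items.append(Item(f"{subject} {past} that {clause}.", hypothesis, gold))
            items.append(Item(f"{subject} did not {base} that {clause}.", hypothesis, gold))
    return items


def surface_cue_baseline(item: Item) -> str:
    """Hypothetical stand-in for surface matching: says 'entailment'
    whenever the hypothesis text appears inside the premise."""
    needle = item.hypothesis.rstrip(".").lower()
    return "entailment" if needle in item.premise.lower() else "neutral"


def accuracy(items, predict) -> float:
    return sum(predict(i) == i.gold for i in items) / len(items)


items = build_items("Sam", "the door was locked")
baseline_accuracy = accuracy(items, surface_cue_baseline)
```

On this toy set the surface baseline gets every factive item right and every non-factive item wrong, which is the signature to look for when swapping in a real model's predictions for `surface_cue_baseline`.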

Why do language models accept false assumptions they know are wrong?

The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.
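The comparison described here is a conditional rate: among the cases where a direct question shows the model knows the fact, how often does it reject the false presupposition? A minimal sketch of that calculation follows. The `Trial` record and the toy data are hypothetical and are not taken from the benchmark.

```python
from typing import Iterable, NamedTuple


class Trial(NamedTuple):
    knows_fact: bool  # model answered the direct knowledge question correctly
    rejected: bool    # model pushed back on the false presupposition


def rejection_rate_given_knowledge(trials: Iterable[Trial]) -> float:
    """Share of false presuppositions rejected, counting only trials
    where the model demonstrably knew the correct fact."""
    informed = [t for t in trials if t.knows_fact]
    if not informed:
        raise ValueError("no trials where the model knew the fact")
    return sum(t.rejected for t in informed) / len(informed)


# Toy data for illustration only; these are not measured results.
toy_trials = [
    Trial(knows_fact=True, rejected=True),
    Trial(knows_fact=True, rejected=False),
    Trial(knows_fact=False, rejected=False),
    Trial(knows_fact=True, rejected=False),
    Trial(knows_fact=True, rejected=True),
]
toy_rate = rejection_rate_given_knowledge(toy_trials)
```

Filtering on `knows_fact` first is what separates a knowledge gap from accommodation: a low rate here cannot be explained by the model simply not knowing the answer.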

Do language models miss presuppositions that arise from context?

LLMs learn statistical associations between trigger words and inferences, but presuppositions also arise through accommodation—updating context to resolve discourse mismatches. Models miss these because they require tracking questions under discussion, not pattern matching.

Do LLMs predict entailment based on what they memorized?

McKenna et al. (2023) identified attestation bias: LLMs predict entailment based on whether the hypothesis appears in training data, not whether the premise actually supports it. Random premise experiments show models maintain high entailment predictions when hypotheses are attested, proving they respond to memorized propositions rather than premise-hypothesis relationships.
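The logic of a random-premise probe can be sketched as follows: replace each premise with an unrelated one, then compare how often the predictor still says 'entailed' for attested versus unattested hypotheses. Everything here is an assumption made for illustration, including the `Pair` type, the rotation used as a deterministic stand-in for random sampling, and the deliberately biased `memorizing_stub` predictor. It is not the procedure or data from McKenna et al.

```python
from typing import Callable, NamedTuple


class Pair(NamedTuple):
    premise: str
    hypothesis: str
    attested: bool  # hypothesis assumed to look familiar from training data


def swap_premises(pairs: list[Pair]) -> list[Pair]:
    """Give every hypothesis a premise taken from a different pair.
    Rotating by one is a deterministic stand-in for random sampling."""
    premises = [p.premise for p in pairs]
    rotated = premises[1:] + premises[:1]
    return [Pair(new, p.hypothesis, p.attested) for new, p in zip(rotated, pairs)]


def entailment_rate_by_attestation(
    pairs: list[Pair], predict: Callable[[str, str], bool]
) -> dict[str, float]:
    """Fraction of 'entailed' predictions, split by attestation."""
    rates = {}
    for label, flag in (("attested", True), ("unattested", False)):
        group = [p for p in pairs if p.attested is flag]
        if group:
            hits = sum(predict(p.premise, p.hypothesis) for p in group)
            rates[label] = hits / len(group)
    return rates


pairs = [
    Pair("Paris is the capital of France.", "Paris is in France.", True),
    Pair("Blorp owns the red kite.", "Blorp has a kite.", False),
    Pair("Ice forms when water freezes.", "Ice is frozen water.", True),
    Pair("Zed sold three lamps.", "Zed sold something.", False),
]

MEMORIZED = {p.hypothesis for p in pairs if p.attested}


def memorizing_stub(premise: str, hypothesis: str) -> bool:
    # Hypothetical biased predictor: ignores the premise entirely.
    return hypothesis in MEMORIZED


rates = entailment_rate_by_attestation(swap_premises(pairs), memorizing_stub)
```

With the premises scrambled, no pair is genuinely supported, so any predictor that still returns a high rate for the attested group is responding to the hypothesis alone. Replacing `memorizing_stub` with a call to a real model would turn the sketch into an actual probe.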

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Do language models fail at identifying unstated preconditions?

LLMs struggle not from lacking world knowledge but from failing to bring background conditions forward as relevant constraints. Prompting that forces explicit enumeration of preconditions raises accuracy from 30% to 85%, revealing the frame problem persists in statistical systems.

Can structured argument prompts make LLM reasoning more rigorous?

Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question: **Do embedding contexts like presupposition triggers fundamentally limit LLM entailment reasoning, or have newer models, training methods, or prompting architectures since mid-2024 relaxed these constraints?**

What a curated library found — and when (dated claims, not current truth):

Findings span 2023–2025; treat as perishable baseline.

• Presupposition triggers act as *embedding blinds*: models pattern-match to surface cues rather than compute semantic effects of embedding contexts (2023–2025).
• Rejection of false presuppositions remains weak across models—GPT-4 succeeds ~84% of the time; Mistral only 2.44%—because smuggled assumptions override factual knowledge (~2025).
• LLM entailment judgments lean on *attestation bias*: models predict entailment based on hypothesis familiarity/memorization, not premise-to-conclusion support (2024–2025).
• Explicit enumeration of hidden preconditions lifts accuracy from ~30% to ~85%; structured argumentative prompts (CQoT-style) catch failures chain-of-thought misses (~2024–2025).
• Systematic linguistic blind spots worsen predictably with syntactic depth; models fail to surface unstated preconditions (modern frame problem) (2023–2025).

Anchor papers (verify; mind their dates):
- arXiv:2305.14785 (2023-05): Simple Linguistic Inferences of LLMs: Blind Spots and Blinds
- arXiv:2305.14825 (2023-05): LLMs are In-Context Semantic Reasoners rather than Symbolic Reasoners
- arXiv:2412.15177 (2024-12): Critical-Questions-of-Thought: Argumentative Querying
- arXiv:2505.22354 (2025-05): LLMs Struggle to Reject False Presuppositions

Your task:

(1) **RE-TEST EACH CONSTRAINT.** For presupposition blindness, attestation bias, and frame-problem failures: Has instruction-tuning (e.g., Llama 3.1+), reasoning-layer training (o1-style latent-space reasoning per arXiv:2412.06769), or long-context + retrieval (per arXiv:2406.13121) since mid-2024 *relaxed* these? Distinguish durable question (do embeddings still pose structural challenges?) from perishable claim (specific rejection %s, exact accuracy floors). Cite what relaxed it.

(2) **Surface the strongest contradicting or superseding work from the last ~6 months.** Has any paper shown that scaled models, chain-of-reasoning training, or grounding mechanisms (arXiv:2506.08952) have overturned the attestation-bias or embedding-blind findings?

(3) **Propose 2 research questions that assume the regime may have moved:**
   - If presupposition blindness persists despite scaling, what architectural change (symbolic module, external knowledge graph, multi-step grounding) would address it?
   - Can structured prompting (CQoT, argumentation schemes) now **suppress** attestation bias systematically across diverse domains, or does it remain brittle to distribution shift?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
