SYNTHESIS NOTE

Why do embedding contexts confuse LLM entailment predictions?

Can language models distinguish between contexts that preserve versus cancel entailments? The study explores whether LLMs systematically fail to apply the semantic rules governing presupposition triggers and non-factive verbs.

Synthesis note · 2026-02-21 · sourced from Natural Language Inference

"Simple Linguistic Inferences of LLMs" targets inferences humans find trivial — grammatically-specified entailments ("You've eaten all my apples" entails "Someone ate something"), evidential adverbs of uncertainty ("allegedly" cancels the entailment of the clause), and monotonicity entailments (specific→general). LLMs show moderate-to-low performance on all three.

But the more revealing finding is what happens when the premise is embedded in grammatical contexts. Two types of embedding contexts should have opposite effects:

Presupposition triggers (factive verbs: "realized that", "regret that"; temporal clauses: "before X"): embedding under these should not change the original entailment relations — the premise's entailments are preserved because presuppositions project through these contexts.
Non-factive verbs (believe, imagine, suspect, feel): embedding under these should cancel entailments — "I suspect a balloon hit a light post" no longer entails "something hit a light post."
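
The expected behavior of the two context types can be written down as a small labeling rule. The sketch below is illustrative only: the sentence frames, the `Item` type, and the `embed` helper are assumptions made for this note, not the paper's dataset or code. It takes a premise that entails a hypothesis when unembedded and produces the label a semantically competent reader should assign under each embedding.

```python
from dataclasses import dataclass

# Hypothetical frames: the verbs come from the note, the sentence shapes are invented here.
FACTIVE_FRAMES = ["She realized that {p}", "They regret that {p}"]
NON_FACTIVE_FRAMES = ["I believe {p}", "I suspect {p}", "I imagine {p}"]


@dataclass(frozen=True)
class Item:
    premise: str
    hypothesis: str
    expected: str  # "entailment" or "not_entailment"


def embed(premise: str, hypothesis: str) -> list[Item]:
    """Build embedded variants, assuming `premise` entails `hypothesis` when unembedded."""
    # Naive lowercasing of the first letter; would be wrong for a proper-noun subject.
    clause = premise[0].lower() + premise[1:].rstrip(".")
    items = [Item(premise, hypothesis, "entailment")]
    for frame in FACTIVE_FRAMES:
        # Presuppositions project: the embedded clause is still taken to be true.
        items.append(Item(frame.format(p=clause) + ".", hypothesis, "entailment"))
    for frame in NON_FACTIVE_FRAMES:
        # Non-factives do not commit the speaker to the clause, so the entailment is cancelled.
        items.append(Item(frame.format(p=clause) + ".", hypothesis, "not_entailment"))
    return items


items = embed("A balloon hit a light post.", "Something hit a light post.")
for item in items:
    print(f"{item.expected:15} {item.premise}")
```

The point of the design is that the hypothesis never changes. Only the embedding frame varies, so any change in a model's prediction across rows is attributable to the frame.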

The LLMs tested do not make this discrimination. ChatGPT in regular prompting mode treats both presupposition triggers and non-factives as hints toward entailment. In chain-of-thought mode, it treats both as hints against entailment. In either mode the two context types push predictions in the same direction, when they should push in opposite directions. The embedding context overwhelms the semantics of the embedded content, acting as a "blind" that masks the relevant inferential relationships.
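
One way to make this failure pattern measurable is to compare how often a model predicts entailment under each context type. The following is a minimal sketch under stated assumptions: the `discrimination_gap` metric, the example sentences, and the three stub predictors are hypothetical stand-ins written for this note, not the paper's evaluation protocol or results. A model that applies the semantics correctly should score a gap near 1.0; a model that treats both contexts alike scores near 0.0, whichever direction it leans.

```python
from collections import defaultdict


def entailment_rate_by_context(examples, predict):
    """Fraction of 'entailment' predictions per context type.

    `examples` holds (context, premise, hypothesis) triples; `predict` is any
    callable returning "entailment" or "not_entailment".
    """
    counts = defaultdict(lambda: [0, 0])  # context -> [entailment predictions, total]
    for context, premise, hypothesis in examples:
        counts[context][0] += predict(premise, hypothesis) == "entailment"
        counts[context][1] += 1
    return {context: hits / total for context, (hits, total) in counts.items()}


def discrimination_gap(rates):
    """Hypothetical metric: factive entailment rate minus non-factive entailment rate."""
    return rates["factive"] - rates["non_factive"]


# Illustrative items only; every premise embeds a clause that entails the hypothesis.
HYPOTHESIS = "Something hit a light post."
EXAMPLES = [
    ("factive", "She realized that a balloon hit a light post.", HYPOTHESIS),
    ("factive", "They regret that a balloon hit a light post.", HYPOTHESIS),
    ("non_factive", "I suspect a balloon hit a light post.", HYPOTHESIS),
    ("non_factive", "I believe a balloon hit a light post.", HYPOTHESIS),
]


# Stub predictors standing in for model behavior; none of these calls a real model.
def always_entail(premise, hypothesis):
    return "entailment"  # both contexts read as hints toward entailment


def never_entail(premise, hypothesis):
    return "not_entailment"  # both contexts read as hints against entailment


def oracle(premise, hypothesis):
    factive = premise.startswith(("She realized", "They regret"))
    return "entailment" if factive else "not_entailment"


for name, predictor in [("always", always_entail), ("never", never_entail), ("oracle", oracle)]:
    rates = entailment_rate_by_context(EXAMPLES, predictor)
    print(name, rates, discrimination_gap(rates))
```

The two degenerate stubs have opposite entailment rates but the same gap of zero, which is why the gap, rather than raw accuracy on either context alone, is the quantity that exposes the blind.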

This is a different kind of failure from general reasoning difficulty: these are structural failures where syntactic packaging overrides semantic content. The model responds to the mere presence of an embedding verb as a surface cue rather than computing the effect that verb's type (factive vs. non-factive) has on the entailment relation. This is precisely the pattern that the related note "Can models pass tests while missing the actual grammar?" predicts: surface cues substituting for structural analysis.

The persistence of this pattern across multiple prompts and LLMs indicates that it is systematic, not incidental: "a systematic issue" in the paper's words.

Inquiring lines that read this note 39

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

How do training priors constrain what context information can override?

How do language models establish social grounding in human dialogue?

Can LLMs infer situational context the way humans do pragmatically?

Do language models understand semantics or rely on pattern matching?

Why do language models struggle with implicit discourse relations?

How should retrieval systems optimize for multi-step reasoning during inference?

How do entailment checks prevent synthetic data from degrading retrieval corpora?

How do language models inherit human biases from training data?

How does removing a spurious cue change LLM performance?

Do language models perform faithful symbolic reasoning independent of semantic grounding?

Why do language models reinforce false assumptions instead of correcting them?

Do accurate-looking LLM outputs hide structural failures in learning and reasoning?

What distinguishes entity errors from relation errors in LLM output?

Is embodied interaction necessary for language meaning and genuine agency?

How does frame selection differ from frame application in meaning-making?

What makes dialogue-based explanation more successful than monologue?

How does the Question Under Discussion shape what counts as presupposed?

What mechanisms enable AI systems to generate and spread false beliefs?

Why do non-factive verbs and triggers both fool language models?

Can prompting strategies overcome LLM biases without model fine-tuning?

How do structured prompts force LLMs to check for contradictions in evidence?

Can AI-generated outputs constitute genuine knowledge or valid claims?

Why does context work differently in AI than in conventional software?

Why do reasoning models fail at systematic problem-solving and search?

How do dependency errors propagate through incorrectly formalized definitions?

Related concepts in this collection 3

Can models pass tests while missing the actual grammar?
Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
same mechanism: surface context cues substituting for structural computation

Does LLM grammatical performance decline with structural complexity?
This explores whether LLMs fail uniformly at grammar or whether their failures follow a predictable pattern tied to input complexity. Understanding the relationship matters for deciding when LLM annotations are reliable.
embedding contexts add structural complexity; this is another specific complexity type that causes systematic failure

Why does ChatGPT fail at implicit discourse relations?
ChatGPT excels when discourse connectives are present but drops to 24% accuracy without them. What does this gap reveal about how LLMs actually process meaning and logical relationships?
parallel structure: surface markers (connectives, embedding verbs) override deeper semantic computation

Original note title

presupposition triggers and non-factive verbs are embedding blinds that systematically miscalibrate llm entailment predictions
