What specific linguistic features cause LLMs to fail at trivial entailment?
This explores which concrete features of a sentence — not abstract "difficulty" — flip an LLM's entailment judgment, even when the logic is dead simple.
The corpus points to a handful of specific culprits, and they cluster around one theme: features whose meaning depends on *structure* rather than *surface words*.
The sharpest example is non-factive verbs and presupposition triggers. Words like "believe," "claim," or "pretend" change whether a premise actually supports a conclusion: "She pretended to leave" does not entail "She left." LLMs tend to read these as ordinary surface cues and miss that they flip the entailment, a pattern that holds across prompts and models (see "Why do embedding contexts confuse LLM entailment predictions?"). A related failure: when a sentence smuggles in a false assumption ("When did you stop X-ing?"), models accommodate it rather than reject it, even when a direct question proves they know the assumption is false (see "Why do language models accept false assumptions they know are wrong?").
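The factive versus non-factive contrast described above can be turned into minimal-pair probes. The following is a minimal sketch, not a method taken from the source notes: the verb lists, the `Probe` class, and `build_probes` are all hypothetical names chosen for illustration, and the gold labels assume the standard reading in which "knew that P" commits the speaker to P while "pretended that P" does not.

```python
from dataclasses import dataclass

# Illustrative verb lists (an assumption of this sketch, not from the source notes).
FACTIVE = ("knew", "realized")
NON_FACTIVE = ("believed", "claimed", "pretended")


@dataclass(frozen=True)
class Probe:
    premise: str
    hypothesis: str
    gold: str  # "entailment" or "non-entailment"


def build_probes(subject: str, complement: str) -> list[Probe]:
    """Build premise/hypothesis pairs that differ only in the embedding verb."""
    # The hypothesis is the bare complement clause, capitalized as a sentence.
    hypothesis = complement[0].upper() + complement[1:] + "."
    labeled_verbs = [(v, "entailment") for v in FACTIVE]
    labeled_verbs += [(v, "non-entailment") for v in NON_FACTIVE]
    return [
        Probe(f"{subject} {verb} that {complement}.", hypothesis, gold)
        for verb, gold in labeled_verbs
    ]


if __name__ == "__main__":
    for probe in build_probes("She", "the door was locked"):
        print(f"{probe.gold:15} {probe.premise} => {probe.hypothesis}")
```

Because every pair shares the same surface words apart from the verb, a model that gives the same answer across the whole set is responding to the shared words rather than to the verb's effect on the entailment.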
A second cluster is syntactic embedding. Embedded clauses, complex nominals, and recursive structure degrade performance predictably as depth increases: simple sentences are handled fine, nested ones break consistently (see "Why do large language models fail at complex linguistic tasks?" and "Does LLM grammatical performance decline with structural complexity?"). The breakdown isn't random; it scales with structure, which is the tell that models learned surface heuristics instead of grammatical rules (see "Where exactly do language models fail at structural language tasks?").
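One way to check whether accuracy scales with structure is to hold the core sentence fixed and vary only the embedding depth. The sketch below is an illustration under stated assumptions: `embed`, `accuracy_by_depth`, and the frame list are hypothetical, and the frames use factive verbs on purpose so that the core sentence stays entailed at every depth and any drop in accuracy can be attributed to depth rather than to a change in the gold label.

```python
from collections import defaultdict

# Factive frames (an assumption of this sketch) keep the core entailed at any depth.
FRAMES = ("Alice knew that", "Bob realized that", "Carol noticed that")


def embed(core: str, depth: int) -> str:
    """Wrap `core` in `depth` layers of clausal embedding."""
    sentence = core
    for level in range(depth):
        sentence = f"{FRAMES[level % len(FRAMES)]} {sentence}"
    return sentence[0].upper() + sentence[1:] + "."


def accuracy_by_depth(results: list[tuple[int, bool]]) -> dict[int, float]:
    """Aggregate (depth, was_correct) records into accuracy per depth."""
    totals = defaultdict(lambda: [0, 0])  # depth -> [correct, seen]
    for depth, correct in results:
        totals[depth][0] += int(correct)
        totals[depth][1] += 1
    return {d: hits / seen for d, (hits, seen) in sorted(totals.items())}


if __name__ == "__main__":
    for depth in range(4):
        print(depth, embed("the cat slept", depth))
```

Feeding each generated premise to a model with the hypothesis "The cat slept." and passing the per-item outcomes to `accuracy_by_depth` would give a depth-versus-accuracy curve; a steady decline would match the pattern the notes describe.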
But the corpus suggests the deepest cause sits underneath the linguistics: what looks like an entailment failure is often a memory-versus-reasoning failure. McKenna et al. found "attestation bias": models predict entailment based on whether the *hypothesis* appears in training data, not whether the premise supports it. Feed them a random, irrelevant premise and they still say "entails" as long as the conclusion sounds familiar (see "Do LLMs predict entailment based on what they memorized?"). That's why decoupling meaning from logic collapses performance: models reason through semantic association, not symbolic manipulation (see "Do large language models reason symbolically or semantically?").
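The random-premise idea can be sketched as a small harness. This is a simplified illustration, not the protocol McKenna et al. used: `random_premise_rate` and the `predict` callback are hypothetical, and `predict` stands in for whatever model call returns the string "entailment" or "non-entailment".

```python
import random
from typing import Callable

Predictor = Callable[[str, str], str]  # (premise, hypothesis) -> label


def random_premise_rate(
    pairs: list[tuple[str, str]], predict: Predictor, seed: int = 0
) -> float:
    """Share of hypotheses still labeled 'entailment' after each one is
    paired with a premise drawn from a different item."""
    if len(pairs) < 2:
        raise ValueError("need at least two pairs to swap premises")
    rng = random.Random(seed)
    premises = [premise for premise, _ in pairs]
    hits = 0
    for i, (_, hypothesis) in enumerate(pairs):
        # Pick any premise except the one that originally supported this hypothesis.
        j = rng.choice([k for k in range(len(pairs)) if k != i])
        if predict(premises[j], hypothesis) == "entailment":
            hits += 1
    return hits / len(pairs)


if __name__ == "__main__":
    data = [
        ("A dog barked loudly.", "A dog barked."),
        ("Rain fell all night.", "Rain fell."),
        ("The train left early.", "The train left."),
    ]
    # Stub that ignores the premise entirely, mimicking a hypothesis-only predictor.
    print(random_premise_rate(data, lambda premise, hypothesis: "entailment"))
```

A predictor that actually uses the premise should score near zero on mismatched pairs, so a high rate here is a sign that the hypothesis alone is driving the prediction.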
The surprising twist, and the thing worth taking away, is that "trivial" is doing real work in the question. On *multi-hop* reasoning across long contexts, LLMs can beat humans; it's the short, one-step deductions where they fall behind (see "Why do LLMs fail at simple deductive reasoning?"). So the features that break entailment aren't the complicated ones. They're the small structural hinges (a non-factive verb, a buried presupposition, an unstated precondition the model never enumerates; see "Do language models fail at identifying unstated preconditions?") where surface familiarity quietly overrides logic.
Sources (9 notes)
LLMs treat presupposition triggers and non-factive verbs as surface cues rather than computing their opposite semantic effects on entailments. This structural failure persists across prompts and models, suggesting models rely on surface patterns instead of structural analysis.
The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.
Language models excel with explicit discourse markers and simple grammar but fail predictably on implicit relations, embedded structures, and forward-planning discourse. These breakdowns map to failures in discourse intentionality and attention layers, not just linguistic surface structure.
McKenna et al. (2023) identified attestation bias: LLMs predict entailment based on whether the hypothesis appears in training data, not whether the premise actually supports it. Random premise experiments show models maintain high entailment predictions when hypotheses are attested, proving they respond to memorized propositions rather than premise-hypothesis relationships.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
The Minds vs. Machines benchmark shows LLMs excel at integrating information across multiple sentences while humans outperform them on straightforward logical inference. Capability type, not complexity level, determines who performs better.
LLMs struggle not from lacking world knowledge but from failing to bring background conditions forward as relevant constraints. Prompting that forces explicit enumeration of preconditions raises accuracy from 30% to 85%, revealing the frame problem persists in statistical systems.