INQUIRING LINE

What specific linguistic features cause LLMs to fail at trivial entailment?

This explores which concrete features of a sentence — not abstract "difficulty" — flip an LLM's entailment judgment, even when the logic is dead simple.


The corpus points to a handful of specific culprits, and they cluster around one theme: features whose meaning depends on *structure* rather than *surface words*.

The sharpest example is non-factive verbs and presupposition triggers. Words like "believe," "claim," or "pretend" change whether a premise actually supports a conclusion: "She pretended to leave" does not entail "She left." LLMs tend to read these as ordinary surface cues and miss that they flip the entailment, a pattern that holds across prompts and models (see "Why do embedding contexts confuse LLM entailment predictions?"). A related failure: when a sentence smuggles in a false assumption ("When did you stop X-ing?"), models accommodate it rather than reject it, even when a direct question proves they know the assumption is false (see "Why do language models accept false assumptions they know are wrong?").
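To make the factive versus non-factive contrast concrete, here is a minimal Python sketch of a minimal-pair probe. Everything in it is an illustrative assumption rather than something taken from the cited notes: the verb lists, the `Pair` type, the label names, and the `surface_overlap_baseline` predictor, which stands in for a model that judges entailment by word overlap alone.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumed verb lists for illustration only; not drawn from any benchmark.
FACTIVE = ["knew", "realized"]                      # embedded clause is taken to be true
NON_FACTIVE = ["believed", "claimed", "pretended"]  # embedded clause is not entailed


@dataclass(frozen=True)
class Pair:
    premise: str
    hypothesis: str
    gold: str        # "entailment" or "not_entailment"
    verb_class: str  # "factive" or "non_factive"


def build_pairs(subject: str, clause: str) -> List[Pair]:
    """Build premises that differ only in the embedding verb."""
    hypothesis = clause[0].upper() + clause[1:] + "."
    pairs = []
    for verb in FACTIVE:
        pairs.append(Pair(f"{subject} {verb} that {clause}.", hypothesis,
                          "entailment", "factive"))
    for verb in NON_FACTIVE:
        pairs.append(Pair(f"{subject} {verb} that {clause}.", hypothesis,
                          "not_entailment", "non_factive"))
    return pairs


def _words(text: str) -> set:
    return set(text.lower().replace(".", "").split())


def surface_overlap_baseline(premise: str, hypothesis: str) -> str:
    """Toy predictor: says 'entailment' whenever the hypothesis words all occur in the premise."""
    return "entailment" if _words(hypothesis) <= _words(premise) else "not_entailment"


def accuracy_by_class(predict: Callable[[str, str], str],
                      pairs: List[Pair]) -> Dict[str, float]:
    """Accuracy split by verb class, so a surface-cue failure shows up as a gap."""
    hits: Dict[str, int] = {}
    totals: Dict[str, int] = {}
    for p in pairs:
        totals[p.verb_class] = totals.get(p.verb_class, 0) + 1
        hits[p.verb_class] = hits.get(p.verb_class, 0) + (predict(p.premise, p.hypothesis) == p.gold)
    return {k: hits[k] / totals[k] for k in totals}


pairs = build_pairs("She", "the door was locked")
scores = accuracy_by_class(surface_overlap_baseline, pairs)
print(scores)  # {'factive': 1.0, 'non_factive': 0.0}
```

The overlap baseline is right on every factive premise and wrong on every non-factive one, which is the signature described above. Swapping in a real model call for `surface_overlap_baseline` would show whether that model shares the gap.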

A second cluster is syntactic embedding. Embedded clauses, complex nominals, and recursive structure degrade performance predictably as depth increases: simple sentences are handled fine, nested ones break consistently (see "Why do large language models fail at complex linguistic tasks?" and "Does LLM grammatical performance decline with structural complexity?"). The breakdown isn't random; it scales with structure, which suggests that models learned surface heuristics instead of grammatical rules (see "Where exactly do language models fail at structural language tasks?").
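One way to check whether failures scale with embedding depth is to hold the entailment constant and vary only the nesting. The Python sketch below is an assumption-laden illustration, not a reproduction of any cited benchmark: `nested_item`, the filler nouns, and the `window_heuristic` toy predictor (which only "sees" a subject and predicate that sit close together) are all hypothetical.

```python
from typing import Callable, Dict, Tuple

FILLERS = ["box", "bag", "crate", "trunk"]  # hypothetical filler nouns


def nested_item(depth: int) -> Tuple[str, str]:
    """Premise with `depth` relative clauses between subject and main verb.

    The hypothesis is the bare main clause, so the gold label is
    'entailment' at every depth.
    """
    clauses = "".join(f" that was in the {FILLERS[i % len(FILLERS)]}" for i in range(depth))
    return f"The key{clauses} was lost.", "The key was lost."


def window_heuristic(premise: str, hypothesis: str, window: int = 8) -> str:
    """Toy predictor: links subject and predicate only if they are within `window` tokens."""
    p = premise.lower().rstrip(".").split()
    h = hypothesis.lower().rstrip(".").split()
    subject, predicate = h[1], h[-1]
    gap = p.index(predicate) - p.index(subject)
    return "entailment" if 0 < gap <= window else "not_entailment"


def correct_by_depth(predict: Callable[[str, str], str], max_depth: int) -> Dict[int, bool]:
    """One item per depth for brevity; True means the predictor found the entailment."""
    results = {}
    for depth in range(max_depth + 1):
        premise, hypothesis = nested_item(depth)
        results[depth] = predict(premise, hypothesis) == "entailment"
    return results


results = correct_by_depth(window_heuristic, 3)
print(results)  # {0: True, 1: True, 2: False, 3: False}
```

The toy predictor succeeds on shallow sentences and fails once the relative clauses push the main verb out of its window. A real evaluation would use many items per depth and a real model in place of `window_heuristic`.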

But the corpus suggests the deepest cause sits underneath the linguistics: what looks like an entailment failure is often a memory-versus-reasoning failure. McKenna et al. found "attestation bias": models predict entailment based on whether the *hypothesis* appears in training data, not whether the premise supports it. Feed them a random, irrelevant premise and they still say "entails" as long as the conclusion sounds familiar (see "Do LLMs predict entailment based on what they memorized?"). That's why decoupling meaning from logic collapses performance: models reason through semantic association, not symbolic manipulation (see "Do large language models reason symbolically or semantically?").
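The random-premise idea can be sketched in a few lines of Python. This is not McKenna et al.'s protocol, only a minimal illustration of the logic: pair each hypothesis with a premise from a different item and see how often the predictor still answers "entailment". The function name `random_premise_rate`, the toy sentence pairs, the `ATTESTED` set (a stand-in for "seen in training data"), and both toy predictors are assumptions made for the example.

```python
import random
from typing import Callable, List, Tuple

Predictor = Callable[[str, str], str]


def random_premise_rate(predict: Predictor,
                        pairs: List[Tuple[str, str]],
                        seed: int = 0) -> float:
    """Share of 'entailment' predictions after every hypothesis gets a foreign premise.

    Rotating by a non-zero offset guarantees no hypothesis keeps its own premise.
    """
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two pairs to swap premises")
    offset = random.Random(seed).randrange(1, n)
    hits = 0
    for i, (_, hypothesis) in enumerate(pairs):
        foreign_premise = pairs[(i + offset) % n][0]
        hits += predict(foreign_premise, hypothesis) == "entailment"
    return hits / n


# Toy (premise, hypothesis) items; each premise genuinely supports its own hypothesis.
PAIRS = [
    ("Paris is the capital of France and it is large.", "Paris is the capital of France."),
    ("Water boils at 100 degrees Celsius at sea level.", "Water boils at 100 degrees Celsius."),
    ("The sun rises in the east every morning.", "The sun rises in the east."),
]

ATTESTED = {h for _, h in PAIRS}  # hypothetical stand-in for memorized propositions


def memorizing_predictor(premise: str, hypothesis: str) -> str:
    """Ignores the premise entirely; answers from 'memory'."""
    return "entailment" if hypothesis in ATTESTED else "not_entailment"


def _words(text: str) -> set:
    return set(text.lower().replace(".", "").split())


def premise_sensitive_predictor(premise: str, hypothesis: str) -> str:
    """Answers 'entailment' only if the premise contains every hypothesis word."""
    return "entailment" if _words(hypothesis) <= _words(premise) else "not_entailment"


memorized_rate = random_premise_rate(memorizing_predictor, PAIRS)
sensitive_rate = random_premise_rate(premise_sensitive_predictor, PAIRS)
print(memorized_rate, sensitive_rate)  # 1.0 0.0
```

A predictor that actually uses the premise should drop toward zero once premises are scrambled. A rate that stays high is the pattern the paragraph above calls attestation bias.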

The surprising twist, and the thing worth taking away, is that "trivial" is doing real work in the question. On *multi-hop* reasoning across long contexts, LLMs can beat humans; it's the short, one-step deductions where they fall behind (see "Why do LLMs fail at simple deductive reasoning?"). So the features that break entailment aren't the complicated ones. They're the small structural hinges where surface familiarity quietly overrides logic: a non-factive verb, a buried presupposition, an unstated precondition the model never enumerates (see "Do language models fail at identifying unstated preconditions?").


Sources (9 notes)

Why do embedding contexts confuse LLM entailment predictions?

LLMs treat presupposition triggers and non-factive verbs as surface cues rather than computing their opposite semantic effects on entailments. This structural failure persists across prompts and models, suggesting models rely on surface patterns instead of structural analysis.

Why do language models accept false assumptions they know are wrong?

The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Does LLM grammatical performance decline with structural complexity?

LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.

Where exactly do language models fail at structural language tasks?

Language models excel with explicit discourse markers and simple grammar but fail predictably on implicit relations, embedded structures, and forward-planning discourse. These breakdowns map to failures in discourse intentionality and attention layers, not just linguistic surface structure.

Do LLMs predict entailment based on what they memorized?

McKenna et al. (2023) identified attestation bias: LLMs predict entailment based on whether the hypothesis appears in training data, not whether the premise actually supports it. Random premise experiments show models maintain high entailment predictions when hypotheses are attested, proving they respond to memorized propositions rather than premise-hypothesis relationships.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Why do LLMs fail at simple deductive reasoning?

The Minds vs. Machines benchmark shows LLMs excel at integrating information across multiple sentences while humans outperform them on straightforward logical inference. Capability type, not complexity level, determines who performs better.

Do language models fail at identifying unstated preconditions?

LLMs struggle not from lacking world knowledge but from failing to bring background conditions forward as relevant constraints. Prompting that forces explicit enumeration of preconditions raises accuracy from 30% to 85%, revealing the frame problem persists in statistical systems.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-evaluating constraints on LLM entailment reasoning. The precise question: *Which linguistic features — not task complexity — cause LLMs to fail at trivial, one-step entailment?* This remains open.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat as perishable benchmarks:
- Non-factive verbs ("believe," "pretend") and presupposition triggers systematically flip entailment; models miss the structural flip and read surface cues instead (~2023–2025).
- Syntactic embedding and recursive depth degrade performance predictably; performance scales with structural complexity, not semantic difficulty (~2023–2024).
- "Attestation bias": models predict entailment based on whether the *hypothesis* appears in training data, not whether the premise supports it — a memory heuristic, not logical reasoning (~2024–2025).
- LLMs excel at multi-hop reasoning over long contexts but fail at short, one-step deductions; the failure is concentrated in *surface-familiar* premises masking false presuppositions (~2024–2025).
- Enumeration of unstated preconditions — the core of trivial inference — remains a systematic blind spot (~2024–2026).

Anchor papers (verify; mind their dates):
- arXiv:2305.14825 (2023): "In-Context Semantic Reasoners rather than Symbolic Reasoners"
- arXiv:2503.19260 (2025): "Linguistic Blind Spots of Large Language Models"
- arXiv:2602.06176 (2026): "Large Language Model Reasoning Failures"
- arXiv:2506.08952 (2025): "Can LLMs Ground when they (Don't) Know"

Your task:
(1) RE-TEST EACH CONSTRAINT. For presupposition triggers, syntactic embedding, and attestation bias: have newer models (o1, Sonnet 4, Grok-3) or training methods (process reward models, synthetic fine-tuning on entailment) relaxed these failures? Does chain-of-thought or explicit negation-checking override attestation bias? Separate the durable problem — *why* models confuse surface familiarity with logical support — from the perishable limitation (solvable via prompting or architecture). Cite what resolved it.
(2) Surface the strongest CONTRADICTING work from the last 6 months: any papers claiming LLMs now reliably handle non-factives, embedded presuppositions, or precondition enumeration? What changed — scaling, training data curation, or evaluation methodology?
(3) Propose 2 research questions that assume the constraint may have shifted: (a) Does fine-tuning on presupposition-aware datasets (or negation-aware RL) now decouple attestation bias from entailment prediction? (b) Can multi-agent scaffolding (one agent lists presuppositions, another checks them) dissolve the enumeration blind spot without retraining?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
