SYNTHESIS NOTE

Do language models fail at identifying unstated preconditions?

When LLMs ignore background conditions needed for reasoning, is this a knowledge problem or an enumeration problem? Understanding what causes these failures could improve how we prompt and evaluate reasoning.

Synthesis note · 2026-05-01 · sourced from Reasoning Critiques

The classical frame problem (McCarthy and Hayes, 1969) asks how a reasoning system decides which background conditions are relevant when reasoning about an action. Most things stay the same when an action is performed (this unchanged background is the frame), and the system needs to know, without being told, which things do change. Solving this in symbolic AI required either explicit frame axioms (combinatorially expensive) or non-monotonic logics (mathematically delicate).
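To make "combinatorially expensive" concrete, here is a minimal sketch of how explicit frame axioms multiply. The toy domain, the action and fluent names, and the `frame_axioms` helper are all illustrative assumptions, not taken from any formal system discussed in the source.

```python
from itertools import product

# Hypothetical toy domain; every name here is illustrative only.
actions = ["walk_to_car_wash", "drive_to_car_wash", "open_door"]
fluents = ["car_location", "person_location", "door_open", "car_clean"]

# Assumed effects: the fluents each action is taken to change.
effects = {
    "walk_to_car_wash": {"person_location"},
    "drive_to_car_wash": {"person_location", "car_location"},
    "open_door": {"door_open"},
}

def frame_axioms(actions, fluents, effects):
    """Emit one explicit 'stays the same' axiom per (action, unaffected fluent) pair."""
    return [
        f"{fluent} is unchanged by {action}"
        for action, fluent in product(actions, fluents)
        if fluent not in effects[action]
    ]

axioms = frame_axioms(actions, fluents, effects)
print(len(axioms))  # 8 of the 12 action-fluent pairs need a frame axiom
```

Because most actions leave most fluents untouched, the number of axioms grows roughly with the product of the number of actions and the number of fluents, which is why the explicit approach does not scale.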

The Heuristic Override Benchmark (HOB) identifies a contemporary version of the same problem in LLMs. When a user asks "should I walk or drive to the car wash 50m away," the relevant unstated condition is that the car itself must end up at the car wash, so walking there defeats the purpose of the trip. This is a feasibility precondition that no human would need to state because it is presupposed by the entire setup. The model fails not because it lacks the world knowledge, which it has, but because it does not bring this background condition forward as relevant when the surface heuristic ("50m is walkable") is active.

This reframes the failure. It is not a noise-filtering failure (the standard shortcut-learning frame), and it is not a knowledge-retrieval failure (the standard hallucination frame). It is an enumeration failure: which of the indefinitely many things I know about the world should I treat as live constraints on this decision? Structured prompting that forces enumeration ("what must be true for walking to be feasible?") raises accuracy from around 30 percent to 85 percent on single instances. The intervention works precisely because it externalizes the enumeration step the model cannot reliably perform on its own.
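A structured prompt of this kind might look like the sketch below. The function name `build_enumeration_prompt` and the exact wording are assumptions for illustration; the benchmark's actual prompt template is not given in this note.

```python
def build_enumeration_prompt(question: str, options: list[str]) -> str:
    """Wrap a question so the model must list preconditions before answering.

    Hypothetical helper: the wording is an assumption, not the benchmark's template.
    """
    lines = [
        f"Question: {question}",
        "",
        "Before answering, list what must be true for each option to achieve the goal:",
    ]
    # One explicit enumeration step per candidate option.
    for i, option in enumerate(options, start=1):
        lines.append(f"{i}. What must be true for '{option}' to be feasible?")
    lines.append("")
    lines.append(
        "Then check each listed condition against the scenario and give your final answer."
    )
    return "\n".join(lines)


prompt = build_enumeration_prompt(
    "Should I walk or drive to the car wash 50m away?",
    ["walk", "drive"],
)
print(prompt)
```

The design choice is to move the enumeration out of the model's implicit reasoning and into the visible output, where a skipped precondition (the car has to be at the car wash) shows up as a missing line.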

The frame problem was once thought specific to symbolic systems. The HOB results suggest it persists, in a different form, in statistical systems trained on language. The substrate changed; the structural problem did not.

Original note title

The modern frame problem manifests as enumeration failure of unstated preconditions not noise filtering