SYNTHESIS NOTE

Do language models fail reasoning tests that humans pass?

Standard critiques claim LLMs lack real reasoning ability, but do humans actually perform better on content-independent reasoning tasks? This note examines whether the cognitive bar differs for artificial versus human intelligence.

Synthesis note · 2026-05-02 · sourced from Linguistics, NLP, NLU

Lampinen et al. relitigate a fifty-year cognitive-science debate using LLM behavior as the new evidence. The classical symbolist line (Marcus, Fodor) defines abstract reasoning as content-independent: "X is bigger than Y" implies "Y is smaller than X" regardless of what X and Y are, and a system whose reasoning depends on the values of X and Y is not really reasoning. By that criterion, current LLMs fail. But the inconvenient parallel evidence Lampinen et al. marshal is that humans fail it too: across the Wason selection task, syllogisms, and natural language inference (NLI), human reasoning is heavily content-sensitive, in the same patterns LMs show.
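
As a minimal sketch of what the content-independence criterion demands, a purely formal rule can be written so that it never inspects what X and Y are. The names `Relation`, `CONVERSE`, and `converse` are illustrative assumptions, not anything taken from Lampinen et al.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    """A binary comparison such as 'X is bigger than Y'."""
    predicate: str
    left: str
    right: str


# Hypothetical converse table, for illustration only.
CONVERSE = {"bigger": "smaller", "smaller": "bigger"}


def converse(rel: Relation) -> Relation:
    """Apply the rule from the form of the premise alone.

    The terms are swapped and the predicate is flipped; the meaning
    of the terms is never consulted.
    """
    return Relation(CONVERSE[rel.predicate], rel.right, rel.left)


# The same rule applied to familiar, counterfactual, and nonsense content.
familiar = converse(Relation("bigger", "elephant", "mouse"))
counterfactual = converse(Relation("bigger", "mouse", "elephant"))
nonsense = converse(Relation("bigger", "blicket", "wug"))
```

The rule handles the counterfactual and nonsense premises exactly as it handles the familiar one. The content-effects findings discussed here are that neither humans nor LMs behave this uniformly.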

The conclusion forks. Either the criterion is wrong, or human cognition isn't doing what the symbolist account claims it does. Lampinen et al. lean toward the former: if humans and LMs both succeed and fail along the same content-form axis, a connectionist account in which inferences are grounded in learned semantics may describe both better than the symbolist account describes either. This converges with the note "LLM semantic grounding is tri-partite: functional grounding is strong, social grounding is weak, causal grounding is indirect". The grounding picture is more nuanced than "absent or present," and the same nuance applies to human reasoning, just with different mixtures.

For Language as Event, this insight is load-bearing. The standard critique, "LLMs don't really reason, they just match patterns," collapses into a parallel claim about humans: humans also don't reason in pure logical form; we reason in patterns weighted by semantic content, and we reach correct logical conclusions partly because the content happens to support them. In Saussurean terms, there is no actual reasoner that operates over pure langue. Reasoning always happens in parole, in particular utterances with particular content. The content-effects literature is the empirical evidence that the langue/parole separation breaks at the cognitive level too, not just at the linguistic level.

The symmetry claim does not absolve LLMs of their distinctive failure modes. It does block one specific framing: "LLMs fail where humans succeed" is not what the data show. The data show that both succeed and fail along the same content-form axis. They diverge elsewhere (in the override capacity, in the handling of novel structure, in the relation to grounded experience), but content-sensitivity itself is shared, and a criterion that uses it to separate real reasoning from fake reasoning would classify human reasoning as fake too.

Inquiring lines that read this note (22)

This note is a source for research framings under the following lines of inquiry.

Why can't humans reliably detect AI-generated text despite measurable linguistic signatures?

What makes specific clarifying questions more effective than generic ones?

Why does item discrimination matter more than surface-level question plausibility?

Why do benchmark improvements fail to reflect actual reasoning quality?

Why do reasoning models fail at systematic problem-solving and search?

Is embodied interaction necessary for language meaning and genuine agency?

Does the langue-parole distinction apply to human reasoning too?

How do language models establish social grounding in human dialogue?

What cognitive capacities do LLMs actually lack that commentary assumes they have?

How do training data properties shape reasoning capability development?

Can reasoning skills trained on law improve performance in STEM?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

Do language models develop causal world models or rely on statistical patterns?

How do internal representations compare to human cognitive structures?

Do accurate-looking LLM outputs hide structural failures in learning and reasoning?

How do language models inherit human biases from training data?

Which knowledge types do LLMs handle better than humans in reasoning tasks?

How do we evaluate AI systems when user perception misleads actual performance?

Does AI fluency substitute for verifiable accuracy in human judgment?

Does the Turing test actually measure intelligence or just mimicry?

Do language models perform faithful symbolic reasoning independent of semantic grounding?

Why do LLMs fail at faithful autoformalisation of reasoning problems?

Related concepts in this collection (1)

Do large language models reason symbolically or semantically?
Can LLMs follow explicit logical rules when those rules contradict their training knowledge? Testing whether reasoning operates independently of semantic associations reveals what computational mechanisms actually drive LLM multi-step inference.
same property described from the LLM side
