SYNTHESIS NOTE

Do harder reasoning tasks trigger more semantic bias?

Does the difficulty of a logical task determine how much semantic content influences reasoning? This matters because it reveals whether we can isolate 'pure' logical reasoning in benchmarks.

Synthesis note · 2026-05-02 · sourced from Linguistics, NLP, NLU

Lampinen et al. observe a difficulty-modulation pattern: content effects are weakest on NLI (natural language inference, a relatively simple task), stronger on syllogism validity judgment, and strongest on the Wason selection task. The Wason task is the hardest of the three: even mathematics undergraduates and academic mathematicians score below 50% on its abstract version. The directional claim is clean: as the logical demands of a task exceed available working memory (in humans) or circuit capacity (in LMs), the system falls back on semantic priors. Humans and LMs show this fallback in the same direction along the same difficulty axis.

The pattern explains a recurring frustration with reasoning benchmarks. Benchmarks designed to test "purely logical" reasoning still show heavy content sensitivity, and benchmark designers often treat this as a confound to be controlled. The Lampinen finding suggests it cannot simply be controlled away: content sensitivity is most pronounced exactly where the benchmark is most demanding. The harder the task, the more believability bleeds into the result. A reasoning benchmark whose items vary in content believability is partly a believability test rather than a logic test, and the harder the items, the more this is true.
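One way to check how much of a benchmark score is believability rather than logic is to compute, per task, the accuracy gap between belief-consistent and belief-violating items. The sketch below is a minimal illustration under stated assumptions, not the analysis used by Lampinen et al.: the function name `content_effect_by_task`, the record layout, and the toy records are all hypothetical, and the toy values are invented only to show the arithmetic.

```python
from collections import defaultdict


def content_effect_by_task(records):
    """Return {task: accuracy(belief-consistent) - accuracy(belief-violating)}.

    Each record is a hypothetical (task, belief_consistent, correct) tuple:
    task is a label, belief_consistent says whether the logically correct
    answer agrees with prior belief, and correct says whether the answer
    given was logically correct.
    """
    # tallies[task][belief_consistent] = [number correct, number of items]
    tallies = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for task, belief_consistent, correct in records:
        cell = tallies[task][belief_consistent]
        cell[0] += int(correct)
        cell[1] += 1

    effects = {}
    for task, cells in tallies.items():
        if cells[True][1] == 0 or cells[False][1] == 0:
            continue  # the gap is undefined unless both conditions are present
        acc_consistent = cells[True][0] / cells[True][1]
        acc_violating = cells[False][0] / cells[False][1]
        effects[task] = acc_consistent - acc_violating
    return effects


# Toy records, invented for illustration only (not data from any study).
toy_records = [
    ("easy", True, True), ("easy", True, True),
    ("easy", False, True), ("easy", False, True),
    ("hard", True, True), ("hard", True, True),
    ("hard", False, True), ("hard", False, False),
]
effects = content_effect_by_task(toy_records)
print(effects)
```

A gap near zero means answers track logical form regardless of believability. Under the difficulty-modulation pattern described above, the gap would be expected to widen as tasks get harder, so reporting it alongside raw accuracy shows how much of a score is a believability effect.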

The connection to "Why do LLMs fail at simple deductive reasoning?" is partial but illuminating. That note shows LMs and humans diverging on certain reasoning surfaces: LMs handle long multi-hop reasoning well but struggle with simple deductions that humans find obvious. Lampinen shows they converge on the difficulty-modulation pattern itself, even where their absolute capabilities differ. Both observations can be true: humans and LMs occupy different absolute positions on a difficulty curve, but both slide toward semantic fallback as difficulty rises.

For False Punditry, the connection is straightforward and uncomfortable. Pundits and LLMs both reach for plausible-sounding content when the underlying logic is hard, and they do so through the same failure mode. The pundit who confidently restates a familiar belief when asked a hard question and the LLM that confabulates a believable answer when the logic exceeds its circuits are not merely analogous; they are mechanistically similar. Both are systems whose reasoning capacity has been exceeded and which fall back on a semantic prior that sounds right. Recognizing this similarity is more diagnostically useful than insisting on the difference.

Inquiring lines that read this note 6

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

What makes emotional alignment more effective than logic when reasoning errors are exposed?

Can prompting inject entirely new knowledge into language models?

Does irrelevant content degrade reasoning even when it fits the context window?

Why do reasoning models fail at systematic problem-solving and search?

What makes a background condition relevant to a specific reasoning task?

What factors beyond surface content determine how readers extract meaning differently?

What makes semantic attacks harder to defend against than algorithmic ones?

When do additional thinking tokens stop improving reasoning performance?

What causes reasoning quality to degrade during long research tasks?

How does example difficulty affect learning efficiency in language models?

Why does target probability matter more than task logical complexity?

Related concepts in this collection 1

Why do LLMs fail at simple deductive reasoning? LLMs excel at complex multi-hop reasoning across sentences but struggle with trivial deductions humans find obvious. What explains this counterintuitive reversal in capability?
divergence on absolute capability, convergence on difficulty-modulation pattern

Original note title

content effects scale with task difficulty — the harder the abstract task the more semantic content takes over from logical form, in humans and LMs
