Do harder reasoning tasks trigger more semantic bias?
Does the difficulty of a logical task determine how much semantic content influences reasoning? This matters because it reveals whether we can isolate 'pure' logical reasoning in benchmarks.
Lampinen et al. observe a difficulty-modulation pattern: content effects are weakest on NLI (a relatively simple inference task), stronger on syllogism validity judgment, and strongest on the Wason selection task. The Wason task is the hardest of the three; even mathematics undergraduates and academic mathematicians score below 50% on its abstract version. The directional claim is clean: as the logical demands of a task exceed available capacity (working memory in humans, circuit capacity in LMs), the system falls back on semantic priors. Both humans and LMs show this fallback in the same direction along the same difficulty axis.
The pattern explains a recurring frustration with reasoning benchmarks. Benchmarks designed to test "purely logical" reasoning still show heavy content sensitivity, and benchmark designers often treat this as a confound to be controlled. The Lampinen et al. finding suggests it cannot simply be controlled away: content sensitivity is most pronounced exactly where the benchmark is most demanding. The harder the task, the more believability bleeds into the result. A reasoning benchmark whose items vary in content believability is partly a believability test, not a logic test, and the harder the items, the more this is true.
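One way to make this concrete is to report a per-task content effect: accuracy on items whose believability agrees with the logically correct answer, minus accuracy on items where believability points the other way. The sketch below is a minimal illustration under stated assumptions. The records in `RESULTS`, the task labels, and the function name `content_effect` are all hypothetical and invented for the example; none of the numbers come from Lampinen et al.

```python
from collections import defaultdict

# Hypothetical toy records, invented purely for illustration (not data from
# Lampinen et al.). Each record is:
#   (task, believability_agrees_with_correct_answer, answered_correctly)
RESULTS = [
    ("nli", True, True), ("nli", True, True),
    ("nli", False, True), ("nli", False, True),
    ("syllogism", True, True), ("syllogism", True, True),
    ("syllogism", False, True), ("syllogism", False, False),
    ("wason", True, True), ("wason", True, True),
    ("wason", False, False), ("wason", False, False),
]


def content_effect(results):
    """Per task: accuracy on belief-consistent items minus accuracy on
    belief-violating items. Larger values mean more content sensitivity."""
    # counts[task][condition] = [n_correct, n_total]
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for task, consistent, correct in results:
        cell = counts[task][consistent]
        cell[0] += int(correct)
        cell[1] += 1

    effects = {}
    for task, cells in counts.items():
        # Assumes every task has at least one item in each condition.
        acc_consistent = cells[True][0] / cells[True][1]
        acc_violating = cells[False][0] / cells[False][1]
        effects[task] = acc_consistent - acc_violating
    return effects


effects = content_effect(RESULTS)
for task, gap in effects.items():
    print(f"{task}: content effect = {gap:+.2f}")
```

If the per-task gap grows as tasks get harder, a single headline accuracy number mixes logic with believability, so reporting the gap alongside accuracy is one hedge against reading a believability test as a logic test.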
The connection to the note "Why do LLMs fail at simple deductive reasoning?" is partial but illuminating. That note shows that LMs and humans diverge on certain reasoning surfaces: LMs handle long multi-hop reasoning well but struggle with simple deductions that humans find obvious. Lampinen et al. show that the two converge on the difficulty-modulation pattern itself, even where their absolute capabilities differ. Both observations can be true: humans and LMs occupy different absolute positions on a difficulty curve, but both slide toward semantic fallback as difficulty rises.
For False Punditry, the connection is straightforward and uncomfortable. Pundits and LLMs both reach for plausible-sounding content when the underlying logic is hard, and they do so through the same failure mode. The pundit who confidently restates a familiar belief when asked a hard question, and the LLM that confabulates a believable answer when the logic exceeds its circuits, are not merely analogous; they are mechanistically similar. Both are systems whose reasoning capacity has been exceeded and which fall back on a semantic prior that sounds right. Recognizing this similarity is more diagnostically useful than insisting on the difference.
Inquiring lines that use this note as a source 6
This note is a source for these synthesized inquiries.
- What makes emotional alignment more effective than logic when reasoning errors are exposed?
- Does irrelevant content degrade reasoning even when it fits the context window?
- What makes a background condition relevant to a specific reasoning task?
- What makes semantic attacks harder to defend against than algorithmic ones?
- What causes reasoning quality to degrade during long research tasks?
- Why does target probability matter more than task logical complexity?
Related concepts in this collection 1
- Why do LLMs fail at simple deductive reasoning?
  LLMs excel at complex multi-hop reasoning across sentences but struggle with trivial deductions humans find obvious. What explains this counterintuitive reversal in capability?
  divergence on absolute capability, convergence on difficulty-modulation pattern
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Language models show human-like content effects on reasoning tasks
- Premise Order Matters in Reasoning with Large Language Models
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
- Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
- Can Large Language Models do Analytical Reasoning?
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning
Original note title
content effects scale with task difficulty — the harder the abstract task the more semantic content takes over from logical form, in humans and LMs