INQUIRING LINE

Why do reasoning models perform poorly at theory of mind tasks?

This explores why models specifically optimized for step-by-step reasoning (like o1 and Claude 3.7 Sonnet) actually get *worse* at inferring what other people think, want, or believe — and what that reveals about the difference between logical reasoning and social reasoning.


Reasoning-optimized models underperform at theory of mind, that is, tasks that require tracking what someone else believes, including false beliefs and counterfactuals. The striking finding across the corpus is that the gap is not small: on benchmarks like Decrypto, Claude 3.7 Sonnet and o1 score *worse* than older models, worse than humans, and even worse than simple word-embedding baselines [Why do reasoning models fail at theory of mind tasks?; Why do advanced reasoning models fail at understanding minds?]. The unsettling part is the direction of the effect: more reasoning effort doesn't help and may actively interfere.

The leading explanation is that social reasoning is a categorically different kind of cognition from formal reasoning, not just a harder version of it. Formal reasoning is sequential: derive step B from step A and chain forward. But tracking minds seems to require holding *several* candidate belief-states in play at once: what I know, what you know, what you think I know. When models are pushed to produce long deductive chains, they generate more text, but the extra text neither improves accuracy nor transfers to similar scenarios [Why do reasoning models struggle with theory of mind tasks?]. Tellingly, approaches that use *shorter* Bayesian hypothesis-tracking, maintaining multiple models in parallel rather than deriving one answer linearly, outperform the long-chain reasoners. Architectures that force explicit belief-tracking beat the LLM-alone setup, which suggests the deficit is structural, not just a matter of more training [Do large language models genuinely simulate mental states?].
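To make "holding several candidate belief-states in play at once" concrete, here is a minimal Python sketch of weighted hypothesis tracking on a classic false-belief setup. It is a toy under stated assumptions, not the method of any system mentioned in the notes: the function names, the two-location world, and the probabilities (`p_agent_saw_move`, `p_consistent`) are all hypothetical values chosen for illustration.

```python
def normalize(weights):
    """Rescale hypothesis weights so they sum to 1."""
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


def update_on_move(weights, new_location, p_agent_saw_move):
    """Revise the distribution over the agent's belief after the object is moved.

    With probability p_agent_saw_move the agent's belief follows the object;
    otherwise the agent keeps whatever it believed before (a false belief).
    """
    updated = {h: 0.0 for h in weights}
    for old_belief, w in weights.items():
        updated[new_location] += w * p_agent_saw_move
        updated[old_belief] += w * (1.0 - p_agent_saw_move)
    return normalize(updated)


def update_on_action(weights, searched_location, p_consistent=0.9):
    """Bayesian reweighting: hypotheses matching the agent's observed search gain weight."""
    posterior = {}
    for belief, w in weights.items():
        likelihood = p_consistent if belief == searched_location else 1.0 - p_consistent
        posterior[belief] = w * likelihood
    return normalize(posterior)


# Sally watched the marble go into the basket, so all weight starts there.
sally = {"basket": 1.0, "box": 0.0}

# Anne moves the marble to the box while Sally is away
# (assumed 5% chance that Sally saw the move anyway).
sally = update_on_move(sally, "box", p_agent_saw_move=0.05)

# Both hypotheses stay alive; the prediction is the heavier one.
predicted_search = max(sally, key=sally.get)

# If Sally then surprises us by searching the box, the weights shift without restarting.
after_surprise = update_on_action(sally, "box")
```

The point of the design is that no single chain of deduction commits to one answer early: every candidate belief keeps a weight, and new evidence reweights all of them in one short step.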

There's a second, more deflationary thread worth sitting with: maybe the models were never really doing theory of mind in the first place, and the benchmarks let them fake it. Current ToM benchmarks turn out to be solvable through pattern matching alone: supervised fine-tuning matches reinforcement learning, and templated artifacts and distribution biases let surface-level recognition score well without any genuine mental-state modeling [Can language models solve ToM benchmarks without real reasoning?]. If that's true, then 'reasoning models get worse' may partly mean the reasoning process disrupts the shortcut that older models were quietly relying on, exposing that neither was truly mind-reading.

This connects to a broader pattern in how these models fail socially. They accommodate false presuppositions even when they demonstrably know the correct facts [Why do language models accept false assumptions they know are wrong?], and frontier models that solve problems alone collapse when they have to collaborate, converging on agreement regardless of correctness [Why do language models fail at collaborative reasoning?]. The common thread: knowing a fact and modeling another agent's relationship to that fact are different competencies, and optimizing hard for the first doesn't deliver the second. Scale interacts with this strangely too: under RL training on social tasks, larger models develop transferable belief-tracking while smaller ones learn invisible shortcuts that match the accuracy without the reasoning [Does reinforcement learning on theory of mind collapse with model scale?].
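One way to see why "knowing a fact" and "handling another agent's relationship to that fact" can come apart is to score them separately. The Python sketch below is illustrative only: the record fields (`knows_fact`, `rejected_presupposition`, `proposal_correct`, `partner_agreed`) are hypothetical names, and the records are made-up placeholders, not results from FLEX or any other benchmark.

```python
def accommodation_rate_when_known(records):
    """Share of items where the model knew the fact yet still accepted the false presupposition."""
    known = [r for r in records if r["knows_fact"]]
    if not known:
        return None
    accommodated = sum(1 for r in known if not r["rejected_presupposition"])
    return accommodated / len(known)


def agreement_by_correctness(dialogues):
    """Agreement rate, split by whether the proposal being agreed to was correct.

    Similar rates in both groups would mean agreement is insensitive to correctness.
    """
    rates = {}
    for correct in (True, False):
        group = [d for d in dialogues if d["proposal_correct"] == correct]
        if group:
            rates[correct] = sum(d["partner_agreed"] for d in group) / len(group)
    return rates


# Placeholder records for illustration only (not real evaluation data).
records = [
    {"knows_fact": True, "rejected_presupposition": False},
    {"knows_fact": True, "rejected_presupposition": True},
    {"knows_fact": True, "rejected_presupposition": False},
    {"knows_fact": False, "rejected_presupposition": False},
]
dialogues = [
    {"proposal_correct": True, "partner_agreed": True},
    {"proposal_correct": True, "partner_agreed": True},
    {"proposal_correct": False, "partner_agreed": True},
    {"proposal_correct": False, "partner_agreed": False},
]

gap = accommodation_rate_when_known(records)
rates = agreement_by_correctness(dialogues)
```

Conditioning on `knows_fact` is the key choice here: it removes plain ignorance as an explanation, so whatever accommodation remains reflects the social failure rather than a knowledge gap.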

The thing you might not have expected to want to know: this debate doubles as a probe into what 'reasoning' even means. One camp argues these collapses aren't reasoning failures at all but *execution* failures: text-only models can't carry out long procedures even when they know the algorithm, and giving them tools fixes it [Are reasoning model collapses really failures of reasoning?]. Another shows failures track instance *novelty*, not task complexity: models fit patterns from similar examples rather than learning general algorithms [Do language models fail at reasoning due to complexity or novelty?]. Theory of mind may be the cleanest place to see this, because you can't pattern-match your way to genuinely tracking what someone else believes, and that's exactly where the reasoning models break.


Sources (10 notes)

Why do reasoning models fail at theory of mind tasks?

Claude 3.7 Sonnet and o1 fail measurably at Decrypto benchmark tasks testing representational change, false belief, and counterfactual reasoning—tasks where they score worse than both humans and simple word-embedding baselines. The evidence suggests formal reasoning optimization actively degrades social reasoning capability.

Why do advanced reasoning models fail at understanding minds?

Claude 3.7 Sonnet and o1 underperform older models on ToM benchmarks like Decrypto. Increased reasoning effort does not improve social cognition and may actively interfere with it.

Why do reasoning models struggle with theory of mind tasks?

Reasoning models fail to outperform vanilla LLMs on theory of mind tasks, produce longer but unhelpful traces, and show no generalization to similar scenarios. ThoughtTracing's success using shorter Bayesian hypothesis tracking suggests social reasoning demands simultaneous multiple-model maintenance, not sequential derivation.

Do large language models genuinely simulate mental states?

ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.

Can language models solve ToM benchmarks without real reasoning?

Supervised fine-tuning matches reinforcement learning performance on ToM tasks, suggesting models exploit structural vulnerabilities rather than develop genuine reasoning. Distribution biases and templated artifacts allow surface-level pattern recognition to achieve competitive generalization.

Why do language models accept false assumptions they know are wrong?

The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.

Why do language models fail at collaborative reasoning?

Frontier LLMs that solve problems alone fail when collaborating, achieving >90% agreement regardless of correctness. Self-play preference training improves outcomes by 16.7%, suggesting social skills for effective disagreement can be trained.

Does reinforcement learning on theory of mind collapse with model scale?

7B models develop explicit, transferable belief-tracking under RL, while smaller models achieve comparable accuracy through shortcut learning that lacks interpretable reasoning traces. The mismatch between accuracy and reasoning quality is invisible without inspecting step-by-step outputs.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Do language models fail at reasoning due to complexity or novelty?

Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length succeeds if the model was trained on similar instances.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI researcher evaluating whether reasoning-model underperformance on theory of mind (ToM) tasks—documented in early 2025—remains a live constraint or has been relaxed by newer architectures, training methods, or evaluation approaches.

What a curated library found — and when (dated claims, not current truth): Findings span 2022–2026 and center on a paradox:
• Reasoning models (o1, Claude 3.7 Sonnet) score *worse* than older models and humans on ToM benchmarks like Decrypto; more reasoning effort doesn't help and may interfere (~2025).
• ToM appears categorically different from formal reasoning: it requires holding multiple candidate belief-states in parallel, not linear deduction; shorter Bayesian hypothesis-tracking outperforms long-chain reasoners (~2025).
• Current ToM benchmarks may be solvable via pattern matching alone; reasoning disrupts the shortcut older models relied on, exposing neither was truly mind-reading (~2025).
• Models fail to reject false presuppositions despite knowing correct facts, and collaborative reasoning degrades below solo performance (~2025).
• Scale produces a collapse under RL on social tasks: larger models develop transferable belief-tracking; smaller ones learn invisible shortcuts (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2401.05302 (2024-01): "Theory of Mind abilities of Large Language Models in Human-Robot Interaction: An Illusion"
• arXiv:2502.11881 (2025-02): "Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models"
• arXiv:2504.01698 (2025-04): "Do Theory of Mind Benchmarks Need Explicit Human-like Reasoning in Language Models?"
• arXiv:2602.06176 (2026-02): "Large Language Model Reasoning Failures"

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above—especially the paradox that reasoning hurts ToM, that benchmarks admit pattern-matching shortcuts, and that scale produces opposite dynamics—judge whether newer models (o3, Claude 4+), tool-grounding (persistent belief-stores, explicit multi-agent scaffolding), or refined evaluation harnesses (adversarial presuppositions, out-of-distribution false beliefs) have since RELAXED or OVERTURNED it. Separate the durable question (does genuine mind-modeling require a different cognitive architecture?) from the perishable limitation (do *current* text-only reasoning models fail?). Cite what resolved it.
(2) SURFACE CONTRADICTING WORK. Identify the strongest 2025–2026 papers that dispute the "reasoning paradox" narrative or propose solutions (agentic scaffolding, cognitive tools, grounding mechanisms). Flag which camp you find more credible and why.
(3) PROPOSE 2 FORWARD QUESTIONS that assume the regime has shifted: e.g., "If tool-augmented reasoning models can now track beliefs, does the bottleneck move to *counterfactual revision* under evidence?" or "Can multimodal or embodied reasoning models avoid the social-reasoning collapse because they don't rely on text-only shortcuts?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
