Why do reasoning models struggle with theory of mind tasks?
Extended reasoning training helps with math and coding but not social cognition. We explore whether reasoning models can track mental states the way they solve formal problems, and what that reveals about the structure of social reasoning.
ThoughtTracing is an algorithm for mental state tracking inspired by sequential Monte Carlo (SMC). Its most important finding comes not from its own performance but from what it reveals about how existing reasoning models behave on theory of mind (ToM) tasks.
Four behavioral patterns emerge:
- Reasoning models don't consistently outperform vanilla LLMs using chain-of-thought. The extended reasoning training that dramatically improves math and coding does not transfer to social cognition.
- They fail to generalize to similar scenarios. A reasoning model that correctly tracks mental states in one ToM scenario fails on structurally similar ones, suggesting pattern matching rather than a generalizable mental state tracking mechanism.
- They produce significantly longer reasoning traces for ToM than for factual questions. The model "knows" social reasoning is hard and allocates more tokens to it, but this effort is unproductive.
- Reasoning effort (output length) does not correlate with performance. More thinking does not help. This is the sharpest contrast with formal domains, where longer chains generally improve accuracy up to a threshold.
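The fourth pattern is a claim about correlation, so it may help to see how such a check could be run. The sketch below is only an illustration: `pearson` is a hand-written helper, and the `trace_tokens` and `correct` lists are made-up placeholder values, not results from the paper. A coefficient near zero would be consistent with "more thinking does not help"; a clearly positive one would contradict it.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; NaN if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)

# Placeholder data (hypothetical, NOT from the paper):
# one entry per ToM question answered by a reasoning model.
trace_tokens = [420, 1310, 980, 2750, 610, 1890, 3300, 760]  # output length
correct = [1, 0, 1, 1, 0, 0, 1, 0]                           # 1 = right answer

r = pearson(trace_tokens, correct)
print(f"length/accuracy correlation: {r:.2f}")
```

With binary correctness labels this is the point-biserial correlation, which is simply Pearson's coefficient applied to a 0/1 variable.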
These patterns suggest social reasoning is "a different category" from mathematical or programming reasoning "where reasoning models typically excel." The authors explicitly position this as a domain where inference-time reasoning research has been neglected.
The ThoughtTracing algorithm itself offers a clue about what social reasoning requires that formal reasoning doesn't: hypothesis-driven Bayesian tracking of multiple evolving mental state possibilities, weighted by observation likelihood. This is structurally different from derivational chains. Social reasoning requires maintaining multiple simultaneous models of what different agents believe, not sequentially deriving conclusions from premises. The algorithm outperforms reasoning models (including o3-mini and R1) using "significantly shorter reasoning traces" — suggesting efficiency comes from the right structure, not more tokens.
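To make the structural contrast concrete, here is a minimal particle-filter-style sketch of weighted hypothesis tracking for a single agent in a Sally-Anne-style scenario. It is not the authors' implementation of ThoughtTracing. The function names (`track_belief`, `toy_likelihood`), the hypothesis labels, the observation strings, and the 0.9/0.1 likelihood values are all hypothetical stand-ins chosen for illustration.

```python
import random
from collections import Counter

def normalize(weights):
    """Scale weights to sum to 1; fall back to uniform if all are zero."""
    total = sum(weights)
    if total == 0:
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]

def effective_sample_size(weights):
    """ESS of normalized weights; low values mean few hypotheses dominate."""
    return 1.0 / sum(w * w for w in weights)

def track_belief(observations, hypotheses, likelihood, n_particles=50, seed=0):
    """Maintain weighted hypotheses about an agent's belief across observations."""
    rng = random.Random(seed)
    # Spread particles evenly over the candidate hypotheses (uniform prior).
    particles = [hypotheses[i % len(hypotheses)] for i in range(n_particles)]
    weights = [1.0 / n_particles] * n_particles
    for obs in observations:
        # Reweight each hypothesis by how well it explains the observation.
        weights = normalize(
            [w * likelihood(obs, h) for w, h in zip(weights, particles)]
        )
        # Resample when the weights have collapsed onto few particles.
        if effective_sample_size(weights) < n_particles / 2:
            particles = rng.choices(particles, weights=weights, k=n_particles)
            weights = [1.0 / n_particles] * n_particles
    posterior = Counter()
    for h, w in zip(particles, weights):
        posterior[h] += w
    return dict(posterior)

# Hypothetical hand-written likelihood table (an assumption of this sketch).
CONSISTENT_WITH = {
    "walks_to_basket": "believes_marble_in_basket",
    "reaches_into_basket": "believes_marble_in_basket",
    "walks_to_box": "believes_marble_in_box",
}

def toy_likelihood(observation, hypothesis):
    """0.9 if the observed action fits the hypothesized belief, else 0.1."""
    return 0.9 if CONSISTENT_WITH.get(observation) == hypothesis else 0.1

HYPOTHESES = ["believes_marble_in_basket", "believes_marble_in_box"]
posterior = track_belief(
    ["walks_to_basket", "reaches_into_basket"], HYPOTHESES, toy_likelihood
)
print(posterior)
```

The point of the sketch is the shape of the computation: several candidate mental states are held at once and reweighted by each new observation, rather than a single conclusion being derived step by step. Tracking several agents would mean one such hypothesis set per agent.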
Inquiring lines that use this note as a source (17)
This note is a source for these synthesized inquiries.
- Why do reasoning models perform poorly at theory of mind tasks?
- Why do reasoning models perform worse on theory of mind tasks?
- Why does reasoning effort fail to improve theory of mind performance?
- Does formal reasoning training actively degrade social reasoning ability?
- What makes reasoning models worse at understanding people?
- Do longer reasoning traces actually improve theory of mind accuracy?
- Can theory of mind models generalize across structurally similar scenarios?
- What makes social reasoning fundamentally different from mathematical reasoning?
- Why does increasing reasoning not improve AI social reasoning performance?
- How do emotional and social simulations enable better hypothetical reasoning?
- How much does extended thinking actually improve model reasoning ability?
- Why does additional reasoning effort not improve theory of mind performance?
- Can reasoning scaffolds help with nuanced judgment tasks like empathy?
- Why might social reasoning work differently than formal logical reasoning?
- Why does reasoning training improve math but hurt knowledge tasks?
- Why does reasoning volume fail to improve theory of mind performance?
- What makes social reasoning fundamentally different from formal logical reasoning?
Related concepts in this collection (6)
- Why do correct reasoning traces contain fewer tokens?
  In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges assumptions that longer reasoning traces indicate better reasoning, and raises questions about what length actually signals.
  the ToM finding inverts the usual pattern: on ToM, reasoning models produce LONGER traces that DON'T help, while ThoughtTracing uses shorter traces that DO help
- Why do reasoning models fail at theory of mind tasks?
  Recent LLMs optimized for formal reasoning dramatically underperform at social reasoning tasks like false belief and recursive belief modeling. This explores whether reasoning optimization actively degrades the ability to track other agents' mental states.
  independent confirmation from Decrypto: the formal-reasoning ↔ social-reasoning tension is robust across multiple benchmarks
- Do large language models use one reasoning style or many?
  Explores whether LLMs share a universal strategic reasoning approach or develop distinct styles tailored to specific game types. Understanding this matters for predicting model behavior in competitive versus cooperative scenarios.
  game-based strategic reasoning is similarly fragmented; social/ToM reasoning is yet another non-transferable domain
- When does explicit reasoning actually help model performance?
  Explicit reasoning improves some tasks but hurts others. What determines whether step-by-step reasoning chains are beneficial or harmful for a given problem?
  ToM extends the domain taxonomy: formal (reasoning helps) vs. nuanced judgment (reasoning hurts) vs. social (reasoning is irrelevant)
- Do iterative refinement methods suffer from overthinking?
  Iterative refinement approaches like Self-Refine structurally resemble token-level overthinking in o1-like models. Does revision across multiple inference calls reproduce the same accuracy degradation seen within single inferences?
  the ToM finding is a cross-domain instance of the overthinking pattern: reasoning models allocate more tokens to social reasoning but the additional effort is unproductive, confirming that sequential token extension fails outside derivational domains
- Can language models track how minds change during persuasion?
  Do LLMs understand evolving mental states in persuasive dialogue, or do they only capture fixed attitudes? This explores whether models can update their reasoning as a person's beliefs shift across conversation turns.
  the static/dynamic split provides a finer-grained taxonomy: social reasoning is not uniformly hard but splits into static (near-human) and dynamic (significantly worse), with CoT helping strategy prediction but not dynamic mental state tracking
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models
- A Systematic Review on the Evaluation of Large Language Models in Theory of Mind Tasks
- Do Theory of Mind Benchmarks Need Explicit Human-like Reasoning in Language Models?
- Theory of Mind abilities of Large Language Models in Human-Robot Interaction : An Illusion?
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research
- MetaMind: Modeling Human Social Thoughts with Metacognitive Multi-Agent Systems
- PersuasiveToM: A Benchmark for Evaluating Machine Theory of Mind in Persuasive Dialogues
Original note title
social reasoning differs categorically from formal reasoning — reasoning effort does not correlate with ToM performance