SYNTHESIS NOTE

Why do reasoning models struggle with theory of mind tasks?

Extended reasoning training helps with math and coding but not social cognition. We explore whether reasoning models can track mental states the way they solve formal problems, and what that reveals about the structure of social reasoning.

Synthesis note · 2026-02-22 · sourced from Theory of Mind

ThoughtTracing, an algorithm for mental state tracking inspired by sequential Monte Carlo (SMC), matters most here not for its own performance but for what it reveals about how existing reasoning models handle theory of mind (ToM) tasks.

Four behavioral patterns emerge:

1. Reasoning models don't consistently outperform vanilla LLMs using chain-of-thought. The extended reasoning training that dramatically improves math and coding does not transfer to social cognition.
2. They fail to generalize to similar scenarios. A reasoning model that correctly tracks mental states in one ToM scenario fails on structurally similar ones, which suggests pattern matching rather than a generalizable mental state tracking mechanism.
3. They produce significantly longer reasoning traces for ToM than for factual questions. The model "knows" social reasoning is hard and allocates more tokens to it, but this effort is unproductive.
4. Reasoning effort (output length) does not correlate with performance. More thinking does not help. This is the sharpest contrast with formal domains where longer chains generally improve accuracy up to a threshold.
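
The fourth pattern is a claim about correlation, so it can be checked directly on evaluation logs. The following is a minimal sketch of such a check. The function names and the toy numbers are hypothetical, chosen so the arithmetic is easy to follow; they are not measurements from any model or from the ThoughtTracing work.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def length_accuracy_correlation(trace_lengths: list[int], correct: list[bool]) -> float:
    """Correlate reasoning-trace length (in tokens) with per-item correctness.

    With a binary outcome this is the point-biserial correlation.
    """
    return pearson(
        [float(n) for n in trace_lengths],
        [1.0 if c else 0.0 for c in correct],
    )


# Hypothetical toy data, NOT real results: four items with increasing trace length.
toy_lengths = [100, 200, 300, 400]

# Case A: correctness is unrelated to length (the pattern described for ToM).
r_unrelated = length_accuracy_correlation(toy_lengths, [True, False, False, True])

# Case B: longer traces go with correct answers (the pattern described for formal domains).
r_helpful = length_accuracy_correlation(toy_lengths, [False, False, True, True])

print(f"unrelated: r = {r_unrelated:.3f}")
print(f"helpful:   r = {r_helpful:.3f}")
```

A coefficient near zero on ToM items alongside a clearly positive one on formal items would match the contrast described above; a real analysis would also need enough items for the estimate to be stable.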

These patterns suggest social reasoning is "a different category" from mathematical or programming reasoning "where reasoning models typically excel." The authors explicitly position this as a domain where inference-time reasoning research has been neglected.

The ThoughtTracing algorithm itself offers a clue about what social reasoning requires that formal reasoning doesn't: hypothesis-driven Bayesian tracking of multiple evolving mental state possibilities, weighted by observation likelihood. This is structurally different from derivational chains. Social reasoning requires maintaining multiple simultaneous models of what different agents believe, not sequentially deriving conclusions from premises. The algorithm outperforms reasoning models (including o3-mini and R1) using "significantly shorter reasoning traces" — suggesting efficiency comes from the right structure, not more tokens.
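
To make the structural contrast concrete, here is a minimal sketch of weighted hypothesis tracking over a small, fixed set of belief hypotheses. It is not the ThoughtTracing implementation: every name, the toy false-belief story, and all probabilities are assumptions made up for illustration, and the resampling step of a full sequential Monte Carlo filter is left out so the example stays deterministic.

```python
from dataclasses import dataclass, field

Weights = dict[str, float]


@dataclass(frozen=True)
class Observation:
    """One story step (hypothetical structure for this sketch)."""
    description: str
    # P(what we observe at this step | the agent holds this belief)
    likelihood: Weights
    # How each belief may evolve before the observation: belief -> next-belief probabilities.
    # Beliefs not listed here are assumed to stay unchanged.
    transition: dict[str, Weights] = field(default_factory=dict)


def normalize(weights: Weights) -> Weights:
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("no hypothesis explains the observation")
    return {h: w / total for h, w in weights.items()}


def predict(weights: Weights, transition: dict[str, Weights]) -> Weights:
    """Propagate each hypothesis through its possible evolutions."""
    nxt: Weights = {}
    for h, w in weights.items():
        for h_next, p in transition.get(h, {h: 1.0}).items():
            nxt[h_next] = nxt.get(h_next, 0.0) + w * p
    return nxt


def update(weights: Weights, likelihood: Weights) -> Weights:
    """Reweight hypotheses by how well each explains the observation."""
    return normalize({h: w * likelihood.get(h, 0.0) for h, w in weights.items()})


def track(prior: Weights, observations: list[Observation]) -> list[Weights]:
    """Return the weighted belief hypotheses after each story step."""
    weights = normalize(prior)
    history = []
    for obs in observations:
        weights = update(predict(weights, obs.transition), obs.likelihood)
        history.append(weights)
    return history


# Toy false-belief story: where does Sally believe the marble is?
prior = {"basket": 0.5, "box": 0.5}
story = [
    Observation("Sally puts the marble in the basket and leaves",
                likelihood={"basket": 0.95, "box": 0.05}),
    Observation("Anne moves the marble to the box; Sally probably does not see it",
                likelihood={"basket": 1.0, "box": 1.0},  # uninformative about Sally
                transition={"basket": {"basket": 0.9, "box": 0.1}}),
    Observation("Sally returns and walks toward the basket",
                likelihood={"basket": 0.9, "box": 0.1}),
]

history = track(prior, story)
for obs, weights in zip(story, history):
    print(f"{obs.description}: {weights}")
```

The point of the sketch is the shape of the computation: several competing hypotheses about one agent's belief stay alive at every step and are reweighted by evidence, rather than a single conclusion being derived step by step. Tracking several agents would mean one such weighted set per agent.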

Inquiring lines that read this note

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How does reasoning effort affect AI theory of mind performance?

When do additional thinking tokens stop improving reasoning performance?

How much does extended thinking actually improve model reasoning ability?

How do training data properties shape reasoning capability development?

Why does reasoning training improve math but hurt knowledge tasks?

Related concepts in this collection 6

Why do correct reasoning traces contain fewer tokens?
In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges assumptions that longer reasoning traces indicate better reasoning, and raises questions about what length actually signals.
the ToM finding inverts the usual pattern: on ToM, reasoning models produce LONGER traces that DON'T help, while ThoughtTracing uses shorter traces that DO help

Why do reasoning models fail at theory of mind tasks?
Recent LLMs optimized for formal reasoning dramatically underperform at social reasoning tasks like false belief and recursive belief modeling. This explores whether reasoning optimization actively degrades the ability to track other agents' mental states.
independent confirmation from Decrypto: the formal-reasoning ↔ social-reasoning tension is robust across multiple benchmarks

Do large language models use one reasoning style or many?
Explores whether LLMs share a universal strategic reasoning approach or develop distinct styles tailored to specific game types. Understanding this matters for predicting model behavior in competitive versus cooperative scenarios.
game-based strategic reasoning is similarly fragmented; social/ToM reasoning is yet another non-transferable domain

When does explicit reasoning actually help model performance?
Explicit reasoning improves some tasks but hurts others. What determines whether step-by-step reasoning chains are beneficial or harmful for a given problem?
ToM extends the domain taxonomy: formal (reasoning helps) vs. nuanced judgment (reasoning hurts) vs. social (reasoning is irrelevant)

Do iterative refinement methods suffer from overthinking?
Iterative refinement approaches like Self-Refine structurally resemble token-level overthinking in o1-like models. Does revision across multiple inference calls reproduce the same accuracy degradation seen within single inferences?
the ToM finding is a cross-domain instance of the overthinking pattern: reasoning models allocate more tokens to social reasoning but the additional effort is unproductive, confirming that sequential token extension fails outside derivational domains

Can language models track how minds change during persuasion?
Do LLMs understand evolving mental states in persuasive dialogue, or do they only capture fixed attitudes? This explores whether models can update their reasoning as a person's beliefs shift across conversation turns.
the static/dynamic split provides a finer-grained taxonomy: social reasoning is not uniformly hard but splits into static (near-human) and dynamic (significantly worse), with CoT helping strategy prediction but not dynamic mental state tracking

Original note title

social reasoning differs categorically from formal reasoning — reasoning effort does not correlate with ToM performance
