SYNTHESIS NOTE

Why do reasoning models struggle with theory of mind tasks?

Extended reasoning training helps with math and coding but not social cognition. We explore whether reasoning models can track mental states the way they solve formal problems, and what that reveals about the structure of social reasoning.

Synthesis note · 2026-02-22 · sourced from Theory of Mind

ThoughtTracing, an algorithm for mental state tracking inspired by sequential Monte Carlo (SMC), matters less for its own performance than for what it reveals about how existing reasoning models handle theory-of-mind (ToM) tasks.

Four behavioral patterns emerge:

  1. Reasoning models don't consistently outperform vanilla LLMs prompted with chain-of-thought. The extended reasoning training that dramatically improves math and coding does not transfer to social cognition.

  2. They fail to generalize to similar scenarios. A reasoning model that correctly tracks mental states in one ToM scenario fails on structurally similar ones — suggesting pattern matching rather than a generalizable mental state tracking mechanism.

  3. They produce significantly longer reasoning traces for ToM than for factual questions. The model "knows" social reasoning is hard and allocates more tokens to it, but this effort is unproductive.

  4. Reasoning effort (output length) does not correlate with performance. More thinking does not help. This is the sharpest contrast with formal domains where longer chains generally improve accuracy up to a threshold.

These patterns suggest social reasoning is "a different category" from mathematical or programming reasoning "where reasoning models typically excel." The authors explicitly position this as a domain where inference-time reasoning research has been neglected.

The ThoughtTracing algorithm itself offers a clue about what social reasoning requires that formal reasoning doesn't: hypothesis-driven Bayesian tracking of multiple evolving mental state possibilities, weighted by observation likelihood. This is structurally different from derivational chains. Social reasoning requires maintaining multiple simultaneous models of what different agents believe, not sequentially deriving conclusions from premises. The algorithm outperforms reasoning models (including o3-mini and R1) using "significantly shorter reasoning traces" — suggesting efficiency comes from the right structure, not more tokens.
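
The tracking structure described above can be illustrated with a minimal sketch. This is not the ThoughtTracing implementation: in the algorithm as described, the hypotheses and their weights come from a language model, whereas here they are replaced by a small enumerated set of hypotheses and hand-written rules, which makes this a discrete Bayes filter rather than a sampled particle filter. The false-belief story, the function names (`propagate`, `action_likelihood`, `track`), and all probability values are assumptions chosen for illustration.

```python
from typing import Dict, List, Optional, Tuple

# An event is (new_location, agent_saw_it); None means nothing moved.
Event = Optional[Tuple[str, bool]]
# A step pairs an event with the agent's observed action (or None).
Step = Tuple[Event, Optional[str]]


def normalize(weights: Dict[str, float]) -> Dict[str, float]:
    """Rescale hypothesis weights so they sum to 1."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("all hypotheses have zero weight")
    return {h: w / total for h, w in weights.items()}


def propagate(weights: Dict[str, float], event: Event) -> Dict[str, float]:
    """Transition step: shift weight toward the new location if the object moved.

    The agent's belief usually updates when they witness the move and
    usually stays put when they do not (0.95 / 0.05 are assumed values).
    """
    if event is None:
        return dict(weights)
    location, agent_saw_it = event
    p_update = 0.95 if agent_saw_it else 0.05
    new = {h: 0.0 for h in weights}
    for h, w in weights.items():
        new[location] += w * p_update   # belief switches to the new location
        new[h] += w * (1 - p_update)    # belief stays where it was
    return new


def action_likelihood(action: Optional[str], belief: str) -> float:
    """P(observed action | belief): agents mostly search where they believe
    the object is (0.9 / 0.1 are assumed values)."""
    if action is None:
        return 1.0
    return 0.9 if action == belief else 0.1


def track(steps: List[Step], hypotheses: List[str]) -> Dict[str, float]:
    """Maintain weighted belief hypotheses across a sequence of steps."""
    weights = normalize({h: 1.0 for h in hypotheses})
    for event, action in steps:
        weights = propagate(weights, event)
        weights = normalize(
            {h: w * action_likelihood(action, h) for h, w in weights.items()}
        )
    return weights


# Hypothetical false-belief story: where does Sally think the marble is?
story: List[Step] = [
    (("basket", True), None),   # Sally puts the marble in the basket
    (("box", False), None),     # Anne moves it to the box while Sally is away
    (None, "basket"),           # Sally returns and reaches for the basket
]
posterior = track(story, ["basket", "box"])
```

With these assumed values, `posterior` puts about 0.99 of the weight on "basket": the tracker keeps both hypotheses alive throughout and lets the evidence reweight them, rather than deriving a single conclusion step by step.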

Original note title

social reasoning differs categorically from formal reasoning — reasoning effort does not correlate with ToM performance