SYNTHESIS NOTE

Can language models solve ToM benchmarks without real reasoning?

Do current theory-of-mind benchmarks actually measure mental state reasoning, or can models exploit surface patterns and distribution biases to achieve high scores? This matters because it determines whether benchmark performance indicates genuine understanding.

Synthesis note · 2026-02-22 · sourced from Theory of Mind

The dominant narrative around ToM benchmarks assumes that high performance indicates genuine mental state reasoning. The source paper systematically challenges that assumption by comparing RL-trained and SFT-trained models across multiple ToM datasets.

The key finding: SFT alone, which trains models to reproduce desired outputs from examples without optimizing the reasoning process, achieves "competitive and generalizable performance on these benchmarks, often matching or exceeding RL models in accuracy." If SFT can match RL without any explicit reasoning training, the benchmarks may not be testing what they claim to test.

Several structural vulnerabilities emerge:

Distribution bias. In ExploreToM, 22% of questions have "yes" as the correct answer while only 4% have "no." This creates a strong prior that models can exploit without understanding the content: on yes/no questions, always answering "yes" already beats the 50% chance level by a wide margin.
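The size of that prior can be checked with a few lines of arithmetic. The sketch below is illustrative, not the paper's evaluation code. It assumes the 22% and 4% figures are shares of all questions in the dataset, and it builds a hypothetical label list with the same yes/no ratio to compute the accuracy of a constant "yes" answer on the binary questions.

```python
from collections import Counter


def majority_baseline(labels):
    """Return (most common label, accuracy of always predicting it)."""
    counts = Counter(labels)
    label, n = counts.most_common(1)[0]
    return label, n / len(labels)


# Hypothetical labels mirroring the quoted shares (assumption: 22 "yes" and
# 4 "no" per 100 questions; the non-binary questions are left out here).
binary_labels = ["yes"] * 22 + ["no"] * 4

label, acc = majority_baseline(binary_labels)
print(f"always answering {label!r}: {acc:.1%} on yes/no questions")
# -> always answering 'yes': 84.6% on yes/no questions
```

Under that reading, a model needs no story comprehension to reach roughly 85% on the yes/no subset, so accuracy on those questions is only informative when reported against this baseline.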

Templated generation artifacts. The datasets may contain "exploitable patterns, such as surface-level correlations between narrative elements and answers, possibly introduced by templated generation." The logical structure of the stories, even when made more naturalistic through infilling, remains predictable.

Pretraining as hidden capability. General pretraining may equip models with reasoning skills that SFT merely activates, making it impossible to distinguish "learned ToM reasoning" from "pattern matching on familiar narrative structures."

The generalization finding is particularly striking: SFT models generalize to 4th-order ToM and infilled (more naturalistic) stories nearly as well as RL models. Making the stories more complex or more naturalistic therefore does not separate genuine reasoning from structural exploitation "if the underlying logical structure remains predictable."

This presents a Kosinski dilemma: either accept that these measures are valid (implying LLMs have ToM) or reject that LLMs understand mental states (requiring us to reevaluate the measures themselves). The SFT evidence supports the latter: the measures may be testing structural pattern recognition, not mental state inference.

Original note title

current ToM benchmarks may be solvable without explicit mental state reasoning — SFT matches RL suggesting exploitable structural patterns