SYNTHESIS NOTE

Can language models solve ToM benchmarks without real reasoning?

Do current theory-of-mind benchmarks actually measure mental state reasoning, or can models exploit surface patterns and distribution biases to achieve high scores? This matters because it determines whether benchmark performance indicates genuine understanding.

Synthesis note · 2026-02-22 · sourced from Theory of Mind

The dominant narrative around ToM benchmarks assumes that high performance indicates genuine mental state reasoning. The source paper systematically challenges that assumption by comparing models trained with reinforcement learning (RL) against models trained with supervised fine-tuning (SFT) across multiple ToM datasets.

The key finding: SFT alone — which optimizes models to reproduce desired outputs from examples without any reasoning-process optimization — achieves "competitive and generalizable performance on these benchmarks, often matching or exceeding RL models in accuracy." If SFT can match RL without any explicit reasoning training, the benchmarks may not be testing what they claim to test.

Several structural vulnerabilities emerge:

Distribution bias. In ExploreToM, 22% of questions have "yes" as the correct answer while only 4% have "no." This creates a strong prior that models can exploit without understanding the content. Taken at face value, always answering "yes" would be correct on about 85% of the yes/no questions (22 of every 26), far above the 50% expected from guessing.

Templated generation artifacts. The datasets may contain "exploitable patterns, such as surface-level correlations between narrative elements and answers, possibly introduced by templated generation." The logical structure of the stories, even when made more naturalistic through infilling, remains predictable.

Pretraining as hidden capability. General pretraining may equip models with reasoning skills that SFT merely activates, making it impossible to distinguish "learned ToM reasoning" from "pattern matching on familiar narrative structures."
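
The distribution-bias point above can be made concrete with a majority-class baseline. The sketch below is only an illustration, not the paper's analysis: the helper name majority_baseline is hypothetical, and it assumes the two quoted percentages are shares of the same ExploreToM question pool, so that yes/no questions make up 26% of the total.

```python
def majority_baseline(label_shares):
    """Return (label, accuracy) for always predicting the most common label.

    label_shares maps each answer label to its share of all questions.
    Accuracy is computed only over the questions covered by these labels.
    """
    total = sum(label_shares.values())
    label, share = max(label_shares.items(), key=lambda kv: kv[1])
    return label, share / total


# Shares quoted in the note for ExploreToM, as fractions of all questions.
# Assumption: both figures refer to the same question pool.
yes_no_shares = {"yes": 0.22, "no": 0.04}

label, accuracy = majority_baseline(yes_no_shares)
print(f"always answering {label!r}: {accuracy:.1%} on yes/no questions")
# prints: always answering 'yes': 84.6% on yes/no questions
```

A model that scores near this figure on the yes/no subset has not demonstrated anything beyond the label prior, which is why a constant-answer baseline is a useful reference point when reading benchmark accuracy.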

The generalization finding is particularly striking: SFT models generalize to 4th-order ToM and to infilled (more naturalistic) stories nearly as well as RL models. Increasing the complexity or naturalism of the stories therefore does not separate genuine reasoning from structural exploitation "if the underlying logical structure remains predictable."

This presents a Kosinski dilemma: either accept that these measures are valid (implying LLMs have ToM) or reject that LLMs understand mental states (requiring us to reevaluate the measures themselves). The SFT evidence supports the latter — the measures may be testing structural pattern recognition, not mental state inference.

Inquiring lines that read this note 27

This note is a source for the following research framings.

Why do benchmark improvements fail to reflect actual reasoning quality?

How does reasoning effort affect AI theory of mind performance?

Is model self-awareness based on genuine introspection or pattern matching?

When should tasks involve human-AI partnership versus full automation?

How do interface design choices shape consciousness attribution?

Can we use folk-psychology without committing to genuine mental states?

How does latent reasoning compare to verbalized chain-of-thought?

How can judges evaluate thinking without seeing the actual thoughts?

Why do language models reinforce false assumptions instead of correcting them?

Can hybrid Bayesian architectures fix language model theory of mind failures?

Do language models develop causal world models or rely on statistical patterns?

Can models track dynamic mental state changes better than static beliefs?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

How do structured benchmarks hide theory of mind failures in LLMs?

Can single-axis benchmarks accurately predict agent deployment success?

Why do benchmark scores not capture the true nature of AI systems?

Does AI fluency substitute for verifiable accuracy in human judgment?

Does the Turing test actually measure intelligence or just mimicry?

How do we evaluate AI systems when user perception misleads actual performance?

How do live human evaluations differ from ground-truth benchmarks?

Related concepts in this collection 3

Do foundation models learn world models or task-specific shortcuts? When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?
The ToM benchmark finding is a specific instance: models develop task-specific heuristics for ToM-shaped problems rather than genuine mental state reasoning.

Can models pass tests while missing the actual grammar? Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
The same pattern in a different domain: correct performance does not entail the intended mechanism.

Does supervised fine-tuning improve reasoning or just answers? Explores whether training models on question-answer pairs actually strengthens their reasoning quality or merely optimizes them toward correct outputs through shortcuts. This matters for deploying AI in domains like medicine where reasoning must be auditable.
SFT on ToM follows the same pattern: scores go up without reasoning quality following.
