Can language models solve ToM benchmarks without real reasoning?
Do current theory-of-mind benchmarks actually measure mental state reasoning, or can models exploit surface patterns and distribution biases to achieve high scores? This matters because it determines whether benchmark performance indicates genuine understanding.
The dominant narrative around ToM benchmarks assumes that high performance indicates genuine mental state reasoning. This paper systematically challenges that assumption by comparing RL-trained and SFT-trained models across multiple ToM datasets.
The key finding: SFT alone — which optimizes models to reproduce desired outputs from examples without any reasoning-process optimization — achieves "competitive and generalizable performance on these benchmarks, often matching or exceeding RL models in accuracy." If SFT can match RL without any explicit reasoning training, the benchmarks may not be testing what they claim to test.
Several structural vulnerabilities emerge:
Distribution bias. In ExploreToM, 22% of questions have "yes" as the correct answer while only 4% have "no." Among yes/no questions, that is roughly 85% "yes" (22 of every 26), so a model that answers "yes" to every such question scores far above the 50% chance rate without understanding the content.
Templated generation artifacts. The datasets may contain "exploitable patterns, such as surface-level correlations between narrative elements and answers, possibly introduced by templated generation." The logical structure of the stories, even when made more naturalistic through infilling, remains predictable.
Pretraining as hidden capability. General pretraining may equip models with reasoning skills that SFT merely activates, making it impossible to distinguish "learned ToM reasoning" from "pattern matching on familiar narrative structures."
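The distribution bias described above can be made concrete with a majority-class baseline. This is a minimal sketch, assuming the yes/no questions are scored as their own subset; the answer key is a hypothetical toy list built to mirror the reported 22:4 ratio, not the ExploreToM data, and `majority_baseline` is an illustrative helper name.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, n = counts.most_common(1)[0]
    return label, n / len(labels)


# Hypothetical toy answer key (assumption): 22 "yes" for every 4 "no",
# mirroring the shares reported for ExploreToM.
yes_no_answers = ["yes"] * 22 + ["no"] * 4

label, acc = majority_baseline(yes_no_answers)
print(f"always answer {label!r}: {acc:.1%} on yes/no questions")
```

Any reported accuracy on this subset is more informative when read against this constant-answer floor rather than against 50%.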
The generalization finding is particularly striking: SFT models generalize to 4th-order ToM and infilled (more naturalistic) stories nearly as well as RL models. Increasing the complexity or naturalism of the stories therefore does not separate genuine reasoning from structural exploitation "if the underlying logical structure remains predictable."
This presents a Kosinski dilemma: either accept that these measures are valid (implying that LLMs have ToM), or reject the claim that LLMs understand mental states (which requires reevaluating the measures themselves). The SFT evidence supports the latter: the measures may be testing structural pattern recognition, not mental state inference.
Inquiring lines that use this note as a source (27)
This note is a source for these synthesized inquiries.
- How should benchmarks test whether models fit algorithms or patterns?
- Can reasoning benchmarks separate logic from believability?
- Why do reasoning models perform poorly at theory of mind tasks?
- How do surface correlations between narratives and answers mislead benchmark validity?
- What distribution patterns appear across different theory-of-mind datasets?
- How does theory of mind predict success in human-AI partnerships?
- Can we use folk-psychology without committing to genuine mental states?
- How does theory of mind predict who benefits from AI collaboration?
- Why do reasoning models perform worse on theory of mind tasks?
- How can judges evaluate thinking without seeing the actual thoughts?
- Can hybrid Bayesian architectures fix language model theory of mind failures?
- What are the seven components of genuine mental state simulation?
- What language capabilities does fluency on standard benchmarks actually measure?
- Can benchmark performance distinguish surface from structural linguistic knowledge?
- Can high test performance mask a complete absence of understanding?
- Why does reasoning effort fail to improve theory of mind performance?
- Do current math benchmarks measure outcomes or rhetorical plausibility?
- Can theory of mind models generalize across structurally similar scenarios?
- Can models track dynamic mental state changes better than static beliefs?
- Why do AI benchmarks measure accuracy instead of reasoning quality?
- How do structured benchmarks hide theory of mind failures in LLMs?
- Why does additional reasoning effort not improve theory of mind performance?
- Can multi-agent metacognitive decomposition achieve human-level theory of mind?
- Why do benchmark scores not capture the true nature of AI systems?
- Does the Turing test actually measure intelligence or just mimicry?
- Why does reasoning volume fail to improve theory of mind performance?
- How do live human evaluations differ from ground-truth benchmarks?
Related concepts in this collection (3)
- Do foundation models learn world models or task-specific shortcuts?
  When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?
  Relation: the ToM benchmark finding is a specific instance: models develop task-specific heuristics for ToM-shaped problems rather than genuine mental state reasoning.
- Can models pass tests while missing the actual grammar?
  Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
  Relation: the same pattern in a different domain: correct performance does not entail the intended mechanism.
- Does supervised fine-tuning improve reasoning or just answers?
  Explores whether training models on question-answer pairs actually strengthens their reasoning quality or merely optimizes them toward correct outputs through shortcuts. This matters for deploying AI in domains like medicine where reasoning must be auditable.
  Relation: SFT on ToM follows the same pattern: scores go up without reasoning quality following.
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Do Theory of Mind Benchmarks Need Explicit Human-like Reasoning in Language Models?
- A Systematic Review on the Evaluation of Large Language Models in Theory of Mind Tasks
- Evaluating Large Language Models in Theory of Mind Tasks
- On the Reasoning Capacity of AI Models and How to Quantify It
- Do LLMs Exhibit Human-Like Reasoning? Evaluating Theory of Mind in LLMs for Open-Ended Responses
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
- Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought
- Can Large Language Models Reason and Optimize Under Constraints?
Original note title
current ToM benchmarks may be solvable without explicit mental state reasoning — SFT matches RL suggesting exploitable structural patterns