How do structured benchmarks hide theory of mind failures in LLMs?
This explores how the way we test theory of mind in LLMs — multiple-choice, templated benchmarks — can make models look like they understand other minds when they're really exploiting the test's structure.
Changing the format changes the picture. The corpus tells a fairly consistent story: structured tests reward pattern-matching, and the social reasoning they seem to measure largely disappears once the scaffolding is removed.
The clearest mechanism is that today's theory-of-mind benchmarks can often be solved without any mental-state reasoning at all. Templated artifacts and distribution biases leave a surface signal that a model can latch onto, which is why plain supervised fine-tuning matches reinforcement learning on these tasks: if real reasoning were required, the more demanding training method should win, but it doesn't (Can language models solve ToM benchmarks without real reasoning?). The tell comes when you swap structured questions for open-ended ones. On ChangeMyView and FANTOM, models that succeed on structured tasks fall back on surface-level perspective-taking strategies, and hybrid Bayesian architectures that force explicit belief-tracking pull ahead of LLM-alone approaches, suggesting the gap is built into how LLMs work, not just what they were trained on (Do large language models genuinely simulate mental states?).
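One way to make the "solvable without mental-state reasoning" claim concrete is a story-blind baseline: a predictor that never reads the scenario and only memorizes the most common answer per template. The sketch below is illustrative only. The item format, the template names, and the toy data are all hypothetical and are not taken from any benchmark cited here.

```python
from collections import Counter

# Hypothetical toy items: only a template id and the gold answer index.
# The story text is deliberately absent, so nothing here can track beliefs.
TRAIN = [
    {"template": "unexpected_transfer", "gold": 0},
    {"template": "unexpected_transfer", "gold": 0},
    {"template": "unexpected_transfer", "gold": 0},
    {"template": "unexpected_transfer", "gold": 1},
    {"template": "unexpected_contents", "gold": 1},
    {"template": "unexpected_contents", "gold": 1},
    {"template": "unexpected_contents", "gold": 1},
    {"template": "unexpected_contents", "gold": 0},
]
TEST = [
    {"template": "unexpected_transfer", "gold": 0},
    {"template": "unexpected_transfer", "gold": 0},
    {"template": "unexpected_contents", "gold": 1},
    {"template": "unexpected_contents", "gold": 0},
]


def majority_answer_by_template(items):
    """Map each template id to its most frequent gold answer index."""
    counts = {}
    for item in items:
        counts.setdefault(item["template"], Counter())[item["gold"]] += 1
    return {t: c.most_common(1)[0][0] for t, c in counts.items()}


def shortcut_accuracy(train, test):
    """Accuracy of a baseline that sees the template id but never the story."""
    lookup = majority_answer_by_template(train)
    # Unseen templates fall back to the most common answer overall.
    fallback = Counter(item["gold"] for item in train).most_common(1)[0][0]
    hits = sum(
        lookup.get(item["template"], fallback) == item["gold"] for item in test
    )
    return hits / len(test)


if __name__ == "__main__":
    # Chance on these two-option items is 0.5.
    print(shortcut_accuracy(TRAIN, TEST))
```

If a baseline like this lands well above chance on a real benchmark, a high model score on the same items is weak evidence of belief tracking, because the answer distribution alone carries much of the signal.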
What makes this more than a measurement quibble is a striking inversion: GPT-4.5 can hit the 100th percentile on social-norm prediction while reasoning-optimized models like o1 and Claude 3.7 *regress* on genuine theory of mind, scoring worse than humans and worse than simple word-embedding baselines on Decrypto tasks that test false belief, representational change, and counterfactual reasoning (Why do LLMs excel at social norms yet fail at theory of mind?; Why do reasoning models fail at theory of mind tasks?). Structured benchmarks hide this because high scores on norm-prediction and templated ToM items read as social competence; only the harder, less gameable tasks reveal that more reasoning effort does not improve social inference and may actively *degrade* it.
The deeper reason this matters is a pattern that shows up well beyond theory of mind: LLMs can articulate a concept correctly and then fail to apply it. This "Potemkin" or "split-brain" failure mode (87% accuracy explaining a principle, 64% executing it) points to functionally disconnected explanation and execution pathways rather than missing knowledge (Can LLMs understand concepts they cannot apply?; Can language models understand without actually executing correctly?). A structured benchmark tends to probe the explanation pathway (recognize the right answer), while open-ended scenarios demand the execution pathway (track a belief through a messy situation), which is exactly why one format flatters the model and the other exposes it. The related finding that LLMs accept false presuppositions they demonstrably *know* are false is the same dissociation in another costume (Why do language models accept false assumptions they know are wrong?).
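An aggregate gap such as 87% versus 64% (23 percentage points) does not by itself show that the same items are explained correctly and then applied incorrectly; that requires pairing the two outcomes per item. A minimal sketch of that bookkeeping follows, assuming each item has already been scored as a pair of booleans. The function name, the result format, and the toy data are hypothetical and do not reproduce the cited studies' methods.

```python
# Hypothetical per-item outcomes: (explained_correctly, applied_correctly).
RESULTS = (
    [(True, True)] * 5      # explains and applies
    + [(True, False)] * 3   # explains but fails to apply
    + [(False, True)] * 1   # applies without a correct explanation
    + [(False, False)] * 1  # neither
)


def dissociation_report(results):
    """Summarize explain/apply accuracy and how often explanation fails to transfer."""
    n = len(results)
    explained = sum(1 for e, _ in results if e)
    applied = sum(1 for _, a in results if a)
    # Items where the principle is stated correctly but not used correctly.
    explained_not_applied = sum(1 for e, a in results if e and not a)
    return {
        "explain_acc": explained / n,
        "apply_acc": applied / n,
        "gap": (explained - applied) / n,
        # Share of correctly explained items that were then misapplied.
        "potemkin_rate": explained_not_applied / explained if explained else 0.0,
    }


if __name__ == "__main__":
    for key, value in dissociation_report(RESULTS).items():
        print(f"{key}: {value:.3f}")
```

The conditional rate is the more informative number here: a large accuracy gap with a low conditional rate would point to different items failing on each side, not to a dissociation within items.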
Here's the unsettling kicker: this isn't unique to social reasoning. It's how "reasoning" itself can be faked. Logically *invalid* chain-of-thought exemplars perform nearly as well as valid ones on BIG-Bench Hard, meaning models often learn the *form* of reasoning rather than the inference (Does logical validity actually drive chain-of-thought gains?). If you want to design tests that can't be gamed this way, the corpus points toward borrowing cognitive science's toolkit: Marr's levels of analysis and causal probes that ask what mechanism is actually running, not just whether the output is right (Can cognitive science methods unlock how LLMs actually work?).
Sources (9 notes)
Supervised fine-tuning matches reinforcement learning performance on ToM tasks, suggesting models exploit structural vulnerabilities rather than develop genuine reasoning. Distribution biases and templated artifacts allow surface-level pattern recognition to achieve competitive generalization.
ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.
GPT-4.5 reaches the 100th percentile on social norm prediction, yet o1 and Claude 3.7 regress on theory of mind tasks like Decrypto. Open-ended scenarios expose surface-level strategies hidden by structured questions, and reasoning effort does not improve social reasoning performance.
Claude 3.7 Sonnet and o1 fail measurably at Decrypto benchmark tasks testing representational change, false belief, and counterfactual reasoning—tasks where they score worse than both humans and simple word-embedding baselines. The evidence suggests formal reasoning optimization actively degrades social reasoning capability.
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals this is not knowledge deficit but structural disconnect.
The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
Cognitive science's 70-year toolkit of behavioral probes, causal interventions, and representational analysis transfers directly to LLM interpretation. Marr's computational, algorithmic, and implementation levels reframe the problem structurally and enable layered rather than monolithic explanation.