Can theory of mind models generalize across structurally similar scenarios?
This explores whether AI systems that model other minds actually carry that skill over to new-but-similar social situations — or whether they're pattern-matching one scenario type and breaking the moment the surface details shift.
The corpus answers this more sharply than you might expect, and the headline is discouraging: when researchers tested reasoning models on theory-of-mind tasks, they found longer, more elaborate reasoning traces but *no generalization to similar scenarios* (Why do reasoning models struggle with theory of mind tasks?). The effort goes up; the transfer doesn't.
The reason transfer fails turns out to be diagnostic. Several notes converge on the idea that current theory-of-mind success is often pattern-matching wearing the costume of reasoning. Benchmarks can be solved without any real mental-state inference: supervised fine-tuning matches reinforcement learning on them, which suggests models are exploiting templated artifacts and distribution biases rather than building genuine belief representations (Can language models solve ToM benchmarks without real reasoning?). When the underlying competence is surface pattern recognition, structural similarity isn't enough; the model needs the *same* surface, not just the same shape (Do large language models genuinely simulate mental states?). This is the same failure that chain-of-thought shows more broadly: reasoning that looks fluent degrades predictably the moment you shift task, length, or format away from the training distribution (Does chain-of-thought reasoning actually generalize beyond training data?).
But here's the twist worth knowing: generalization isn't impossible. It depends on scale and architecture, not on training alone. Under reinforcement learning, 7B models develop *explicit, transferable* belief-tracking, while smaller models reach comparable accuracy through shortcut learning that doesn't transfer. The two look nearly identical on the scoreboard and only diverge when you inspect the reasoning traces (Does reinforcement learning on theory of mind collapse with model scale?). So 'can it generalize?' depends on whether the model crossed a capacity threshold where genuine belief representation becomes cheaper than memorized shortcuts.
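One way to make "identical on the scoreboard, different on transfer" concrete is to score each model twice: once on the original scenarios and once on reworded variants that keep the same structure. The sketch below is illustrative only. The `Result` record, the `"original"`/`"reworded"` labels, and the two result sets are hypothetical, made-up data, not figures from any of the cited notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    variant: str   # hypothetical label: "original" or "reworded" (same structure, new surface)
    correct: bool  # whether the model answered this item correctly


def accuracy(results):
    """Fraction of correct answers; refuses to score an empty set."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.correct for r in results) / len(results)


def transfer_gap(results):
    """Accuracy on original scenarios minus accuracy on reworded ones."""
    original = [r for r in results if r.variant == "original"]
    reworded = [r for r in results if r.variant == "reworded"]
    return accuracy(original) - accuracy(reworded)


# Made-up results for two hypothetical models with the same headline score.
shortcut_model = (
    [Result("original", True)] * 4
    + [Result("reworded", True)]
    + [Result("reworded", False)] * 3
)
tracking_model = [Result("original", True)] * 4 + [Result("reworded", True)] * 4

print(transfer_gap(shortcut_model))  # gap opens up once the surface changes
print(transfer_gap(tracking_model))  # no gap: the skill carries over
```

Both hypothetical models score perfectly on the original items, so a leaderboard that reports only that number cannot tell them apart; the gap appears only when the reworded items are scored separately.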
The deeper claim across these notes is that social reasoning is *categorically different* from formal reasoning, and that optimizing for the latter can actively damage the former. On the Decrypto benchmark's representational-change, false-belief, and counterfactual tasks, reasoning-tuned models like o1 and Claude 3.7 Sonnet score worse than older models, and worse than simple word-embedding baselines (Why do reasoning models fail at theory of mind tasks?). The proposed fix points toward architecture rather than more compute: approaches like ThoughtTracing's Bayesian hypothesis tracking, which maintain *multiple simultaneous models of a mind*, outperform sequential step-by-step derivation (Why do reasoning models struggle with theory of mind tasks?), and hybrid systems that force explicit belief tracking beat LLM-alone setups (Do large language models genuinely simulate mental states?). The structural-similarity problem is really a representation problem.
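To show what "maintaining multiple simultaneous models of a mind" can mean in the simplest case, here is a toy Bayesian tracker for a classic false-belief setup: an object is moved while the agent may or may not be watching. This is a minimal sketch under stated assumptions, not the ThoughtTracing algorithm or any cited system; the function names, the `p_witnessed` and `p_consistent` parameters, and all probabilities are hypothetical.

```python
def witness_update(belief, new_location, p_witnessed):
    """Prediction step: the object moves to new_location.

    With probability p_witnessed the agent saw the move and now believes the
    object is at new_location; otherwise the agent keeps the earlier belief.
    """
    total = sum(belief.values())
    updated = {loc: p * (1.0 - p_witnessed) for loc, p in belief.items()}
    updated[new_location] = updated.get(new_location, 0.0) + p_witnessed * total
    return updated


def action_update(belief, searched_location, p_consistent=0.9):
    """Bayes step: reweight each hypothesis by how well it explains a search.

    Assumes at least two candidate locations and that the agent searches where
    they believe the object is with probability p_consistent.
    """
    p_other = (1.0 - p_consistent) / (len(belief) - 1)
    weighted = {
        loc: p * (p_consistent if loc == searched_location else p_other)
        for loc, p in belief.items()
    }
    norm = sum(weighted.values())
    return {loc: w / norm for loc, w in weighted.items()}


# The agent saw the object placed in the basket, then left the room.
belief = {"basket": 1.0, "box": 0.0}

# The object is moved to the box; assume a 20% chance the agent noticed.
after_move = witness_update(belief, "box", p_witnessed=0.2)

# The agent returns and reaches for the basket, which favors the
# "agent still believes basket" hypothesis.
posterior = action_update(after_move, "basket")

actual_location = "box"
predicted_belief = max(posterior, key=posterior.get)
print(after_move, posterior, predicted_belief != actual_location)
```

The point of the design is that both hypotheses stay alive with explicit weights, so the tracker can represent a belief that differs from the true state of the world instead of committing to a single derivation early.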
It is worth noticing where transfer *does* show up. When models are fine-tuned directly on human psychology-experiment data, they become generalist cognitive predictors that transfer across decision tasks without task-specific design (Can language models learn to model human decision making?). But that is modeling aggregate human behavior, not tracking an individual's evolving mental state, which models still fail at over time (Can models recognize how individuals reason differently?). The reader leaving here should know the surprising part: what prevents generalization across similar scenarios isn't a lack of reasoning effort. The easiest way to pass a theory-of-mind test is to not do theory of mind at all, and only larger models with the right architecture are forced past that shortcut.
Sources (8 notes)
Reasoning models fail to outperform vanilla LLMs on theory of mind tasks, produce longer but unhelpful traces, and show no generalization to similar scenarios. ThoughtTracing's success using shorter Bayesian hypothesis tracking suggests social reasoning demands simultaneous multiple-model maintenance, not sequential derivation.
Supervised fine-tuning matches reinforcement learning performance on ToM tasks, suggesting models exploit structural vulnerabilities rather than develop genuine reasoning. Distribution biases and templated artifacts allow surface-level pattern recognition to achieve competitive generalization.
ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
7B models develop explicit, transferable belief-tracking under RL, while smaller models achieve comparable accuracy through shortcut learning that lacks interpretable reasoning traces. The mismatch between accuracy and reasoning quality is invisible without inspecting step-by-step outputs.
Claude 3.7 Sonnet and o1 fail measurably at Decrypto benchmark tasks testing representational change, false belief, and counterfactual reasoning—tasks where they score worse than both humans and simple word-embedding baselines. The evidence suggests formal reasoning optimization actively degrades social reasoning capability.
LLMs finetuned on psychology experiment data predict human behavior more accurately than theory-driven models in decision tasks, capture individual differences in their embeddings, and transfer learning across tasks without task-specific design.
LLMs struggle to anchor reasoning in temporal gameplay and adapt to evolving strategies. GPT-4o relies on surface lexical cues while DeepSeek-R1 shows early promise, but dynamic style adaptation remains largely insufficient across all models tested.