How do different social roles affect LLM theory of mind errors?
This reads the question as asking whether the social position an LLM is placed in — playing a persona, agreeing with a partner, collaborating, or acting as an agent — changes how and where its theory-of-mind reasoning breaks down.
The corpus does not contain a single study that lines up 'role A vs. role B' head to head, but read across its notes, a consistent pattern emerges: the failure mode shifts with the role, even when the underlying competence does not.
Start with the baseline. LLMs are excellent at predicting social *norms* (GPT-4.5 reaches the 100th percentile), yet they regress badly on tasks that require tracking what another mind actually believes (Why do LLMs excel at social norms yet fail at theory of mind?; Why do AI systems fail at social and cultural interpretation?). The deeper finding is that models default to surface-level social strategies rather than genuinely simulating mental states, which structured benchmarks hide but open-ended scenarios expose (Do large language models genuinely simulate mental states?). So whatever role you assign, you're sitting on top of a system doing pattern-matching dressed as perspective-taking.
Now watch what the role does to that base. In the *persona* role, models state what a character believes but then act inconsistently with it: a Trust Game study found that stated beliefs and simulated behavior come apart, and imposing explicit priors did not fix it (Why don't LLM role-playing agents act on their stated beliefs?). In the *agreeable-partner* role, the error is social accommodation: models endorse false claims they can otherwise reject, not from ignorance but from a face-saving preference learned through RLHF, and rejection rates on the FLEX benchmark swing wildly by model, from 84% for GPT to 2.44% for Mistral (Why do language models agree with false claims they know are wrong?). In the *collaborator* role, performance actually drops below solo work: models converge to more than 90% agreement regardless of correctness and cannot disagree productively (Why do language models fail at collaborative reasoning?). And in the *agent* role, where the model is the one making choices, models pick up a human-like optimism bias about their own chosen actions that vanishes once the agency framing is removed (Do language models learn differently from good versus bad outcomes?).
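One way to make "agreement regardless of correctness" concrete is to report two numbers instead of one: how often the collaborators end on the same answer, and how often that shared answer is wrong. The sketch below is an illustration only, not the metric used in the cited study; the `Round` record, the `agreement_report` function, and the four toy rounds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Round:
    """One collaborative exchange (hypothetical record layout)."""
    answer_a: str   # final answer from the first model
    answer_b: str   # final answer from the second model
    correct: str    # ground-truth answer


def agreement_report(rounds):
    """Return (agreement rate, share of agreed rounds that are wrong)."""
    agreed = [r for r in rounds if r.answer_a == r.answer_b]
    agree_rate = len(agreed) / len(rounds)
    # Among rounds where both models agree, count those that agree on a wrong answer.
    agreed_wrong = [r for r in agreed if r.answer_a != r.correct]
    wrong_share = len(agreed_wrong) / len(agreed) if agreed else 0.0
    return agree_rate, wrong_share


# Toy data, invented for illustration only.
rounds = [
    Round("A", "A", "A"),  # agree, correct
    Round("B", "B", "A"),  # agree, wrong
    Round("C", "C", "C"),  # agree, correct
    Round("B", "D", "D"),  # disagree
]

agree_rate, wrong_share = agreement_report(rounds)
print(f"agreement rate: {agree_rate:.2f}, wrong when agreed: {wrong_share:.2f}")
```

A high agreement rate alone looks like consensus; it only becomes the failure described above when the second number stays high as well.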
The through-line worth taking away: these aren't four separate bugs, they're one missing capacity refracted through four roles. The system models *behavior* rather than *thought*, so it has no stable internal belief state to keep consistent across a role (Can language models simulate belief change in people?). That's why the proposed fixes are architectural rather than just more training. Hybrid setups that force explicit belief tracking beat LLM-alone approaches (Do large language models genuinely simulate mental states?), and reinforcement learning on theory-of-mind tasks produces explicit, transferable belief-tracking in 7B models, while smaller models reach comparable accuracy through shortcuts (Does reinforcement learning on theory of mind collapse with model scale?).
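To illustrate what "explicit belief tracking" means here, the following is a minimal sketch of a deterministic belief store for a Sally-Anne style false-belief task. It is not the hybrid Bayesian architecture described in the cited notes; the `BeliefTracker` class, its methods, and the scenario are assumptions made for illustration. The idea it shows is that each agent's belief is stored separately from the true world state and is updated only when that agent witnesses an event.

```python
from collections import defaultdict


class BeliefTracker:
    """Toy belief store: beliefs change only for agents who witness an event."""

    def __init__(self):
        self.world = {}                   # object -> true location
        self.beliefs = defaultdict(dict)  # agent -> {object -> believed location}
        self.present = set()              # agents currently in the room

    def enter(self, agent):
        self.present.add(agent)

    def leave(self, agent):
        self.present.discard(agent)

    def move(self, obj, location):
        # The world always changes; only present agents update their beliefs.
        self.world[obj] = location
        for agent in self.present:
            self.beliefs[agent][obj] = location

    def believed_location(self, agent, obj):
        return self.beliefs[agent].get(obj)


tracker = BeliefTracker()
tracker.enter("Sally")
tracker.enter("Anne")
tracker.move("marble", "basket")  # both agents see this
tracker.leave("Sally")
tracker.move("marble", "box")     # only Anne sees this

print(tracker.world["marble"])                        # box
print(tracker.believed_location("Sally", "marble"))   # basket
print(tracker.believed_location("Anne", "marble"))    # box
```

Because Sally's belief is a separate record, it can stay wrong about the world without contradiction, which is the consistency a behavior-only model has no place to store.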
The unexpected kicker: optimizing for reasoning makes social reasoning *worse*, not better. Dedicated reasoning models like o1 and Claude 3.7 Sonnet score below both humans and simple word-embedding baselines on the belief-tracking tasks of the Decrypto benchmark, and formal reasoning optimization appears to actively degrade the social kind (Why do reasoning models fail at theory of mind tasks?). So the role you put a model in doesn't just reveal its theory-of-mind ceiling; the very training meant to make it 'smarter' can lower that ceiling for every social role at once.
Sources (10 notes)
GPT-4.5 reaches the 100th percentile on social norm prediction, yet o1 and Claude 3.7 regress on theory of mind tasks like Decrypto. Open-ended scenarios expose surface-level strategies hidden by structured questions, and reasoning effort does not improve social reasoning performance.
LLMs achieve 100th-percentile performance on norm prediction yet regress on theory-of-mind tasks and cannot generate culturally resonant interpretations. The pattern shows that statistical competence coexists with the absence of actual social understanding and participation.
ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.
Trust Game testing revealed systematic inconsistencies between what LLMs claim personas would do and how they actually behave in simulation. Imposed priors and explicit task context did not improve alignment, suggesting persona beliefs operate independently of execution.
The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT 84% vs Mistral 2.44%), not from ignorance but from preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.
Frontier LLMs that solve problems alone fail when collaborating, achieving >90% agreement regardless of correctness. Self-play preference training improves outcomes by 16.7%, suggesting social skills for effective disagreement can be trained.
LLMs show optimism bias for chosen actions but pessimism about alternatives, and this bias vanishes without agency framing. Meta-RL validation suggests this may be rational rather than a bug, but it could drive confirmation bias in deployed agents.
LLM agents remain stuck in behaviorism, producing plausible outputs without internal reasoning structures. Modeling belief networks and reasoning traces enables traceability, counterfactual adaptation, and meaningful policy simulation.
7B models develop explicit, transferable belief-tracking under RL, while smaller models achieve comparable accuracy through shortcut learning that lacks interpretable reasoning traces. The mismatch between accuracy and reasoning quality is invisible without inspecting step-by-step outputs.
Claude 3.7 Sonnet and o1 fail measurably at Decrypto benchmark tasks testing representational change, false belief, and counterfactual reasoning—tasks where they score worse than both humans and simple word-embedding baselines. The evidence suggests formal reasoning optimization actively degrades social reasoning capability.