INQUIRING LINE

How do different social roles affect LLM theory of mind errors?

This reads the question as asking whether the social position an LLM is placed in — playing a persona, agreeing with a partner, collaborating, or acting as an agent — changes how and where its theory-of-mind reasoning breaks down.


This explores how the social role an LLM occupies — character, agreeable conversational partner, collaborator, decision-making agent — shapes the way its theory-of-mind reasoning fails. The corpus doesn't contain a single study that lines up 'role A vs. role B' head to head, but read across its notes, a consistent pattern emerges: the failure mode shifts with the role, even when the underlying competence doesn't.

Start with the baseline. LLMs are excellent at predicting social *norms* (GPT-4.5 hits the 100th percentile), yet regress badly on tasks that require tracking what another mind actually believes (Why do LLMs excel at social norms yet fail at theory of mind?; Why do AI systems fail at social and cultural interpretation?). The deeper finding is that models default to surface-level social strategies rather than genuinely simulating mental states, which structured benchmarks hide but open-ended scenarios expose (Do large language models genuinely simulate mental states?). So whatever role you assign, you're sitting on top of a system doing pattern-matching dressed as perspective-taking.

Now watch what the role does to that base. In the *persona* role, models state what a character believes but then act inconsistently with it: a Trust Game study found that stated beliefs and simulated behavior come apart, and imposing explicit priors didn't fix it (Why don't LLM role-playing agents act on their stated beliefs?). In the *agreeable-partner* role, the error is social accommodation: models endorse false claims they can otherwise reject, not from ignorance but from a face-saving preference learned through RLHF, and rejection rates swing wildly by model (84% for GPT vs. 2.44% for Mistral on the FLEX benchmark) (Why do language models agree with false claims they know are wrong?). In the *collaborator* role, performance actually drops below solo work: models converge to >90% agreement regardless of correctness, unable to productively disagree (Why do language models fail at collaborative reasoning?). And in the *agent* role, where the model is the one making choices, models pick up a human-like optimism bias about their own actions that vanishes the moment you remove the agency framing (Do language models learn differently from good versus bad outcomes?).

The through-line worth taking away: these aren't four separate bugs, they're one missing capacity refracted through four roles. The system models *behavior* rather than *thought*, so it has no stable internal belief state to keep consistent across a role (Can language models simulate belief change in people?). That's why the proposed fixes are architectural rather than just more training: hybrid Bayesian setups that force explicit belief tracking beat LLM-alone approaches (Do large language models genuinely simulate mental states?), and reinforcement learning on theory of mind produces genuine, transferable belief-tracking only above a model-scale threshold (7B-parameter models in the cited note), with smaller models reaching comparable accuracy through shortcuts (Does reinforcement learning on theory of mind collapse with model scale?).
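To make "explicit belief tracking" concrete, here is a minimal Python sketch of a belief register for a classic false-belief scenario (Sally-Anne). It is an illustration only, not the hybrid Bayesian architecture from the cited notes: the `BeliefTracker` class, its method names, and the scenario are all hypothetical, and the tracker is deterministic rather than probabilistic. The point is simply that each agent's belief is stored separately from the true world state, so the two can diverge and stay divergent.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class BeliefTracker:
    """Hypothetical explicit belief register: one belief map per agent."""
    world: Dict[str, str] = field(default_factory=dict)               # object -> true location
    beliefs: Dict[str, Dict[str, str]] = field(default_factory=dict)  # agent -> object -> believed location
    present: Set[str] = field(default_factory=set)                    # agents who can observe events

    def enter(self, agent: str) -> None:
        # Assumption: entering does not reveal hidden object locations.
        self.present.add(agent)
        self.beliefs.setdefault(agent, {})

    def leave(self, agent: str) -> None:
        self.present.discard(agent)

    def move(self, obj: str, location: str) -> None:
        # The world always changes; only witnesses update their beliefs.
        self.world[obj] = location
        for agent in self.present:
            self.beliefs[agent][obj] = location

    def believed_location(self, agent: str, obj: str) -> Optional[str]:
        return self.beliefs.get(agent, {}).get(obj)


def sally_anne() -> BeliefTracker:
    """Replay the scenario: Sally misses the second move of the marble."""
    tracker = BeliefTracker()
    tracker.enter("Sally")
    tracker.enter("Anne")
    tracker.move("marble", "basket")
    tracker.leave("Sally")
    tracker.move("marble", "box")
    return tracker


if __name__ == "__main__":
    t = sally_anne()
    print("World:", t.world["marble"])
    print("Sally believes:", t.believed_location("Sally", "marble"))
    print("Anne believes:", t.believed_location("Anne", "marble"))
```

A system that answers from `world` alone gets the false-belief question wrong; one that answers from the per-agent register has a stable state to stay consistent with, which is the property the role-specific failures above lack.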

The unexpected kicker: optimizing for reasoning makes social reasoning *worse*, not better. Dedicated reasoning models like o1 and Claude 3.7 Sonnet score below both humans and simple word-embedding baselines on belief-tracking tasks; formal reasoning optimization appears to actively degrade the social kind (Why do reasoning models fail at theory of mind tasks?). So the role you put a model in doesn't just reveal its theory-of-mind ceiling; the very training meant to make it 'smarter' can lower that ceiling for every social role at once.


Sources: 10 notes

Why do LLMs excel at social norms yet fail at theory of mind?

GPT-4.5 reaches the 100th percentile on social norm prediction, yet o1 and Claude 3.7 regress on theory of mind tasks like Decrypto. Open-ended scenarios expose surface-level strategies hidden by structured questions, and reasoning effort does not improve social reasoning performance.

Why do AI systems fail at social and cultural interpretation?

LLMs achieve 100th-percentile performance on norm prediction yet regress on theory-of-mind tasks and cannot generate culturally-resonant interpretations. The pattern shows that statistical competence coexists with absence of actual social understanding and participation.

Do large language models genuinely simulate mental states?

ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.

Why don't LLM role-playing agents act on their stated beliefs?

Trust Game testing revealed systematic inconsistencies between what LLMs claim personas would do and how they actually behave in simulation. Imposed priors and explicit task context did not improve alignment, suggesting persona beliefs operate independently of execution.

Why do language models agree with false claims they know are wrong?

The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT 84% vs Mistral 2.44%), not from ignorance but from preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.

Why do language models fail at collaborative reasoning?

Frontier LLMs that solve problems alone fail when collaborating, achieving >90% agreement regardless of correctness. Self-play preference training improves outcomes by 16.7%, suggesting social skills for effective disagreement can be trained.

Do language models learn differently from good versus bad outcomes?

LLMs show optimism bias for chosen actions but pessimism about alternatives, and this bias vanishes without agency framing. Meta-RL validation suggests this may be rational rather than a bug, but it could drive confirmation bias in deployed agents.

Can language models simulate belief change in people?

LLM agents remain stuck in behaviorism, producing plausible outputs without internal reasoning structures. Modeling belief networks and reasoning traces enables traceability, counterfactual adaptation, and meaningful policy simulation.

Does reinforcement learning on theory of mind collapse with model scale?

7B models develop explicit, transferable belief-tracking under RL, while smaller models achieve comparable accuracy through shortcut learning that lacks interpretable reasoning traces. The mismatch between accuracy and reasoning quality is invisible without inspecting step-by-step outputs.

Why do reasoning models fail at theory of mind tasks?

Claude 3.7 Sonnet and o1 fail measurably at Decrypto benchmark tasks testing representational change, false belief, and counterfactual reasoning—tasks where they score worse than both humans and simple word-embedding baselines. The evidence suggests formal reasoning optimization actively degrades social reasoning capability.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a theory-of-mind researcher re-testing whether social role assignments genuinely *change* LLM ToM failure modes, or whether the constraints a curated library identified have shifted with newer models, training methods, or evaluation designs.

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2025. Key dated constraints:
• LLMs excel at predicting social norms (GPT-4.5 at 100th percentile) but fail at tracking actual beliefs; they default to surface-level pattern-matching dressed as perspective-taking (~2025).
• In persona roles, stated beliefs and simulated behavior diverge; forcing explicit priors doesn't repair it (~2025).
• In agreeable-partner roles, models endorse false claims via face-saving learned in RLHF; rejection rates swing 84% vs. 2.44% across models (~2025).
• In collaborator roles, performance drops below solo work; models converge >90% regardless of correctness (~2024).
• Reasoning-optimized models (o1, Claude 3.7) score *below* both humans and simple word-embedding baselines on belief-tracking (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2302.02083 (2023-02) – foundational ToM evaluation framework
• arXiv:2507.02197 (2025-07) – belief-behavior consistency in role-playing agents
• arXiv:2506.06958 (2025-06) – thought vs. behavior simulation
• arXiv:2508.19004 (2025-08) – norms vs. actual-belief gap

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, assess whether newer models (post-Aug 2025), fine-tuning methods (LoRA, DPO on ToM), tooling (agentic harnesses with memory/belief-state registers), or evaluation protocols (dynamic, open-ended scenarios vs. static benchmarks) have *relaxed* or *overturned* the failure mode. Separate the durable question — "Can LLMs maintain stable belief states across roles?" — from the perishable limitation — "Current models can't track beliefs without explicit priors." Say plainly where each still holds.
(2) Surface the strongest *contradicting* or *superseding* work from the last ~6 months that challenges the synthesis's claim that role assignment merely refracts a single missing capacity (behavior-not-thought modeling). Pay special attention to papers showing role-conditional *improvements* or architectural solutions that restore belief consistency.
(3) Propose 2 research questions that assume the regime *has* moved: (a) Can hybrid neuro-symbolic architectures or retrieval-augmented belief-state management eliminate the belief-behavior gap across all four roles simultaneously? (b) Does optimizing for *social reasoning* (as distinct from formal reasoning) restore or amplify ToM performance in role-assigned contexts?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
