SYNTHESIS NOTE
Psychology, Society, and Alignment

What breaks when humans and AI models misunderstand each other?

Explores whether misalignment in mutual theory of mind between humans and AI creates only communication problems or produces material consequences in autonomous action and collaboration.

Synthesis note · 2026-02-22 · sourced from Theory of Mind

Design fictions that explore an operationalized mutual theory of mind (MToM) between humans and AI agents suggest that theory of mind (ToM) in human-AI interaction runs in both directions rather than one. Three layers of mutual modeling must be maintained simultaneously:

  1. Human's understanding of what the AI knows about them. Users need to interrogate the AI's theory of mind model — "what does it know about me?" — and this knowledge shapes how they interact with the system.

  2. AI's representation of the human's mental model of the AI. The AI must model not just the human but the human's model of the AI's capabilities. Problems arise "when a human's mental model of an AI's capabilities doesn't align with the AI's actual capabilities" — people misapply AI to domains it wasn't designed for.

  3. Bidirectional updating through interaction. Both parties must update their models as interaction progresses. The AI learns about the user through both "chat space" (conversation) and "artifact space" (work products). The human calibrates their trust through explanations of what the AI did and why.

When these layers misalign, the consequences are material, not just communicative. Design fictions show AI agents acting on users' behalf based on predictive models — writing code, responding to messages, executing workflows. A faulty MToM doesn't just cause miscommunication; it causes incorrect autonomous action.
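The three layers and their link to autonomous action can be sketched as a small data structure. This is an illustrative sketch only, not something taken from the design fictions: the class `MutualModel`, its field names, the `ACT_THRESHOLD` value, and the rule of asking the user when confidence is low are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical cutoff: below this confidence the agent asks instead of acting.
ACT_THRESHOLD = 0.7


@dataclass
class MutualModel:
    # AI's beliefs about the user: preference name -> (value, confidence).
    ai_model_of_user: dict = field(default_factory=dict)
    # Layer 1: beliefs that have been disclosed to the user for inspection.
    disclosed_to_user: set = field(default_factory=set)
    # Layer 2: capabilities the user appears to expect from the AI.
    expected_capabilities: set = field(default_factory=set)
    # Capabilities the AI actually has.
    actual_capabilities: set = field(default_factory=set)

    def observe(self, key, value, confidence, channel):
        """Layer 3: update the AI's model from 'chat' or 'artifact' evidence."""
        if channel not in ("chat", "artifact"):
            raise ValueError(f"unknown channel: {channel}")
        current = self.ai_model_of_user.get(key)
        # Keep the more confident belief (a simplification of evidence merging).
        if current is None or confidence >= current[1]:
            self.ai_model_of_user[key] = (value, confidence)

    def explain(self, key):
        """Disclose a belief so the user can inspect and correct it."""
        self.disclosed_to_user.add(key)
        return self.ai_model_of_user.get(key)

    def capability_gaps(self):
        """Capabilities the user expects but the AI lacks."""
        return self.expected_capabilities - self.actual_capabilities

    def decide(self, key, capability):
        """Return 'act', 'ask', or 'decline' for a proposed autonomous action."""
        if capability not in self.actual_capabilities:
            return "decline"
        belief = self.ai_model_of_user.get(key)
        if belief is None or belief[1] < ACT_THRESHOLD:
            return "ask"
        return "act"


model = MutualModel(
    actual_capabilities={"draft_reply"},
    expected_capabilities={"draft_reply", "send_payment"},
)
model.observe("reply_tone", "formal", 0.4, "chat")
first = model.decide("reply_tone", "draft_reply")     # weak evidence
model.observe("reply_tone", "formal", 0.9, "artifact")
second = model.decide("reply_tone", "draft_reply")    # stronger evidence
third = model.decide("reply_tone", "send_payment")    # expected but unsupported
print(first, second, third)
```

In this sketch, a weakly supported belief leads the agent to ask rather than act, and a capability the user expects but the AI lacks is declined and surfaced through `capability_gaps()`. Without such checks, the same faulty model would feed directly into an action taken on the user's behalf.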

The design implications are specific:

The wider adoption scenario (MToM within an organization) shows how these dynamics scale: MToM can "reshape work practices by streamlining communications and delivering the right information to the right people at the right time" — but every efficiency gain depends on model accuracy, and every inaccuracy has downstream consequences.

Empirical evidence from a Bayesian item response theory (IRT) study of human-AI synergy (n=667) gives MToM quantitative grounding: theory of mind predicts collaborative performance with AI but not solo performance. Users with stronger perspective-taking collaborate more effectively, and moment-to-moment fluctuations in ToM (not just stable individual differences) influence AI response quality within sessions. This supports the view that MToM is not merely a design-fiction aspiration but a measurable cognitive mechanism with quantifiable effects on collaboration outcomes. See "Does theory of mind predict who thrives in AI collaboration?"

Original note title

mutual theory of mind between humans and AI requires bidirectional model updating and creates material consequences from misalignment