SYNTHESIS NOTE

What breaks when humans and AI models misunderstand each other?

Explores whether misalignment in mutual theory of mind between humans and AI creates only communication problems or produces material consequences in autonomous action and collaboration.

Synthesis note · 2026-02-22 · sourced from Theory of Mind

Design fictions probing operationalized mutual theory of mind (MToM) between humans and AI agents reveal that ToM in human-AI interaction is not a one-directional problem. Three layers of mutual modeling must be maintained simultaneously:

The human's understanding of what the AI knows about them. Users need to interrogate the AI's theory of mind model ("what does it know about me?"), and what they learn shapes how they interact with the system.
The AI's representation of the human's mental model of the AI. The AI must model not just the human but also the human's model of the AI's capabilities. Problems arise "when a human's mental model of an AI's capabilities doesn't align with the AI's actual capabilities": people then apply the AI to domains it wasn't designed for.
Bidirectional updating through interaction. Both parties must update their models as the interaction progresses. The AI learns about the user through both "chat space" (conversation) and "artifact space" (work products). The human calibrates trust through explanations of what the AI did and why.
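The three layers can be pictured as one small data structure. The following is a minimal sketch, not an implementation described in the source: the class `MutualModel`, its field names, and the `EVIDENCE_SPACES` constant are all hypothetical, and representing beliefs as plain key-value strings is an assumption made for brevity.

```python
from dataclasses import dataclass, field

# Hypothetical names throughout; the note does not specify an implementation.
EVIDENCE_SPACES = ("chat", "artifact")


@dataclass
class MutualModel:
    # What the AI believes about the user; layer 1 is the user's ability to inspect it.
    beliefs_about_user: dict[str, str] = field(default_factory=dict)
    # Layer 2: capabilities the AI estimates the human expects it to have.
    expected_capabilities: set[str] = field(default_factory=set)
    # The AI's actual capabilities, compared against layer 2.
    actual_capabilities: set[str] = field(default_factory=set)
    # Layer 3, human side: explanations the human can use to calibrate trust.
    explanations: list[str] = field(default_factory=list)

    def what_do_you_know_about_me(self) -> dict[str, str]:
        """Layer 1: return a copy so inspection cannot silently edit the model."""
        return dict(self.beliefs_about_user)

    def capability_mismatch(self) -> set[str]:
        """Layer 2: capabilities the human expects that the AI lacks."""
        return self.expected_capabilities - self.actual_capabilities

    def observe(self, space: str, key: str, value: str) -> None:
        """Layer 3, AI side: update the user model from chat or artifact evidence."""
        if space not in EVIDENCE_SPACES:
            raise ValueError(f"unknown evidence space: {space}")
        self.beliefs_about_user[key] = value

    def explain(self, what: str, why: str) -> None:
        """Layer 3, human side: record what the AI did and why."""
        self.explanations.append(f"{what} (reason: {why})")
```

In this sketch, a non-empty result from `capability_mismatch()` corresponds to the misalignment the note describes: the human expects something the AI cannot actually do.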

When these layers misalign, the consequences are material, not just communicative. Design fictions show AI agents acting on users' behalf based on predictive models — writing code, responding to messages, executing workflows. A faulty MToM doesn't just cause miscommunication; it causes incorrect autonomous action.

The design implications are specific:

Users need signifiers of model presence: indicators that the AI is building and using a model of them.
Users need the ability to query and correct the AI's user model.
When MToM-infused AI acts on the user's behalf, recipients need signifiers that they're interacting with an AI, not the human.
Explanations are crucial for trust calibration: both what the system did and why.
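The four implications above can be sketched as a small interface. This is an illustrative sketch under stated assumptions, not a design from the source: `UserModel`, `Belief`, the method names, and the wording of `AI_DISCLOSURE` are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical disclosure wording shown to recipients of AI-sent messages.
AI_DISCLOSURE = "[Sent by an AI assistant on behalf of {user}]"


@dataclass
class Belief:
    value: str
    source: str  # "inferred" by the AI or "user-corrected"


@dataclass
class UserModel:
    user: str
    beliefs: dict[str, Belief] = field(default_factory=dict)
    action_log: list[dict[str, str]] = field(default_factory=list)

    def presence_signifier(self) -> str:
        """Signifier of model presence: tell the user a model of them is in use."""
        return f"Modeling {self.user}: {len(self.beliefs)} belief(s) in use"

    def infer(self, key: str, value: str) -> None:
        """Record a belief the AI inferred about the user."""
        self.beliefs[key] = Belief(value, "inferred")

    def query(self) -> dict[str, str]:
        """Let the user inspect each belief along with where it came from."""
        return {k: f"{b.value} ({b.source})" for k, b in self.beliefs.items()}

    def correct(self, key: str, value: str) -> None:
        """Let the user overwrite a belief and mark it as user-corrected."""
        self.beliefs[key] = Belief(value, "user-corrected")

    def send_on_behalf(self, body: str, why: str) -> str:
        """Prefix an AI disclosure for the recipient and log what was done and why."""
        message = f"{AI_DISCLOSURE.format(user=self.user)}\n{body}"
        self.action_log.append({"what": f"sent message: {body}", "why": why})
        return message
```

Storing the source of each belief is a design choice of this sketch: it lets `query()` show the user which entries the AI guessed and which they have already corrected.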

The wider adoption scenario (MToM within an organization) shows how these dynamics scale: MToM can "reshape work practices by streamlining communications and delivering the right information to the right people at the right time" — but every efficiency gain depends on model accuracy, and every inaccuracy has downstream consequences.

Empirical evidence from a Bayesian item response theory (IRT) study of human-AI synergy (n=667) gives MToM's importance quantitative grounding: theory of mind predicts collaborative performance with AI but not solo performance. Users with stronger perspective-taking collaborate more effectively, and moment-to-moment fluctuations in ToM, not just stable individual differences, influence AI response quality within sessions. This confirms that MToM is not merely a design-fiction aspiration but a measurable cognitive mechanism with quantifiable effects on collaboration outcomes. See "Does theory of mind predict who thrives in AI collaboration?"

Inquiring lines that read this note (22)

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

What mechanisms enable AI systems to generate and spread false beliefs?

How do false agreements emerge differently from genuine bilateral convergence?

How does latent reasoning compare to verbalized chain-of-thought?

How does anomalous knowledge state connect to the gulf of envisioning?

Is embodied interaction necessary for language meaning and genuine agency?

When both anthropomorphism and anthropomimesis occur together, which should we address first?

How do multi-agent systems achieve genuine cooperation and reasoning?

How do goal representations differ between human and AI teams?

When should tasks involve human-AI partnership versus full automation?

Can language model hallucination be prevented or only managed?

What does the distributed cognition framework reveal about AI hallucination versus human-AI co-construction?

How do language models establish social grounding in human dialogue?

Why do conventional mental models fail when applied to AI interaction?

How can AI alignment serve diverse human preferences at scale?

Can bidirectional model updating between humans and AI reduce misalignment?

Can AI-generated outputs constitute genuine knowledge or valid claims?

Can AI systems develop genuine social understanding without embodiment?

How can language models sustain linguistic synchrony and intersubjectivity during dialogue?

What happens when comfortable AI interactions replace the productive friction of disagreement?

How does reasoning effort affect AI theory of mind performance?

Can multi-agent metacognitive decomposition achieve human-level theory of mind?

Related concepts in this collection (3)

Do language models actually build shared understanding in conversation? When LLMs respond fluently to prompts, do they perform the communicative work humans do to establish mutual understanding? Research suggests they skip the grounding acts that make dialogue reliable.
MToM is the design-level solution: if models presume rather than build common ground, the architecture must externalize the common-ground-building process
Do users worldwide trust confident AI outputs even when wrong? Explores whether the tendency to over-rely on confident language model outputs transcends language and culture. Understanding this pattern is critical for designing safer human-AI interaction across diverse linguistic contexts.
MToM misalignment is amplified by overreliance: users who don't interrogate the AI's model of them assume it's correct
Why do speakers need to actively calibrate shared reference? Explores whether using the same words guarantees speakers mean the same thing. Investigates how referential grounding differs across people and what collaborative work is needed to establish true understanding.
MToM operationalizes calibrated shared reference in the human-AI context

Original note title

mutual theory of mind between humans and AI requires bidirectional model updating and creates material consequences from misalignment
