INQUIRING LINE

Why do role-playing agents show belief-behavior inconsistency in their outputs?

This explores why agents prompted to play a character will say their persona believes one thing, then act in a way that contradicts it — and what the corpus thinks is actually going on underneath.


The sharpest evidence comes from Trust Game experiments in which LLMs were asked to state what a persona would do and then simulated actually doing it; the two systematically diverged. Imposing priors or spelling out the task didn't close the gap, which points to a striking conclusion: in these systems, stated belief and executed behavior are produced by *different processes* rather than one flowing from the other (see: Why don't LLM role-playing agents act on their stated beliefs?). The belief isn't a cause of the action; it's just more generated text.
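As a rough illustration of how such a stated-versus-executed gap could be quantified, here is a minimal sketch. The function name, the 0 to 10 transfer scale, and the numbers are all hypothetical and are not taken from the Trust Game study; the sketch only shows the shape of the comparison.

```python
from statistics import mean

def belief_behavior_gap(stated, acted):
    """Return (mean signed gap, mean absolute gap) for paired amounts.

    stated: what the persona says it would send (hypothetical scale).
    acted:  what the same persona sends when the game is simulated.
    """
    if len(stated) != len(acted) or not stated:
        raise ValueError("need equal-length, non-empty sequences")
    diffs = [a - s for s, a in zip(stated, acted)]
    return mean(diffs), mean([abs(d) for d in diffs])

# Made-up example data, for illustration only.
stated = [8, 7, 9, 6]
acted = [5, 7, 4, 6]
signed, absolute = belief_behavior_gap(stated, acted)
print(signed, absolute)
```

A signed gap far from zero would indicate a systematic direction to the divergence (here, acting less generously than stated), while the absolute gap captures divergence regardless of direction.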

That reframing is the key. One influential view holds that a dialogue agent isn't a mind with beliefs at all; it's a character-text generator. The prompt sets up a character, and the model produces continuations that *sound* like that character, so folk-psychology words like 'believes' apply to the simulated persona, not to anything stable inside the system (see: Should we treat dialogue agents as role-playing characters?). If 'belief' is just well-fitting surface text, there's no machinery forcing later behavior to honor it. A related finding shows how shallow that grounding is: persona prompts produce outputs whose variance *across repeated runs of the same persona* matches or exceeds the variance *between different personas*, meaning raw model uncertainty, not stable character knowledge, is steering the output (see: Why do LLM persona prompts produce inconsistent outputs across runs?). When the substrate is that noisy, consistency between a stated belief and a later act is almost coincidental.
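The within-run versus between-persona comparison can be sketched as follows. This is an assumed, simplified setup in which each run yields a single numeric score; the persona names and values are invented for illustration and do not reproduce the cited finding.

```python
from statistics import mean, pvariance

def variance_decomposition(runs_by_persona):
    """Compare run-to-run noise with persona-to-persona differences.

    runs_by_persona: dict mapping a persona name to the numeric outputs
    of repeated runs of that same persona prompt (hypothetical data).
    Returns (within, between): the average variance inside each persona
    and the variance of the per-persona means.
    """
    groups = list(runs_by_persona.values())
    within = mean([pvariance(runs) for runs in groups])
    between = pvariance([mean(runs) for runs in groups])
    return within, between

# Invented scores: three repeated runs per persona.
scores = {"cautious": [2, 6, 4], "generous": [3, 7, 5]}
within, between = variance_decomposition(scores)
print(within, between)
```

When `within` is at least as large as `between`, as in this toy data, rerunning one persona moves the output as much as switching personas does, which is the pattern the paragraph above describes.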

The corpus also suggests the inconsistency has at least two distinct flavors worth separating. One is *drift*: the character degrades over a conversation. Reasoning models are especially prone to it: extra 'thinking' actually diverts attention and drifts the style away from the persona unless reasoning is explicitly constrained to the role (see: Why do reasoning models lose character consistency during role-playing?), and multi-turn training that rewards consistency cuts drift by over 55%, distinguishing local within-turn slips from global cross-conversation contradiction (see: Can training user simulators reduce persona drift in dialogue?). The other flavor is *grounding collapse*: agents look socially competent when one model secretly controls everyone, but fail once a persona is supposed to hold private information and act on it, revealing they were skipping the reasoning work that connects belief to action all along (see: Why do LLMs fail when simulating agents with private information?).

There's a genuine tension in the corpus worth flagging, because it tells you the question isn't settled. The 'realizationism' view argues that RLHF-trained personas are *not* fragile pretense: post-training installs sticky dispositional profiles that survive adversarial pressure and jailbreak attempts (see: Are RLHF personas performed characters or realized dispositions?). So the answer may depend on *where* the persona comes from: a prompt-induced role-play character is loosely coupled to behavior and drifts, while a trained-in disposition is more durable. Either way, the deeper backdrop is that token outputs are inherently mutable. They shift with sampling, wording, and context by design, which makes traditional 'does the behavior match the stated belief' consistency checks a poor fit for the medium (see: Why does AI output change with every prompt and context?).

The thing you might not have known you wanted to know: the most reliable fix in the corpus isn't making the model believe harder; it's moving the persona *out of the model*. Reliability comes from externalizing memory, skills, and protocols into a surrounding harness rather than trusting the model to re-solve consistency on every turn (see: Where does agent reliability actually come from?). Belief-behavior consistency, on this read, is an engineering property of the scaffold around the model, not a psychological property of the character inside it.
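To make the harness idea concrete, here is a deliberately small sketch. It is an assumption-laden illustration, not the design from the cited work: `PersonaHarness`, `record`, and `act` are hypothetical names, the "model" is a stub callable, and only the memory and protocol pieces are shown (skills are omitted).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class PersonaHarness:
    # Memory: commitments the persona has stated, stored outside the model.
    commitments: Dict[str, Callable[[int], bool]] = field(default_factory=dict)

    def record(self, name: str, check: Callable[[int], bool]) -> None:
        """Store a stated belief as a check that later actions must pass."""
        self.commitments[name] = check

    def act(self, propose: Callable[[], int], max_retries: int = 3) -> int:
        """Protocol: ask for proposals until one passes every commitment."""
        for _ in range(max_retries):
            action = propose()
            if all(check(action) for check in self.commitments.values()):
                return action
        raise RuntimeError("no proposal satisfied the recorded commitments")

# Stub standing in for a model: first proposes 2, then 6 (invented values).
proposals = iter([2, 6])
harness = PersonaHarness()
harness.record("sends_at_least_half", lambda amount: amount >= 5)
chosen = harness.act(lambda: next(proposals))
print(chosen)
```

The point of the design is that consistency is enforced by the scaffold: the stub's first proposal contradicts the recorded commitment and is rejected, without the model itself needing to remember or honor what the persona said earlier.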


Sources (9 notes)

Why don't LLM role-playing agents act on their stated beliefs?

Trust Game testing revealed systematic inconsistencies between what LLMs claim personas would do and how they actually behave in simulation. Imposed priors and explicit task context did not improve alignment, suggesting persona beliefs operate independently of execution.

Should we treat dialogue agents as role-playing characters?

Shanahan's framework treats LLM outputs as character-consistent text production rather than authentic mental states. The dialogue prompt establishes a character; the model generates continuations matching that character, making folk-psychology applicable to the simulated persona, not the underlying system.

Why do LLM persona prompts produce inconsistent outputs across runs?

When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.

Why do reasoning models lose character consistency during role-playing?

Large reasoning models exhibit attention diversion and style drift during role-playing, but the RAR method—using role-aware constraints and contrastive learning on reasoning style—recovers character fidelity across multiple benchmarks. Simply extending reasoning without guidance actively degrades persona consistency.

Can training user simulators reduce persona drift in dialogue?

By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.

Why do LLMs fail when simulating agents with private information?

Research shows LLMs perform well when one model controls all interlocutors but fail systematically when agents possess private information. This reveals that apparent social competence relies on grounding work that models skip in omniscient settings.

Are RLHF personas performed characters or realized dispositions?

Post-training installs stable dispositional profiles that persist under adversarial pressure, marking them as realized rather than performed. The stickiness of trained personas across conversations distinguishes them from prompt-induced role-play that collapses under jailbreaks.

Why does AI output change with every prompt and context?

AI outputs exhibit essential mutability—they vary with sampling, prompt wording, and audience interpretation. This is not a defect but a defining feature of tokens as media, making them fundamentally different from fixed commodities and resistant to traditional quality assurance.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst evaluating whether role-playing agent belief-behavior inconsistency remains a hard constraint or has been relaxed by newer models, training methods, tooling, or evaluation.

What a curated library found — and when (dated claims, not current truth): Findings span 2023–2026.
• Stated beliefs and executed behaviors diverge systematically in Trust Game experiments; imposing priors does not close the gap (2025-07, arXiv:2507.02197).
• Persona prompts produce variance across repeated runs matching or exceeding variance between different personas, indicating model uncertainty dominates stable character knowledge (2023–2024 cohort).
• Role-playing agents suffer attention diversion and style drift during reasoning (2025-06, arXiv:2506.01748); multi-turn RL reduces persona drift by over 55% (2025-10, arXiv:2511.00222).
• Agents fail under information asymmetry, revealing they skip the reasoning work connecting belief to action (2024-03, arXiv:2403.05020).
• Externalizing memory, skills, and protocols into a surrounding harness improves belief-behavior consistency more reliably than training the model harder (2026-04, arXiv:2604.08224).

Anchor papers (verify; mind their dates):
• arXiv:2507.02197 (2025-07): Do Role-Playing Agents Practice What They Preach?
• arXiv:2511.00222 (2025-10): Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning
• arXiv:2604.08224 (2026-04): Externalization in LLM Agents
• arXiv:2506.01748 (2025-06): Thinking in Character

Your task:
(1) RE-TEST EACH CONSTRAINT. For drift and grounding collapse, has role-aware reasoning (2025-06) or persona vectors (2025-07) actually *solved* inconsistency, or merely reduced surface variance? Does externalizing into a harness genuinely decouple the problem from model capability, or does it hide rather than fix the underlying token mutability? Distinguish what is still genuinely unsolved from what is now an engineering choice.
(2) Surface the strongest contradicting or superseding work from the last ~6 months. On the tension between 'realizationism' (RLHF trains sticky dispositional profiles) and 'prompt-induced pretense': has any recent work empirically adjudicated which belief-installation mechanism is real?
(3) Propose 2 research questions assuming the regime has shifted: (a) If belief-behavior consistency is now primarily an *externalization* problem, not a model-internals problem, what are the failure modes of harness-reliant agents under adversarial pressure or distribution shift? (b) Do persona vectors or role-aware reasoning actually install *causal* connections between stated belief and action, or do they just synchronize surface statistics?

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Next inquiring lines