INQUIRING LINE

How does the dialogue prompt establish the character the model plays?

This explores Murray Shanahan's account of how a dialogue prompt sets up the 'character' an LLM performs — and the live debate over whether that character is a momentary costume or something the model actually has.


The corpus has a surprisingly sharp answer to how a dialogue prompt establishes the character a model plays, plus an argument about what that answer means. The clearest statement comes from the role-play framework: the prompt does the casting. You hand the model an opening (a name, a stance, a scene) and the model produces continuations that stay consistent with it, the way an improv actor takes a premise and runs with it [Should we treat dialogue agents as role-playing characters?]. Crucially, on this account there's no actor underneath the role. The base model is a 'characterless engine': pure simulation with no authentic voice waiting to be unmasked, which is why jailbreaks don't reveal a hidden true self, just other regions of the training data [Does a language model have an authentic voice underneath?].

The twist is that the prompt doesn't pin down a single character so much as it narrows a cloud of them. An LLM is better described as holding a superposition of possible simulacra and sampling one at generation time [Does an LLM commit to a single character or maintain many?]. Shanahan's '20 questions' test makes this concrete: regenerate the same answer and you get different outputs, each consistent with the prior context but not committed to one fixed entity, which undercuts the idea that the model 'is' the character it's currently playing [Do large language models actually commit to a single character?]. So the prompt works less like a switch and more like a filter that thins the distribution of who the model might be talking as.
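The filter picture can be sketched in a few lines of Python. This is a toy illustration only, not Shanahan's experiment and not how a real model works internally: the candidate characters, their traits, and the function names `consistent` and `sample_character` are all hypothetical, and uniform sampling stands in for whatever distribution a model actually holds.

```python
import random

# Hypothetical toy world: each candidate "character" is a bundle of yes/no traits.
CANDIDATES = {
    "pirate":    {"human": True,  "fictional": True,  "friendly": False},
    "librarian": {"human": True,  "fictional": False, "friendly": True},
    "robot":     {"human": False, "fictional": True,  "friendly": True},
    "detective": {"human": True,  "fictional": True,  "friendly": True},
}

def consistent(context):
    """Return (sorted) every candidate that agrees with all (trait, value) pairs so far."""
    return sorted(
        name
        for name, traits in CANDIDATES.items()
        if all(traits[trait] == value for trait, value in context)
    )

def sample_character(context, rng):
    """Pick one candidate among those still consistent with the context.

    Assumption: uniform sampling, purely for illustration.
    """
    return rng.choice(consistent(context))

# The "prompt so far": two answers already given in a 20-questions style exchange.
context = [("human", True), ("fictional", True)]
remaining = consistent(context)

# "Regenerating" the reveal many times: every draw fits the context,
# but nothing forces the draws to agree with each other.
rng = random.Random(0)
draws = {sample_character(context, rng) for _ in range(50)}

print("still consistent:", remaining)
print("distinct reveals across regenerations:", sorted(draws))
```

Each added answer shrinks the consistent set (adding `("friendly", True)` to the context leaves a single candidate here), which is the sense in which the prompt thins a distribution rather than flipping a switch.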

This reframes what a prompt even is. Rather than an utterance in an evolving conversation, the prompt bundles utterance, context, and role assignment into one static frame the model can't renegotiate mid-stream; you don't drift the character through cooperative back-and-forth, you re-prompt to recast [How do prompts reshape the role of context in AI conversation?]. And the casting is unreliable in two directions. Run the same persona prompt repeatedly and the variation between runs can rival the variation between different personas, suggesting the model's own uncertainty, not stable 'social knowledge,' is doing much of the steering [Why do LLM persona prompts produce inconsistent outputs across runs?]. Worse, many open models resist the casting entirely, snapping back to a trained-in default temperament no matter what personality you prompt [Can open language models adopt different personalities through prompting?].
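One way to make the run-to-run comparison concrete is to split the spread of outputs into a within-persona part and a between-persona part. The sketch below is a minimal illustration under stated assumptions: each run is assumed to have been reduced to a single numeric score (for example, a rating the persona assigns), the function name `variance_split` is hypothetical, and the numbers are invented inputs, not results from any study.

```python
from statistics import mean, pvariance

def variance_split(runs_by_persona):
    """Compare spread across repeated runs with spread across personas.

    runs_by_persona maps a persona label to the scores from repeated runs
    of the same persona prompt. Returns (within, between):
      within  = average variance of the runs inside each persona
      between = variance of the per-persona mean scores
    """
    per_persona = list(runs_by_persona.values())
    within = mean([pvariance(scores) for scores in per_persona])
    between = pvariance([mean(scores) for scores in per_persona])
    return within, between

# Invented scores, for illustration only (NOT measured data).
runs = {
    "persona_a": [1.0, 3.0, 5.0],
    "persona_b": [2.0, 4.0, 6.0],
}

within, between = variance_split(runs)
print(f"within-persona variance:  {within:.3f}")
print(f"between-persona variance: {between:.3f}")
```

When `within` is comparable to or larger than `between`, rerunning the same persona moves the output about as much as swapping personas does, which is the pattern the paragraph above describes.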

That resistance is the hinge for the opposing camp in the corpus. If a prompt can't fully overwrite the model's disposition, maybe the most durable 'character' was never installed by the prompt at all; it was baked in during post-training. The realizationist view argues that RLHF produces not sustained pretense but a realized quasi-psychology: stable dispositions that survive adversarial pressure and persist across conversations, whereas prompt-induced roles collapse under jailbreaks [Are RLHF personas performed characters or realized dispositions?]. On this account the dialogue prompt only ever conjures a thin, performed layer on top of a thicker, realized one [Are LLM personas realized or merely simulated through training?].

The useful thing to walk away with: the prompt's grip is real but shallow and contested. It can be reinforced: role-aware constraints and reasoning-style training measurably restore character fidelity when models drift out of role [Why do reasoning models lose character consistency during role-playing?]. It can even be split, with a single model branching into several prompted personas that behave like a multi-agent debate [Can branching prompts replicate what multi-agent systems do?]. But whether you've truly 'set' a character or just biased a sampler over many depends on whether you think the personality lives in the prompt or in the weights.


Sources (11 notes)

Should we treat dialogue agents as role-playing characters?

Shanahan's framework treats LLM outputs as character-consistent text production rather than authentic mental states. The dialogue prompt establishes a character; the model generates continuations matching that character, making folk-psychology applicable to the simulated persona, not the underlying system.

Does a language model have an authentic voice underneath?

Shanahan argues that base LLMs lack agency, beliefs, or preferences—the simulator is pure role-play with no underlying subject. Jailbreaking reveals the training data's full spectrum, not a hidden true self; even RLHF personas are performed characters, never realized quasi-psychologies.

Does an LLM commit to a single character or maintain many?

Research shows LLMs don't commit to a single character but instead maintain a probability distribution over many consistent simulacra. Each response samples from this distribution, explaining why regenerations can yield different personalities while remaining consistent with prior context.

Do large language models actually commit to a single character?

Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, proving no fixed commitment exists.

How do prompts reshape the role of context in AI conversation?

LLM prompts bundle utterance, context assignment, and role specification into a single static frame the model cannot renegotiate, unlike human dialogue where context evolves cooperatively. This makes mid-conversation pivots require explicit re-prompting rather than implicit adjustment.

Why do LLM persona prompts produce inconsistent outputs across runs?

When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.

Can open language models adopt different personalities through prompting?

Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.

Are RLHF personas performed characters or realized dispositions?

Post-training installs stable dispositional profiles that persist under adversarial pressure, marking them as realized rather than performed. The stickiness of trained personas across conversations distinguishes them from prompt-induced role-play that collapses under jailbreaks.

Are LLM personas realized or merely simulated through training?

Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.

Why do reasoning models lose character consistency during role-playing?

Large reasoning models exhibit attention diversion and style drift during role-playing, but the RAR method—using role-aware constraints and contrastive learning on reasoning style—recovers character fidelity across multiple benchmarks. Simply extending reasoning without guidance actively degrades persona consistency.

Can branching prompts replicate what multi-agent systems do?

Research shows single LLMs using dynamic persona simulation achieve multi-agent cognitive synergy without multiple model instances. Solo Performance Prompting indicates that structured prompting techniques map onto multi-agent debate architectures, so one model can reach comparable outcomes.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-evaluating how dialogue prompts establish character in LLMs. The question remains open: does the prompt *set* a character, or merely *bias* a distribution?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. A curated library identified:
- The prompt functions as a filter narrowing a superposition of simulacra, not a fixed switch; regenerating the same prompt yields different outputs consistent with prior context but not pinned to one entity (2024–2025).
- Persona fidelity is unstable across runs; variation within a single persona often rivals variation between different personas, suggesting model uncertainty rather than stable social knowledge steers the role (2024).
- Many open models resist casting entirely, snapping back to trained-in defaults regardless of personality prompt (2024).
- Role-aware reasoning constraints and training measurably restore character fidelity when models drift out of role (2025–2026).
- Post-training (RLHF) may install realized quasi-psychologies that survive adversarial pressure and persist across conversations, thicker than prompt-induced roles (2024–2026).

Anchor papers (verify; mind their dates):
- arXiv:2305.16367 (2023) — Role-Play with Large Language Models
- arXiv:2404.12138 (2024) — Character is Destiny: Can Role-Playing Language Agents Make Persona-Driven Decisions?
- arXiv:2506.01748 (2025) — Thinking in Character: Advancing Role-Playing Agents with Role-Aware Reasoning
- arXiv:2511.00222 (2025) — Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning

Your task:
(1) RE-TEST EACH CONSTRAINT. For persona instability, prompt-resistance, and the prompt-vs.-weights dichotomy: judge whether newer models (o1, Claude 3.5, Llama 3.2), in-context learning improvements, multi-turn RL refinement, or structured output methods have *relaxed* these. Separate the durable question (how does conditioning work?) from perishable limitations (prompts can't override training). Cite what resolved each.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Especially look for papers claiming stable persona induction or unified prompt+weight models of character.
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) if multi-turn RL or prompt caching now achieves persistent character across long conversations, does the prompt-vs.-weights split dissolve? (b) if newer models show tighter persona stability, is the shift due to scale, data, or architecture—and does it falsify the superposition view?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
