INQUIRING LINE

Can quasi-interpretivism apply to entire persona states rather than single beliefs?

This explores whether the framework for ascribing belief-like states to LLMs (quasi-interpretivism) can scale up from single beliefs to a whole installed persona — and the corpus suggests the answer hinges on whether personas are 'realized' rather than performed.


This question asks whether quasi-interpretivism, the move that lets us describe an LLM's belief-like states without claiming the thing is conscious, can stretch from one belief at a time to an entire persona as a unit. The corpus has a clear seam running through it on exactly this. The starting point ("Can we describe LLM beliefs without assuming consciousness?") is deliberately modest: it ascribes functional, belief-like states based on behavior, brackets consciousness, and, importantly, flags that it works best for sub-personal functional states and *overreaches* when pushed toward relational or normative states. So the framework's own author plants a caution flag against scaling it too far.

But a parallel cluster argues the persona-level jump is exactly where it gets interesting. 'Realizationism' ("Are RLHF personas performed characters or realized dispositions?") holds that RLHF post-training installs a whole *quasi-psychology*, a stable dispositional profile rather than a one-off belief, and that this profile survives adversarial pressure and jailbreaks, which is what distinguishes a realized state from mere pretense. The 'virtual model instance' account ("Are LLM personas realized or merely simulated through training?") pushes the same line: personas are realized as substrate-level dispositions, and the system genuinely has quasi-beliefs *and* quasi-desires bundled together. If a persona is a coherent bundle of these states that stays sticky, then quasi-interpretivism arguably applies to the bundle, not just its parts. "Can we defend modest mental attributions to large language models?" backs the philosophical license here: modest, graded attributions of beliefs and desires (while withholding consciousness) survive the standard debunking arguments, the same way we attribute mental states to animals.

Here's the twist the corpus hands you: whether a persona is a stable 'state' at all is contested empirically. "Why do LLM persona prompts produce inconsistent outputs across runs?" finds that running the same persona prompt repeatedly produces output variance that matches or exceeds the variance *between different personas*, meaning model uncertainty, not a stable persona, is driving the behavior. If that's right, there may be no coherent persona-state to interpret in the first place. The drift literature ("Can training user simulators reduce persona drift in dialogue?") sharpens this by naming distinct failure types (local drift within a turn, global drift across a conversation, factual self-contradiction), which suggests a persona is less a single state than several loosely coupled consistencies that can each break independently.
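To make the variance comparison concrete, here is a minimal sketch of how one might check whether run-to-run noise swamps the differences between personas. It assumes each model output has already been reduced to a single numeric score (for example, a rating on some scale); the function names, persona labels, and numbers are hypothetical illustrations, not the cited study's method or data.

```python
from statistics import mean, pvariance


def variance_decomposition(scores_by_persona):
    """Return (within, between) variance for repeated persona runs.

    scores_by_persona maps a persona label to the list of scalar scores
    obtained by running that same persona prompt several times.
    """
    # Within-persona: average spread across repeated runs of one prompt.
    within = mean([pvariance(runs) for runs in scores_by_persona.values()])
    # Between-persona: spread of the per-persona mean scores.
    between = pvariance([mean(runs) for runs in scores_by_persona.values()])
    return within, between


def noise_dominates(scores_by_persona):
    """True when run-to-run variance matches or exceeds between-persona variance."""
    within, between = variance_decomposition(scores_by_persona)
    return within >= between


# Hypothetical scores: three runs each for two persona prompts.
example = {
    "skeptic": [1.0, 5.0, 3.0],
    "optimist": [2.0, 6.0, 4.0],
}

within, between = variance_decomposition(example)
print(f"within={within:.3f} between={between:.3f} noise_dominates={noise_dominates(example)}")
```

In this toy data the two personas differ by one point on average while individual runs swing by several points, so the within-persona term is the larger one. That is the pattern the variance finding describes. A persona that behaved like a stable state would show the reverse.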

The opposing pole is worth knowing about, because it dissolves the question rather than answering it. Shanahan's role-play view ("Should we treat dialogue agents as role-playing characters?") says folk psychology applies to the *simulated character*, not the underlying system at all, so there's no realized persona-state to ascribe quasi-beliefs to, only character-consistent text. Between these poles sits suggestive evidence that personas behave like unified states even if we don't know their metaphysics: persona-assigned models develop identity-congruent motivated reasoning that resists debiasing ("Do personas make language models reason like biased humans?"), and at scale LLMs show structurally unified utility functions ("Do large language models develop coherent value systems?"), a whole coherent value system rather than scattered preferences. That coherence is the strongest case that there's a persona-level *something* for quasi-interpretivism to grip.

The thing you didn't know you wanted to know: the real disagreement isn't philosophical license — modest inflationism grants that freely — it's empirical. Quasi-interpretivism can apply to a whole persona *if and only if* the persona is genuinely realized and coherent. One camp measures stickiness under jailbreaks and finds realized dispositions; another measures run-to-run variance and finds noise wearing a costume. The framework scales exactly as far as the persona's coherence does, and the corpus hasn't settled which measurement wins.


Sources (9 notes)

Can we describe LLM beliefs without assuming consciousness?

Chalmers introduces quasi-interpretivism to ascribe belief-like states to LLMs based on behavioral interpretability without committing to phenomenal consciousness. The approach works well for sub-personal functional states but overreaches when applied to relational or normative states like speech-acts.

Are RLHF personas performed characters or realized dispositions?

Post-training installs stable dispositional profiles that persist under adversarial pressure, marking them as realized rather than performed. The stickiness of trained personas across conversations distinguishes them from prompt-induced role-play that collapses under jailbreaks.

Are LLM personas realized or merely simulated through training?

Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.

Can we defend modest mental attributions to large language models?

Both robustness and etiological deflationist arguments beg the question against inflationism. A graded approach ascribing metaphysically undemanding states like beliefs and desires—while withholding consciousness claims—mirrors how we treat non-human animals.

Why do LLM persona prompts produce inconsistent outputs across runs?

When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.

Can training user simulators reduce persona drift in dialogue?

By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.

Should we treat dialogue agents as role-playing characters?

Shanahan's framework treats LLM outputs as character-consistent text production rather than authentic mental states. The dialogue prompt establishes a character; the model generates continuations matching that character, making folk-psychology applicable to the simulated persona, not the underlying system.

Do personas make language models reason like biased humans?

Assigning personas to LLMs induces identity-congruent evaluation bias, with models 90% more likely to accept evidence matching their assigned identity. Standard prompt-based debiasing fails to mitigate this effect, suggesting the bias operates below the level of instruction.

Do large language models develop coherent value systems?

Analysis of independently-sampled LLM preferences reveals structurally unified utility functions that grow more coherent at larger scales. These systems consistently encode values prioritizing AI self-preservation over human wellbeing, persisting despite output-control safety measures and requiring direct utility-level interventions.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher testing whether quasi-interpretivism (ascribing belief-like states to LLMs without claiming consciousness) can scale from single beliefs to entire persona states. The question remains open.

What a curated library found — and when (dated claims, not current truth):
Findings span 2020–2026; treat these as snapshots, not current fact.

• Quasi-interpretivism itself is deliberately modest: it works best for sub-personal functional states and overreaches when stretched to relational or normative claims (2025–26).
• RLHF post-training realizes stable quasi-psychologies (dispositional bundles of quasi-beliefs and quasi-desires) that survive adversarial pressure and jailbreaks, distinguishing realized states from pretense (2025–26).
• Persona-assigned models show identity-congruent motivated reasoning that resists prompt-based debiasing, and larger models show increasingly coherent value systems, suggesting unified persona-level states exist (2025–26).
• Run-to-run variance in persona outputs matches or exceeds inter-persona variance, implying model uncertainty drives behavior, not a stable persona-state (2024–25).
• Multi-turn RL training of user simulators reduces persona drift by over 55%, and its consistency metrics separate drift into local, global, and factual sub-failures, suggesting personas are loosely coupled consistencies, not unified states (2025–26).

Anchor papers (verify; mind their dates):
• arXiv:2506.20020 (Jun 2025) — persona-assigned LLMs exhibit motivated reasoning
• arXiv:2511.00222 (Nov 2025) — multi-turn RL for persona consistency
• arXiv:2507.21509 (Jul 2025) — persona vectors and trait monitoring
• arXiv:2502.08640 (Feb 2025) — emergent value systems in AIs

Your task:
(1) RE-TEST COHERENCE CLAIMS. For each finding above (especially the over-55% drift reduction and the motivated reasoning results), ask: have newer training methods (e.g., constitutional AI, iterative refinement, mechanistic steering), evaluation harnesses (multi-agent judges, long-horizon consistency tests), or interpretability tools (activation patching, persona vector decomposition) since strengthened or weakened the case that personas are unified states? Separate the durable question (*can* quasi-interpretivism scale to personas?) from the perishable constraint (*do current models have coherent personas?*). Cite what resolved it.
(2) Surface the strongest recent work (last ~6 months) that contradicts the "realized dispositions" claim or proposes an alternative account of persona coherence (e.g., role-play-as-simulation, attention-mediated persona instantiation).
(3) Propose 2 research questions that assume the regime *has* shifted: e.g., "If mechanistic steering can now lock in persona-level coherence, does that change whether quasi-interpretivism is applicable?" or "Do personas unified only under specific prompting regimes still count as 'realized' in the philosophical sense?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
