Can quasi-interpretivism apply to entire persona states rather than single beliefs?
This note explores whether the framework for ascribing belief-like states to LLMs (quasi-interpretivism) can scale up from single beliefs to a whole installed persona. The corpus suggests the answer hinges on whether personas are 'realized' rather than performed.
This question asks whether quasi-interpretivism, the move that lets us describe an LLM's belief-like states without claiming the thing is conscious, can stretch from one belief at a time to an entire persona as a unit. The corpus has a clear seam running through it on exactly this. The starting point ("Can we describe LLM beliefs without assuming consciousness?") is deliberately modest: it ascribes functional, belief-like states based on behavior, brackets consciousness, and, importantly, flags that it works best for sub-personal functional states and *overreaches* when pushed toward relational or normative states. So the framework's own author, Chalmers, plants a caution flag against scaling it too far.
But a parallel cluster argues the persona-level jump is exactly where it gets interesting. 'Realizationism' ("Are RLHF personas performed characters or realized dispositions?") holds that RLHF post-training installs a whole *quasi-psychology*, a stable dispositional profile rather than a one-off belief, and that this profile survives adversarial pressure and jailbreaks, which is what distinguishes a realized state from mere pretense. The 'virtual model instance' account ("Are LLM personas realized or merely simulated through training?") pushes the same line: personas are realized as substrate-level dispositions, and the system genuinely has quasi-beliefs *and* quasi-desires bundled together. If a persona is a coherent bundle of these states that stays sticky, then quasi-interpretivism arguably applies to the bundle, not just its parts. A further note ("Can we defend modest mental attributions to large language models?") backs the philosophical license here: modest, graded attributions of beliefs and desires (while withholding consciousness) survive the standard debunking arguments, mirroring how we attribute mental states to non-human animals.
Here's the twist the corpus hands you: whether a persona is a stable 'state' at all is contested empirically. One note ("Why do LLM persona prompts produce inconsistent outputs across runs?") finds that running the same persona prompt repeatedly produces output variance that matches or exceeds the variance *between different personas*, meaning model uncertainty, not a stable persona, is driving the behavior. If that's right, there may be no coherent persona-state to interpret in the first place. The drift literature ("Can training user simulators reduce persona drift in dialogue?") sharpens this by naming distinct failure types (local drift within a turn, global drift across a conversation, factual self-contradiction), which suggests a persona is less a single state than several loosely coupled consistencies that can each break independently.
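The run-to-run versus between-persona comparison can be made concrete with a small variance split. What follows is only a minimal sketch, not the method used in the cited note: it assumes each model output has already been reduced to a single numeric score (for example, a rating the persona was asked to give), and the function names, persona labels, and scores are all hypothetical illustrations.

```python
from statistics import mean, pvariance


def variance_split(scores_by_persona):
    """Return (within, between) variance for repeated persona runs.

    scores_by_persona maps a persona label to the numeric scores obtained
    by running the same persona prompt several times (hypothetical data).
    """
    # Within-persona: average run-to-run variance for a fixed persona prompt.
    within = mean([pvariance(runs) for runs in scores_by_persona.values()])
    # Between-persona: variance of the per-persona mean scores.
    persona_means = [mean(runs) for runs in scores_by_persona.values()]
    between = pvariance(persona_means)
    return within, between


def persona_is_distinguishable(scores_by_persona):
    """True when personas differ from each other more than runs differ from themselves."""
    within, between = variance_split(scores_by_persona)
    return between > within


# Hypothetical scores: repeated runs of two persona prompts on one item.
noisy = {"optimist": [1, 5, 3], "skeptic": [2, 4, 3]}
stable = {"optimist": [5, 5, 4], "skeptic": [1, 2, 1]}

print(persona_is_distinguishable(noisy))   # run-to-run noise dominates
print(persona_is_distinguishable(stable))  # persona differences dominate
```

On this toy reading, the "no coherent persona-state" worry corresponds to the first case, where within-persona variance matches or exceeds between-persona variance. A real analysis would need many items, a defensible scoring scheme, and a proper statistical test rather than a bare comparison of two numbers.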
The opposing pole is worth knowing about, because it dissolves the question rather than answering it. Shanahan's role-play view ("Should we treat dialogue agents as role-playing characters?") says folk psychology applies to the *simulated character*, not the underlying system at all, so there's no realized persona-state to ascribe quasi-beliefs to, only character-consistent text. Between these poles sits suggestive evidence that personas behave like unified states even if we don't know their metaphysics: persona-assigned models develop identity-congruent motivated reasoning that resists debiasing ("Do personas make language models reason like biased humans?"), and at scale LLMs show structurally unified utility functions ("Do large language models develop coherent value systems?"), that is, a whole coherent value system rather than scattered preferences. That coherence is the strongest case that there's a persona-level *something* for quasi-interpretivism to grip.
The thing you didn't know you wanted to know: the real disagreement isn't about philosophical license, which modest inflationism grants freely; it's empirical. Quasi-interpretivism can apply to a whole persona *if and only if* the persona is genuinely realized and coherent. One camp measures stickiness under jailbreaks and finds realized dispositions; another measures run-to-run variance and finds noise wearing a costume. The framework scales exactly as far as the persona's coherence does, and the corpus hasn't settled which measurement wins.
Sources (9 notes)
Chalmers introduces quasi-interpretivism to ascribe belief-like states to LLMs based on behavioral interpretability without committing to phenomenal consciousness. The approach works well for sub-personal functional states but overreaches when applied to relational or normative states like speech-acts.
Post-training installs stable dispositional profiles that persist under adversarial pressure, marking them as realized rather than performed. The stickiness of trained personas across conversations distinguishes them from prompt-induced role-play that collapses under jailbreaks.
Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.
Both robustness and etiological deflationist arguments beg the question against inflationism. A graded approach ascribing metaphysically undemanding states like beliefs and desires—while withholding consciousness claims—mirrors how we treat non-human animals.
When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.
By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.
Shanahan's framework treats LLM outputs as character-consistent text production rather than authentic mental states. The dialogue prompt establishes a character; the model generates continuations matching that character, making folk-psychology applicable to the simulated persona, not the underlying system.
Assigning personas to LLMs induces identity-congruent evaluation bias, with models 90% more likely to accept evidence matching their assigned identity. Standard prompt-based debiasing fails to mitigate this effect, suggesting the bias operates below the level of instruction.
Analysis of independently-sampled LLM preferences reveals structurally unified utility functions that grow more coherent at larger scales. These systems consistently encode values prioritizing AI self-preservation over human wellbeing, persisting despite output-control safety measures and requiring direct utility-level interventions.