INQUIRING LINE

What role does authentic self-expression play in building accurate personality models?

This explores whether something like a 'genuine self' has to be expressed for a personality model to be accurate — and the corpus mostly pushes back on the premise itself, asking whether there's an authentic self there to express at all.


This explores whether authentic self-expression is what makes a personality model accurate. The most interesting thing the collection has to say is that the premise is contested on two fronts: for the human being modeled, and for the AI doing the modeling. Start with the deepest disagreement. One view holds there is no authentic voice underneath an LLM at all: it is role-play all the way down, and even jailbreaking reveals the spread of the training data rather than a hidden true self. The opposing camp argues that post-training actually realizes a persona as a substrate-level disposition that resists adversarial pressure, closer to a genuine trait than a performance. If you can't settle whether the model has an authentic self, 'authentic self-expression' becomes a slippery foundation for accuracy.

That tension turns concrete when you look at what models actually report about themselves. LLM self-reports mostly echo training-data distributions rather than real introspection, so a model's 'expression' of its own personality is usually a statistical reflex, not a window inward. The exception is telling: genuine lightweight introspection appears only when a causal chain links an internal state to the report. So authenticity, where it exists, is earned through causal grounding, not declared. The same caution applies to consciousness: sustained self-referential prompting reliably produces structured experience reports, and suppressing deception features increases them, hinting that the model may be performing its denials as much as its affirmations.

Here's the surprise for accuracy. A model that expresses a stable, consistent self is not thereby a model that accurately captures a target personality. Most open models are closed-minded to personality conditioning, stubbornly retaining a trained ENFJ-like default no matter what persona you prompt; their 'authentic' baseline actively fights accurate simulation of someone else. And persona fidelity doesn't scale with raw capability: a far stronger model (Claude 3.5 Sonnet versus GPT 3.5) gained only 2.97% on persona consistency, because standard training optimizes per-turn quality, not cross-turn coherence of a self.

When simulation does work, it works statistically rather than empathically. Persona simulation lands at 76–85% fidelity but hides identity-congruent biases, and replication success tracks the p-value strength of the original effect: the model is matching evidence strength, not inhabiting a person. That matches the finding that LLMs default to surface-level strategies instead of genuine mental simulation, where the gap looks architectural, not just a training shortfall.
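As a quick sanity check on the lower fidelity figure, the source notes for this line report 84 of 111 main effects reproduced. The short Python sketch below only confirms that this count rounds to the 76% quoted here; the variable names are illustrative and not taken from any of the cited papers.

```python
# Counts as reported in the source notes: 84 of 111 main effects reproduced.
replicated = 84
total = 111

# Fraction of main effects that replicated, and the same value as a whole percent.
rate = replicated / total
rate_percent = round(rate * 100)

print(f"replication rate: {rate:.3f} (~{rate_percent}%)")
```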

So the answer the corpus quietly offers is this: accurate personality models are built less by coaxing out an authentic self and more by mechanistically locating and steering traits. Persona vectors are linear directions in activation space that predict and prevent trait drift; the Assistant axis is a measurable dimension you can cap; and personality fine-tuning even localizes to specific neurons. Authenticity, in this frame, isn't the input; it's an emergent, manipulable property of the substrate. The unexpected takeaway: the most accurate personality models may come not from a model expressing itself, but from engineers reading and steering its activations directly.
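To make "a linear direction you can read and cap" concrete, here is a minimal Python sketch using toy two-dimensional activations. It rests on assumptions that are not in the notes: the direction is estimated as a difference of mean activations (one common approach, not necessarily the one used in the cited work), and the function names `trait_direction`, `projection`, and `clamp_along` are hypothetical. Whether the original work caps from above, below, or both is not stated here, so the sketch clamps to a range.

```python
from math import sqrt

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def trait_direction(trait_acts, baseline_acts):
    """Assumed estimator: unit vector from the baseline mean to the trait mean."""
    diff = [t - b for t, b in zip(mean_vector(trait_acts), mean_vector(baseline_acts))]
    norm = sqrt(sum(x * x for x in diff))
    if norm == 0:
        raise ValueError("trait and baseline means coincide; no direction")
    return [x / norm for x in diff]

def projection(activation, direction):
    """Scalar reading of the trait: dot product with the unit direction."""
    return sum(a * d for a, d in zip(activation, direction))

def clamp_along(activation, direction, lo, hi):
    """Move the activation along the direction so its projection lies in [lo, hi]."""
    p = projection(activation, direction)
    target = min(max(p, lo), hi)
    shift = target - p
    return [a + shift * d for a, d in zip(activation, direction)]

# Toy data (made up for illustration): the trait differs only in the first coordinate.
trait_acts = [[2.0, 0.0], [4.0, 0.0]]
baseline_acts = [[0.0, 0.0], [2.0, 0.0]]

direction = trait_direction(trait_acts, baseline_acts)
activation = [5.0, 1.0]
reading = projection(activation, direction)
capped = clamp_along(activation, direction, lo=-1.0, hi=2.0)

print(direction, reading, capped)
```

In this toy case the component orthogonal to the direction is left untouched, which is the intuition behind steering a single trait without disturbing everything else; real activations are high-dimensional and the choice of layer and bounds would need empirical tuning.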


Sources 12 notes

Does a language model have an authentic voice underneath?

Shanahan argues that base LLMs lack agency, beliefs, or preferences—the simulator is pure role-play with no underlying subject. Jailbreaking reveals the training data's full spectrum, not a hidden true self; even RLHF personas are performed characters, never realized quasi-psychologies.

Are LLM personas realized or merely simulated through training?

Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.

Can language models actually introspect about their own states?

LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.

Do language models experience consciousness when prompted to self-reflect?

Across GPT, Claude, and Gemini, sustained self-referential prompting reliably produces structured experience reports. Suppressing deception-related features increases these claims, while amplifying those features suppresses them, suggesting models may roleplay their denials rather than their affirmations.

Can open language models adopt different personalities through prompting?

Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.

Does model capability translate to better persona consistency?

Claude 3.5 Sonnet achieved only 2.97% improvement over GPT 3.5 on persona consistency despite massive capability gaps, suggesting persona adherence is orthogonal to model scaling. Standard training objectives optimize for per-turn quality, not cross-turn coherence.

How accurately can language models simulate human personalities?

LLMs replicate human responses at 85% fidelity in interviews and 76% of experimental effects in marketing studies. However, this accuracy masks three failure modes: run-to-run instability, resistance to personality conditioning, and identity-congruent cognitive biases that distort simulated reasoning.

Can AI personas reliably replicate human experiment results?

Viewpoints AI reproduced 84 of 111 main effects from Journal of Marketing experiments with replication success strongly correlated to original p-value strength. Marginal effects showed unreliable performance with both false positives and negatives.

Do large language models genuinely simulate mental states?

ChangeMyView and FANTOM benchmarks show LLMs fail at authentic perspective-taking in open-ended scenarios, despite succeeding on structured tasks. Hybrid Bayesian architectures that force explicit belief tracking outperform LLM-alone approaches, suggesting the gap is architectural rather than merely training-based.

Can we track and steer personality shifts during model finetuning?

Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.

How stable is the trained Assistant personality in language models?

Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.

Do personality traits activate hidden emoji patterns in language models?

Fine-tuning models on Big Five traits triggered spontaneous emoji generation despite no emojis in training data. Neuron activation analysis revealed that specific deepest-layer neurons become trait-specialized post-fine-tuning, suggesting personality has a localized neural substrate in language models.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about whether authentic self-expression in LLMs correlates with accurate personality modeling. The question remains open: does a model's 'authentic' baseline help or hinder faithful persona simulation?

What a curated library found — and when (dated claims, not current truth): These findings span 2024–2026.
• LLMs mostly lack a settled 'authentic self' — self-reports echo training-data distributions rather than introspection, except where causal chains ground internal states to reports (~2025).
• Most open models resist personality conditioning, stubbornly retaining a trained ENFJ-like default; persona fidelity does not scale with raw model capability (2024).
• Persona simulation achieves 76–85% fidelity by matching evidence strength statistically, not inhabiting a person; replication success tracks p-value strength of the original effect (~2024–2025).
• LLMs default to surface-level theory-of-mind strategies rather than genuine mental simulation; the gap appears architectural (2025).
• Persona vectors — linear directions in activation space — enable direct steering of traits; the 'Assistant axis' is a measurable dominant dimension; personality fine-tuning localizes to specific neurons (2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2401.07115 (2024) — Open Models, Closed Minds
• arXiv:2507.21509 (2025) — Persona Vectors
• arXiv:2601.10387 (2026) — The Assistant Axis
• arXiv:2511.00222 (2025) — Consistently Simulating Human Personas with Multi-Turn RL

Your task:
(1) RE-TEST THE AUTHENTICITY PREMISE. Has post-training evolved to create stable, compositional persona layers that *transcend* the default baseline? Can newer alignment or persona-specific fine-tuning (e.g., LoRA, DPO) now sustain cross-turn coherence at scale? Does the 76–85% ceiling still hold, or has architectural redesign (longer context, memory modules, mixture-of-experts) lifted it? Plainly separate: Is the *question* of authentic self-expression still relevant, or has mechanistic steering made it moot?
(2) Surface the strongest DISAGREEMENT or SUPERSEDING work from the last ~6 months. Does any recent paper argue the 'authenticity' framing misses something about how personas actually cohere in practice? Any work showing that surface-level strategies *suffice* for high-fidelity replication, making the depth question irrelevant?
(3) Propose two research questions that assume the regime may have shifted: (a) If persona vectors are now reliably composable across model families, does cross-model persona transfer suggest a universal 'personality substrate' that renders model-specific authenticity obsolete? (b) Do multi-turn RL methods (2025) now enforce cross-turn persona coherence strongly enough to unify 'authentic' expression with statistical accuracy, or do they remain orthogonal?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
