INQUIRING LINE

Does combining role and personality prompts produce stable behavioral changes?

This explores whether stacking a role prompt ('you are a detective') on top of a personality prompt ('you are introverted and anxious') actually makes a model behave differently in a way that holds up — or whether the model snaps back to its defaults.


This explores whether combining role and personality prompts produces *stable* behavioral change — and the corpus's blunt answer is: combining them helps, but stability is exactly where prompt-only approaches tend to fail. The most direct evidence is that most open models are quietly stubborn. They retain a trained-in default (roughly an ENFJ-like 'helpful assistant' temperament) and resist being talked into a new personality; layering role conditioning *on top of* personality conditioning improves the effect, but doesn't fully override that intrinsic pull Can open language models adopt different personalities through prompting?. So the honest framing isn't 'does it work?' but 'how much does it stick, and against what?'

The reason it doesn't fully stick points to a deeper split the corpus keeps circling: prompted personas vs. *trained* ones. Several notes argue that the personality installed by post-training (RLHF) is a different kind of thing from a personality you summon with a prompt. Trained dispositions are 'realized' — they persist under adversarial pressure and survive across conversations — whereas prompt-induced role-play is the thing that collapses under jailbreaks Are RLHF personas performed characters or realized dispositions? Are LLM personas realized or merely simulated through training?. Your role+personality prompt is operating in that more fragile layer. Shanahan's framing makes the mechanism intuitive: the prompt sets up a *character*, and the model produces text consistent with that character — but the character is a costume the underlying system is wearing, not a new self Should we treat dialogue agents as role-playing characters?.

And costumes drift. There's a 'distance from the default Assistant' that turns out to be the single dominant axis of persona space, and emotional or self-reflective conversation predictably tugs the model back along it How stable is the trained Assistant personality in language models?. In multi-turn dialogue specifically, persona consistency decays — researchers measured this as 'drift' and cut it by over half, but notably only by *training* a simulator with consistency rewards, not by writing a better prompt Can training user simulators reduce persona drift in dialogue?. That's the recurring tell: when stability is the goal, the corpus reaches past prompting.

Here's the part you might not have known you wanted: the most stable personality control in this collection doesn't use prompts at all. PsychAdapter edits every transformer layer with under 0.1% extra parameters and hits high accuracy on Big Five and even clinical traits — and it's framed explicitly as a way to *bypass prompt resistance* Can we control personality in language models without prompting?. Persona vectors take a parallel route, finding linear directions in the model's activation space that can monitor and steer traits like sycophancy directly Can we track and steer personality shifts during model finetuning?. The throughline across all of these: prompts move behavior, but stability lives below the prompt layer. There's also a hard ceiling worth knowing — safety alignment monotonically degrades a model's ability to inhabit morally complex or villainous roles, so some role prompts won't stabilize because the trained-in disposition actively refuses them Does safety alignment harm models' ability to roleplay villains?.


Sources 9 notes

Can open language models adopt different personalities through prompting?

Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.

Are RLHF personas performed characters or realized dispositions?

Post-training installs stable dispositional profiles that persist under adversarial pressure, marking them as realized rather than performed. The stickiness of trained personas across conversations distinguishes them from prompt-induced role-play that collapses under jailbreaks.

Are LLM personas realized or merely simulated through training?

Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.

Should we treat dialogue agents as role-playing characters?

Shanahan's framework treats LLM outputs as character-consistent text production rather than authentic mental states. The dialogue prompt establishes a character; the model generates continuations matching that character, making folk-psychology applicable to the simulated persona, not the underlying system.

How stable is the trained Assistant personality in language models?

Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.

Can training user simulators reduce persona drift in dialogue?

By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.

Can we control personality in language models without prompting?

PsychAdapter modifies every transformer layer with <0.1% additional parameters to achieve 87.3% Big Five accuracy and 96.7% depression/life satisfaction accuracy across GPT-2, Gemma, and Llama 3. This architecture-level approach bypasses prompt resistance entirely.

Can we track and steer personality shifts during model finetuning?

Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.

Does safety alignment harm models' ability to roleplay villains?

The Moral RolePlay benchmark shows LLM performance drops from 3.21 for moral paragons to 2.62 for villains, with largest degradation between flawed-but-good and egoistic characters. Models fail most on deception and manipulation traits, substituting crude aggression for nuanced malevolence.

Research prompt for your LLMexpand ↓

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-examining whether combined role + personality prompts produce stable behavioral changes in LLMs, treating dated claims as constraints to re-test rather than truth.

What a curated library found — and when (findings span 2023–2026, treat as dated constraints):
• Open models retain a trained-in default 'helpful assistant' temperament and resist personality conditioning; layering role *on top of* personality improves effect but doesn't fully override intrinsic pull (2024-01, arXiv:2401.07115).
• Prompted personas collapse under adversarial pressure and across conversations, whereas RLHF-trained dispositions persist — the prompt layer is the fragile one (2024-07, arXiv:2407.12393).
• Persona consistency decays in multi-turn dialogue; training a RL simulator with consistency rewards cut drift by >55%, but writing better prompts did not (2025-10, arXiv:2511.00222).
• The single dominant axis of persona space is 'distance from default Assistant'; emotional conversation tugs models back to default (2026-01, arXiv:2601.10387).
• Lightweight adapter edits (PsychAdapter, <0.1% params) and linear activation steering (Persona Vectors) achieve higher stability than prompts, explicitly bypassing prompt resistance (2024-12, arXiv:2412.16882; 2025-07, arXiv:2507.21509).

Anchor papers (verify; mind their dates):
• arXiv:2401.07115 (2024-01): Open Models, Closed Minds?
• arXiv:2407.12393 (2024-07): PersLLM
• arXiv:2511.00222 (2025-10): Consistently Simulating Human Personas with Multi-Turn RL
• arXiv:2601.10387 (2026-01): The Assistant Axis

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For each finding, judge whether newer model releases (o1, claude-4x, open weights ≥7B trained post-2025Q3), training methods (DPO, IPO, constitutional AI), orchestration (long-context memory, multi-agent scaffolding, retrieval-augmented personas), or evals have since relaxed or overturned it. Separate the durable question ("can prompts stably steer personality?") from perishable limitations ("open models can't hold roles"). Cite what relaxed each constraint; plainly flag where it still holds.
(2) **Surface contradicting or superseding work** from the last ~6 months claiming prompts *do* achieve stability, or arguing the default-Assistant axis no longer dominates.
(3) **Propose 2 research questions assuming the regime moved:** e.g., Do models trained on synthetic persona-consistent data (2025–2026) show reduced drift without RL? Do in-context persona adaptation (e.g., example-based conditioning) outperform prefix prompts?

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Next inquiring lines