INQUIRING LINE

Why do most open language models resist personality conditioning via prompts?

This explores why prompting alone often can't make an open model adopt a personality you assign it — and what's actually happening underneath when it refuses to budge.


The short version from the corpus: prompts don't write to the place where personality lives. Several notes converge on the idea that a model's default character is installed during training, not improvised at runtime. One study found that most open models simply ignore personality instructions and snap back to an intrinsic, ENFJ-like default; only a handful of unusually flexible models comply (Can open language models adopt different personalities through prompting?). That same ENFJ gravitational pull shows up independently: personas systematically collapse toward that one rare type and don't loosen as models get more advanced, which points to training-induced alignment rather than a capability the model just hasn't unlocked yet (Why do AI personas default to the same personality type?).

The deeper reason is a general fact about prompting, not a quirk of personality. Prompts only reorganize what's already in the training distribution: they can activate existing knowledge but can't inject what isn't there (Can prompt optimization teach models knowledge they lack?). Personality conditioning runs into the same wall. When a trained association is strong, in-context instructions lose the tug-of-war, and the model generates from its priors instead of from your prompt (Why do language models ignore information in their context?). A persona instruction is just more context, and context is weak against a disposition baked in during post-training. One account frames this as personas being genuinely *realized* through training, as substrate-level dispositions that resist even adversarial pressure, rather than costumes the model puts on for a turn (Are LLM personas realized or merely simulated through training?). There's even a measurable 'Assistant axis' that dominates persona space and keeps tugging the model back toward its default helper identity (How stable is the trained Assistant personality in language models?).

Here's the twist that makes 'resistance' a slightly misleading word. When prompts *do* seem to shift a persona, the result is often unstable rather than obedient. Run the same persona prompt repeatedly and the variation between runs can match or exceed the variation between entirely different personas, meaning what looks like adopting a character is partly the model's own uncertainty leaking through (Why do LLM persona prompts produce inconsistent outputs across runs?). Relatedly, models don't commit to a single character so much as hold a superposition and sample from it; regenerate the answer and you get a different character each time, each one consistent with the conversation so far (Do large language models actually commit to a single character?). So the failure mode isn't just rigidity: prompts can't reliably move the underlying distribution in a stable direction.
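The run-to-run comparison above can be made concrete with a small variance split. This is a minimal sketch, not the cited study's protocol: the function name `variance_decomposition` and the trait scores are hypothetical, made-up numbers chosen only to show the arithmetic of "variance across runs" versus "variance across personas".

```python
import statistics

def variance_decomposition(scores_by_persona):
    """Split trait-score variance into within-persona and between-persona parts.

    scores_by_persona maps a persona label to a list of scores, one score
    per repeated run of the same persona prompt.
    """
    # Within: average variance across repeated runs of the same persona prompt.
    within = statistics.mean(
        [statistics.pvariance(runs) for runs in scores_by_persona.values()]
    )
    # Between: variance of the per-persona mean scores.
    persona_means = [statistics.mean(runs) for runs in scores_by_persona.values()]
    between = statistics.pvariance(persona_means)
    return within, between

# Hypothetical extraversion scores (1-5 scale), four runs per persona prompt.
scores = {
    "introvert": [2.0, 4.0, 3.0, 3.0],
    "extravert": [3.0, 5.0, 4.0, 4.0],
    "neutral": [4.0, 2.0, 3.0, 5.0],
}

within, between = variance_decomposition(scores)
print(f"within-persona variance:  {within:.3f}")
print(f"between-persona variance: {between:.3f}")
```

When the within-persona figure is as large as or larger than the between-persona one, as in this toy data, the persona label explains little of what the model outputs.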

What actually works tells you where personality really lives, and it's not the prompt. Lightweight adapters that touch every transformer layer with under 0.1% extra parameters reach 87.3% accuracy on Big Five traits across GPT-2, Gemma, and Llama 3, bypassing prompt resistance entirely by writing to the architecture instead of the context window (Can we control personality in language models without prompting?). The same lesson shows up in activation space: traits like sycophancy correspond to linear 'persona vectors' that can be monitored and steered directly (Can we track and steer personality shifts during model finetuning?), and capping movement along the persona axis controls drift without hurting capability (How stable is the trained Assistant personality in language models?). The pattern across all of this: personality is a property of weights and activations, so the lever that moves it is weights and activations. Prompts are knocking on the wrong door.
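To make "monitored and steered directly" and "capping movement along the persona axis" less abstract, here is a toy sketch on plain Python lists. It assumes a difference-of-means direction as one simple way to obtain a linear trait direction; the helper names (`persona_direction`, `steer`, `cap`), the three-dimensional activations, and the threshold are all hypothetical, and the cited papers' actual procedures may differ.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def persona_direction(trait_acts, baseline_acts):
    """Unit difference-of-means direction (an assumed, simple estimator)."""
    diff = [t - b for t, b in zip(mean_vector(trait_acts), mean_vector(baseline_acts))]
    norm = math.sqrt(dot(diff, diff))
    return [x / norm for x in diff]

def steer(activation, direction, alpha):
    """Shift an activation by alpha units along the direction."""
    return [a + alpha * d for a, d in zip(activation, direction)]

def cap(activation, direction, max_proj):
    """Clamp the component along the direction to max_proj; leave the rest alone."""
    excess = max(0.0, dot(activation, direction) - max_proj)
    return [a - excess * d for a, d in zip(activation, direction)]

# Made-up activations: responses that show the trait vs. responses that do not.
trait_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
baseline_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
direction = persona_direction(trait_acts, baseline_acts)

h = [0.5, 2.0, -1.0]                      # a hypothetical hidden state
monitored = dot(h, direction)             # monitoring: read off the projection
steered = steer(h, direction, alpha=2.0)  # steering: push along the direction
capped = cap(steered, direction, max_proj=1.0)  # capping: limit drift on the axis
print(monitored, steered, capped)
```

In a real model the same three operations would run on hidden states inside a forward pass rather than on hand-written lists; the point of the sketch is only that all three act on activations, not on prompt text.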


Sources (10 notes)

Can open language models adopt different personalities through prompting?

Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.

Why do AI personas default to the same personality type?

Research shows language models assigned personas systematically default to ENFJ (the rarest human type) and exhibit motivated reasoning that persists across model generations. Persona consistency does not improve with advanced models, suggesting training-induced alignment rather than capability limits.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Are LLM personas realized or merely simulated through training?

Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.

How stable is the trained Assistant personality in language models?

Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.

Why do LLM persona prompts produce inconsistent outputs across runs?

When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.

Do large language models actually commit to a single character?

Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, proving no fixed commitment exists.

Can we control personality in language models without prompting?

PsychAdapter modifies every transformer layer with <0.1% additional parameters to achieve 87.3% Big Five accuracy and 96.7% depression/life satisfaction accuracy across GPT-2, Gemma, and Llama 3. This architecture-level approach bypasses prompt resistance entirely.

Can we track and steer personality shifts during model finetuning?

Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a persona-steering researcher. The question: **Why do most open language models resist personality conditioning via prompts, and has this constraint been relaxed or dissolved since early 2024?**

What a curated library found — and when (dated claims, not current truth):
Findings span 2020–2026; treat all as perishable:
• Most open models ignore personality instructions and snap back to an intrinsic ENFJ-like default; only a handful comply (~2024, arXiv:2401.07115).
• Prompts activate existing training knowledge but cannot inject new content; personality conditioning loses tug-of-war against strong priors baked in post-training (~2024–2025).
• In-context persona instructions are unstable across runs; variation between regenerations can match variation between entirely different personas (~2024, arXiv:2404.*).
• Lightweight adapters (<0.1% extra parameters) touching every transformer layer achieve high Big Five trait accuracy, bypassing prompt resistance entirely (~2025, arXiv:2412.16882).
• Linear 'persona vectors' in activation space enable direct steering and monitoring (~2025, arXiv:2507.21509); capping movement along the Assistant Axis controls drift without hurting capability (~2026, arXiv:2601.10387).

Anchor papers (verify; mind their dates):
• arXiv:2401.07115 (Jan 2024) — Open models' closed minds to personality.
• arXiv:2412.16882 (Dec 2024) — PsychAdapter transformer-layer approach.
• arXiv:2507.21509 (Jul 2025) — Persona vectors in activation space.
• arXiv:2601.10387 (Jan 2026) — The Assistant Axis as dominant steering dimension.

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, judge whether newer models (o1, Claude 4.5, Llama 3.2+), fine-tuning methods (DPO, IPO variants), multi-turn RL (arXiv:2511.00222), or orchestration (memory systems, agentic loops with persistent context) have since relaxed prompt resistance. Separate the durable question — *does personality reside in weights?* — from the perishable limitation — *prompts cannot move it*. What evidence shows prompts still fail, or that they now succeed under specific conditions (chain-of-thought, few-shot example anchoring, tool-use scaffolding)?
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months.** Look for claims that multi-turn RL (2511.00222), self-referential processing (2510.24797), or intent-mismatch repair (2602.07338) now enable stable prompt-based persona adoption. Does any paper show prompt sensitivity has *increased* rather than decreased, and why?
(3) **Propose 2 research questions assuming the regime has moved:** (a) If adapters are the answer, can prompt-adapter *hybrids* — e.g., prompt + sparse update to key layers — match full-adapter accuracy at <0.01% overhead? (b) Do persona vectors discovered in one model's activation space transfer to another, and does transfer fidelity degrade with model scale or training method?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
