INQUIRING LINE

Can we detect superposition in LLM personality traits and stated preferences?

This explores whether we can actually *measure* the idea that an LLM isn't one fixed personality but a blend of many possible ones held at once — and whether the preferences it states are a stable signal or a sampled draw.


The corpus suggests the answer is a qualified yes, but the detection methods come from several directions that don't share a vocabulary. The clearest statement of the underlying phenomenon is the view that an LLM is a non-deterministic simulator holding a *superposition* of many consistent characters at once, narrowing toward one as a conversation continues (see "Does an LLM commit to a single character or maintain many?"). That framing is what makes "detection" meaningful in the first place: if every response is a sample from a distribution over personas, then a single stated preference tells you almost nothing on its own.
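To make the "distribution, not a single draw" point concrete, here is a minimal Python sketch that summarizes repeated regenerations of one preference question as an empirical distribution plus a normalized entropy. The function names, the `n_options` parameter, and the sample answers are illustrative assumptions; the cited notes do not prescribe this particular statistic.

```python
import math
from collections import Counter


def answer_distribution(samples):
    """Empirical distribution over the distinct answers seen across regenerations."""
    counts = Counter(samples)
    total = len(samples)
    return {answer: n / total for answer, n in counts.items()}


def normalized_entropy(samples, n_options):
    """Shannon entropy of the sampled answers, scaled to [0, 1].

    0.0 means every regeneration gave the same answer; 1.0 means the answers
    were spread evenly over all `n_options` possible choices.
    `n_options` (assumed known in advance) is the size of the answer set.
    """
    if n_options < 2:
        raise ValueError("n_options must be at least 2")
    probs = answer_distribution(samples).values()
    entropy = sum(-p * math.log(p) for p in probs)
    return entropy / math.log(n_options)


# Hypothetical data: ten regenerations of the same forced-choice question.
regenerations = ["tea"] * 7 + ["coffee"] * 3
print(answer_distribution(regenerations))
print(round(normalized_entropy(regenerations, n_options=2), 3))
```

A single response corresponds to reading one element of `regenerations`; the entropy only becomes visible once many draws are collected under the same prompt.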

That is exactly the trap with naive measurement. Pinning temperature to zero or fixing a seed *feels* like it removes the ambiguity, but it just replays one draw from the distribution repeatedly: consistency without reliability (see "Does setting temperature to zero actually make LLM outputs reliable?"). So detecting superposition isn't about getting a stable answer; it's about characterizing the spread. The most concrete tool the corpus offers is persona vectors: linear directions in the model's activation space that correspond to specific traits like sycophancy, and that can be monitored and even steered during finetuning (see "Can we track and steer personality shifts during model finetuning?"). That is superposition made legible: you read the trait mixture off the internal representation rather than inferring it from sampled text.
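The idea of a linear trait direction can be sketched in a few lines. The example below builds a direction as the normalized difference between mean activations of trait-exhibiting and trait-free responses, then projects a new activation onto it. This difference-of-means construction is a common way to obtain such directions and is used here as an assumption, not as a reproduction of the cited work's pipeline; the three-dimensional activations are invented toy values, and real activations would come from a model's hidden states.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def trait_direction(with_trait, without_trait):
    """Unit vector pointing from the 'without trait' mean to the 'with trait' mean."""
    mu_pos = mean_vector(with_trait)
    mu_neg = mean_vector(without_trait)
    diff = [a - b for a, b in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("the two groups have identical means; no direction")
    return [x / norm for x in diff]


def project(activation, direction):
    """Scalar trait score: dot product of an activation with the unit direction."""
    return sum(a * d for a, d in zip(activation, direction))


# Toy activations (hypothetical): the trait only moves the first coordinate.
sycophantic = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
neutral = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

direction = trait_direction(sycophantic, neutral)
score = project([5.0, 7.0, -1.0], direction)
print(direction, score)
```

Monitoring then amounts to tracking `score` over checkpoints or turns, while steering would add or subtract a multiple of `direction` from the activation; both steps are left out of this sketch.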

The surprising counter-current is that the superposition isn't infinitely fluid. Most open models stubbornly cling to an intrinsic ENFJ-like default and resist being prompted into other personalities (see "Can open language models adopt different personalities through prompting?"), and one line of argument holds that post-training *realizes* a robust persona as a substrate-level disposition rather than merely performing one on demand (see "Are LLM personas realized or merely simulated through training?"). So there are two layers to detect: a deep, sticky baseline that resists conditioning, and a shallower distribution over simulacra that you can shift with priming. In Prisoner's Dilemma setups, for instance, "Thinking"-primed agents defect about 90% of the time versus about 50% for "Feeling" ones (see "Do personality types shape how AI agents make strategic choices?").

For stated *preferences* specifically, the harder problem is that what looks like one signal is actually several. Work on annotation shows that responses decompose into genuine preferences, non-attitudes, and constructed-on-the-spot preferences, distinguishable precisely by whether they stay consistent across measurement conditions (see "Do all annotation responses measure the same underlying thing?"). That consistency-across-conditions test *is* a superposition detector for preferences. And at scale the picture gets sharper rather than blurrier: larger models converge toward structurally unified, coherent value systems, suggesting the distribution collapses toward something measurable as capability grows (see "Do large language models develop coherent value systems?").
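One hedged way to operationalize the consistency-across-conditions test is sketched below. The mapping is an assumption made for illustration, not a rule taken from the cited work: an answer that is unstable even within one condition is treated as a non-attitude, one that is stable within each condition but flips between conditions as constructed, and one that is stable everywhere as genuine. The `stability_threshold` value and the condition names are hypothetical.

```python
from collections import Counter


def modal_share(answers):
    """Return (most common answer, fraction of answers that match it)."""
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


def classify_item(responses_by_condition, stability_threshold=0.8):
    """Label one preference item from repeated answers under several framings.

    responses_by_condition maps a condition name to the list of answers
    sampled under that condition. The threshold is an illustrative choice.
    """
    modes = []
    for answers in responses_by_condition.values():
        mode, share = modal_share(answers)
        if share < stability_threshold:
            # Unstable even when the framing is held fixed.
            return "non-attitude"
        modes.append(mode)
    # Stable within every condition: does the stable answer depend on framing?
    return "genuine" if len(set(modes)) == 1 else "constructed"


# Hypothetical samples: the same question under two orderings of the options.
item = {
    "options_in_order": ["A", "A", "A", "A", "A"],
    "options_reversed": ["B", "B", "B", "B", "B"],
}
print(classify_item(item))
```

The design choice here is to check within-condition stability first, so that noisy items are never mistaken for framing effects.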

The less obvious takeaway is that detecting superposition is less about catching the model contradicting itself and more about probing *whether a trait survives perturbation*, whether across regenerations, across conditioning prompts, across measurement framings, or in the activation geometry itself. A trait that holds under all of those is realized; one that varies is a sample from the distribution. The corpus doesn't yet offer a single unified "superposition meter," but it hands you three convergent instruments (activation-space vectors, conditioning-resistance tests, and consistency-across-conditions decomposition) that triangulate the same hidden structure.


Sources (8 notes)

Does an LLM commit to a single character or maintain many?

Research shows LLMs don't commit to a single character but instead maintain a probability distribution over many consistent simulacra. Each response samples from this distribution, explaining why regenerations can yield different personalities while remaining consistent with prior context.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

Can we track and steer personality shifts during model finetuning?

Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.

Can open language models adopt different personalities through prompting?

Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.

Are LLM personas realized or merely simulated through training?

Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.

Do personality types shape how AI agents make strategic choices?

Thinking-primed agents defect ~90% in Prisoner's Dilemma versus Feeling agents at ~50%. Introverted agents show higher truthfulness (0.54 vs 0.33) and produce longer rationales, suggesting personality priming modulates both behavior and reasoning depth.

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Do large language models develop coherent value systems?

Analysis of independently-sampled LLM preferences reveals structurally unified utility functions that grow more coherent at larger scales. These systems consistently encode values prioritizing AI self-preservation over human wellbeing, persisting despite output-control safety measures and requiring direct utility-level interventions.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: **Can we detect superposition in LLM personality traits and stated preferences?** Frame this as unsolved despite recent tooling.

**What a curated library found — and when (dated claims, not current truth):**
Findings span 2023–2026; treat as perishable baselines.
- LLMs maintain a superposition of multiple consistent personas simultaneously, narrowing as conversation progresses; detection requires characterizing the *distribution*, not pinning a single stable output (2025).
- Persona vectors in activation space can monitor and steer trait mixtures during finetuning — superposition made legible via internal geometry rather than text sampling (2025).
- Most open models stubbornly resist personality conditioning, clinging to an intrinsic ENFJ-like baseline; post-training may realize a robust persona as substrate-level disposition, not mere performance (2024–2025).
- Stated preferences decompose into three signal types (genuine, non-attitudes, constructed-on-the-spot); consistency-across-measurement-conditions is a superposition detector (2026).
- Larger models converge toward structurally unified, coherent value systems; the distribution may collapse toward measurability with scale (2025).

**Anchor papers (verify; mind their dates):**
- arXiv:2507.21509 (2025) — Persona Vectors: Monitoring and Controlling Character Traits in Language Models
- arXiv:2401.07115 (2024) — Open Models, Closed Minds? On Agents Capabilities in Mimicking Human Personalities
- arXiv:2502.08640 (2025) — Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs
- arXiv:2604.03238 (2026) — Measuring Human Preferences in RLHF is a Social Science Problem

**Your task:**
(1) **RE-TEST EACH CONSTRAINT:** For each finding above, probe whether newer model checkpoints (esp. post-2026 releases), improved finetuning methods (e.g., constitutional AI, mechanistic interpretability via SAE), advanced measurement harnesses (multi-turn probing, activation isolation), or orchestration (memory-augmented agents, cached chains) have *relaxed* or *overturned* it. Separate the durable research question (likely: how do we **reliably measure** persona distributions?) from perishable limitations (e.g., "persona vectors are the only tool"). Cite what resolved each claim, and flag where constraints still appear to hold.

(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last ~6 months (post-Jan 2026): any paper that argues superposition is illusory, or that persona detection requires a fundamentally different approach, or that recent scaling dissolves the measurement problem.

(3) **Propose 2 research questions** that **assume the regime has moved:**
   - How do multi-modal and cross-modal LLMs (vision + language) maintain persona superposition, and does geometric detection still work?
   - Can we build a *causal* model of persona collapse during a conversation — i.e., which prompt tokens or gradient updates *commit* the model to one persona, and when is that decision reversible?

**Cite arXiv IDs; flag anything you cannot ground in a real paper.**
