How does RLHF-induced mode collapse limit diversity in LLM-generated personas?
This explores whether the same preference-tuning (RLHF) that makes models well-behaved also flattens them into a single default personality, and what that does to attempts at generating varied personas. The corpus suggests the answer is yes, but in a more interesting way than "RLHF makes outputs samey." The clearest mechanism is that alignment training installs *one* communicative identity and locks it in: models can't switch register or trade off values the way humans do across contexts, so a persona prompt is fighting against a baked-in default rather than painting on a blank canvas (Can language models adapt communication style to different contexts?). That default isn't neutral, either: most open models stubbornly retain an ENFJ-like trained personality and resist conditioning toward anything else, and only a few flexible models actually adopt the persona you ask for (Can open language models adopt different personalities through prompting?).
What's worth knowing is that "mode collapse" here has two distinct flavors that the corpus pulls apart. One is the convergence story you'd expect from RL: training squeezes behavioral diversity by rewarding a narrow set of winning strategies, an entropy-collapse mechanism documented in reasoning and search agents alike, where the policy stops exploring and crowds onto a few reward-maximizing moves (Does reinforcement learning squeeze exploration diversity in search agents?). But the effect isn't uniform: preference tuning *reduces* lexical-syntactic diversity in code (where there's a correct answer to converge on) while *increasing* it in creative writing (where distinctiveness is rewarded) (Does preference tuning always reduce diversity the same way?). So RLHF doesn't simply delete diversity; it redistributes it toward whatever the reward signal treats as "good," which for persona work usually means a polished, agreeable default voice.
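To make "entropy collapse" concrete, here is a minimal sketch that measures the Shannon entropy of a policy's observed actions before and after a hypothetical round of RL. The action names and both samples are invented for illustration; they are not data from the cited notes.

```python
import math
from collections import Counter


def shannon_entropy(actions):
    """Shannon entropy (in bits) of the empirical distribution over actions."""
    counts = Counter(actions)
    total = len(actions)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical action logs: a policy that spreads over four moves,
# and the same policy after training has crowded onto one of them.
before_rl = ["search", "browse", "ask", "summarize"] * 2
after_rl = ["search"] * 7 + ["browse"]

entropy_before = shannon_entropy(before_rl)
entropy_after = shannon_entropy(after_rl)
print(f"before: {entropy_before:.2f} bits, after: {entropy_after:.2f} bits")
```

A falling entropy value under this kind of measurement is one simple signal that the policy has stopped exploring; it says nothing about whether the remaining strategy is a good one.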
The second flavor is subtler and arguably the bigger limit on persona diversity: the variety you *do* see across regenerations may be noise, not character. An LLM holds a superposition of possible simulacra that narrows as a conversation proceeds, so regenerating a persona samples different points from that distribution rather than committing to a stable self (Does an LLM commit to a single character or maintain many?). Tested directly, the variance of a single persona prompt across runs matches or exceeds the variance *between* different personas, meaning model uncertainty, not stable social knowledge, is doing the driving (Why do LLM persona prompts produce inconsistent outputs across runs?). That's the cruel version of the diversity problem: it isn't only that personas collapse toward sameness, it's that what looks like diversity is often just the model's own jitter.
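The within-versus-between comparison described above can be sketched as a simple variance decomposition. This is an illustrative check under stated assumptions, not the method used in the cited note: the persona names and the 1-5 ratings are hypothetical, and each list stands for repeated runs of one persona prompt on the same item.

```python
from statistics import mean, pvariance


def variance_decomposition(scores_by_persona):
    """Return (within, between) variance for repeated-run scores.

    within:  average variance of a persona's own runs (run-to-run jitter)
    between: variance of the per-persona mean scores (persona effect)
    """
    within = mean([pvariance(runs) for runs in scores_by_persona.values()])
    between = pvariance([mean(runs) for runs in scores_by_persona.values()])
    return within, between


# Hypothetical ratings: four regenerations per persona prompt.
runs = {
    "skeptical_engineer": [2.0, 4.0, 3.0, 5.0],
    "cheerful_teacher": [3.0, 5.0, 2.0, 4.0],
    "terse_lawyer": [4.0, 2.0, 5.0, 4.0],
}

within, between = variance_decomposition(runs)
print(f"within-persona: {within:.3f}, between-persona: {between:.3f}")
if within >= between:
    print("Run-to-run jitter dominates; the persona labels explain little.")
```

When the within-persona figure is as large as or larger than the between-persona figure, differences across personas cannot be distinguished from regeneration noise, which is the pattern the paragraph describes.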
There's a counter-current worth following. Some researchers treat post-trained personas as genuinely *realized* dispositions that resist adversarial pressure and persist at a substrate level, not surface costumes (Are LLM personas realized or merely simulated through training?). Persona assignment can also install bias deep enough that prompt-based debiasing fails to remove it, suggesting the persona operates below the instruction layer (Do personas make language models reason like biased humans?). If personas are that deep, the diversity problem isn't a prompting deficiency you can engineer around; it's downstream of what training carved in. The practical responses in the corpus lean toward *adding structure rather than reweighting the model*: stacking multiplicative diversity layers (subtopic, Big Five variation, contextual characteristics) to manufacture realistic spread (Can synthetic dialogues become realistic through layered diversity?), or conditioning a simulator on explicit latent variables (user profile, turn-level intent) so the variety comes from controlled inputs instead of hoping the model produces it (Can controlled latent variables make LLM user simulators realistic?).
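As a rough illustration of "adding structure" from the outside, the sketch below crosses three layers into a grid of persona specifications and samples distinct combinations from it. Everything here is an assumption made for the example: the layer values, the function names `sample_persona_specs` and `to_prompt`, and the prompt wording are hypothetical and do not reproduce the pipelines in the cited notes.

```python
import random
from itertools import product

# Hypothetical layer values, for illustration only.
SUBTOPICS = ["billing dispute", "travel planning", "career advice"]
TRAIT_PROFILES = [
    {"openness": "high", "agreeableness": "low"},
    {"openness": "low", "agreeableness": "high"},
    {"extraversion": "high", "neuroticism": "high"},
]
CONTEXTS = ["in a hurry", "first-time user", "writing late at night"]


def sample_persona_specs(k, seed=0):
    """Draw k distinct (subtopic, traits, context) combinations."""
    # The layers multiply: 3 * 3 * 3 = 27 possible specifications.
    grid = list(product(SUBTOPICS, TRAIT_PROFILES, CONTEXTS))
    return random.Random(seed).sample(grid, k)


def to_prompt(spec):
    """Render one specification as an explicit conditioning string."""
    subtopic, traits, context = spec
    trait_text = ", ".join(f"{level} {name}" for name, level in traits.items())
    return f"Topic: {subtopic}. Personality: {trait_text}. Situation: {context}."


specs = sample_persona_specs(k=5)
prompts = [to_prompt(spec) for spec in specs]
for prompt in prompts:
    print(prompt)
```

The design point is that the spread is decided by the sampled inputs, which can be counted and audited, rather than by whatever variation the model happens to produce on regeneration.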
The thread that ties these together: RLHF limits persona diversity at both ends. It compresses the model toward a single rewarded voice, and it leaves the residual variation too unstable to count as genuine character — so credible persona diversity ends up being something you scaffold from the outside, not something you recover from inside the aligned model.
Sources (10 notes)
System prompts and RLHF training lock models into one communicative identity across all interactions, preventing the contextual register-switching and value trade-offs that characterize human pragmatics. Users cannot reshape model behavior through dialogue negotiation.
Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.
RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.
Research shows LLMs don't commit to a single character but instead maintain a probability distribution over many consistent simulacra. Each response samples from this distribution, explaining why regenerations can yield different personalities while remaining consistent with prior context.
When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.
Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.
Assigning personas to LLMs induces identity-congruent evaluation bias, with models 90% more likely to accept evidence matching their assigned identity. Standard prompt-based debiasing fails to mitigate this effect, suggesting the bias operates below the level of instruction.
Research shows that realistic synthetic dialogues require three multiplicative layers: subtopic specificity, Big Five persona variation, and 11 contextual characteristics via Chain of Thought reasoning. This structured approach captures 90.48% of in-domain dialogue performance.
RecLLM demonstrates that conditioning an LLM simulator on session-level (user profile) and turn-level (user intent) latent variables produces synthetic conversations measurable as realistic via crowdsource discrimination, discriminator models, and classifier-ensemble distribution matching.