Can general chatbot skill predict how well models roleplay adversarial personas?
The question here is whether a model's general assistant competence is a good proxy for its ability to convincingly play personas that pull *against* its trained defaults. The corpus suggests the relationship may be inverse rather than predictive.
Read plainly: if a model is a strong general chatbot, does that skill carry over to roleplaying adversarial or oppositional personas? The corpus doesn't measure that correlation head-on, but it reframes the question in a way that's more interesting than a yes/no: the very thing that makes a model a good general assistant may be what makes adversarial roleplay *harder*, not easier.
The sharpest clue comes from work mapping the geometry of persona space, where the single dominant dimension turns out to be distance from the default Assistant, and post-training keeps models loosely but persistently tethered to that Assistant mode (How stable is the trained Assistant personality in language models?). An adversarial persona is, almost by definition, a request to move far along that axis. Two related accounts of post-training argue that the Assistant isn't a costume the model puts on but a *realized* disposition installed during training, one that stays sticky and resists adversarial pressure, in contrast to prompt-induced role-play that collapses under jailbreaks (Are RLHF personas performed characters or realized dispositions?; Are LLM personas realized or merely simulated through training?). Put those together and you get a counterintuitive prediction: the better and more robustly a model has been tuned into the helpful-assistant groove, the more it may snap back toward that default when asked to sustain a hostile or oppositional character. General skill and adversarial-persona fidelity could trade off rather than track each other.
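One way to make "distance from the default Assistant" concrete is a scalar projection onto a single direction. The sketch below is illustrative only: `persona_vec`, `assistant_vec`, and `axis` are hypothetical embedding or activation vectors, and the cited work's actual procedure for finding the axis is not reproduced here.

```python
import math


def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("axis must be a non-zero vector")
    return [x / norm for x in v]


def axis_displacement(persona_vec, assistant_vec, axis):
    """Signed displacement of a persona from the Assistant along `axis`.

    All three arguments are hypothetical vectors of equal length.
    A larger absolute value means the persona sits further from the
    default Assistant along that one direction.
    """
    if not (len(persona_vec) == len(assistant_vec) == len(axis)):
        raise ValueError("vectors must have the same length")
    u = unit(axis)
    return sum((p - a) * ui for p, a, ui in zip(persona_vec, assistant_vec, u))
```

Under this framing, an adversarial persona is one whose target displacement is large, and "snapping back" would show up as the displacement shrinking toward zero over the course of a conversation.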
There's also a measurement problem hiding in the question. "How well a model roleplays" isn't one thing. Research on user simulators separates several distinct failure modes (local drift within a turn, global drift across a conversation, and outright factual contradiction) and shows that consistency rewards targeting them can cut persona drift by over 55% (Can training user simulators reduce persona drift in dialogue?). A model could be excellent at staying in character locally yet quietly drift back to its Assistant register over many turns, exactly where adversarial personas are hardest to hold. So a single "general skill" number can't cleanly predict performance, because there isn't a single performance axis to predict.
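To see why one number hides the distinction, consider a minimal sketch that summarizes a conversation two ways. It assumes a hypothetical list of per-turn consistency scores in [0, 1] (for example, a judge rating each line against the persona prompt); it is not the cited paper's metric, only an illustration of how an average can mask decay over turns.

```python
from statistics import mean


def summarize_drift(scores):
    """Summarize per-turn persona-consistency scores.

    `scores` is a hypothetical list of floats in [0, 1], one per turn.
    Returns the overall mean plus a crude global-drift estimate:
    mean of the first half minus mean of the last half. A positive
    value means consistency decayed as the conversation went on.
    """
    if len(scores) < 2:
        raise ValueError("need at least two turns")
    half = len(scores) // 2
    early, late = scores[:half], scores[-half:]
    return {
        "mean": mean(scores),
        "global_drift": mean(early) - mean(late),
    }
```

Two conversations can share the same mean score while one holds steady and the other slides back toward the Assistant register in its later turns, which is the case the paragraph above describes.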
How humans judge a chatbot is itself dominated by something orthogonal to roleplay craft. Studies of how people mentally model dialogue partners find that perceived *competence* accounts for roughly half the variance in user impressions (49%), well ahead of human-likeness (32%) and communicative flexibility (19%) (How do users mentally model dialogue agent partners?). That means a model can read as a "good chatbot" mostly by seeming capable, a judgment that says little about whether it can inhabit a persona that's supposed to be incompetent, evasive, or hostile. The two skills are scored on different dimensions.
The corpus does hint at what *would* predict good adversarial roleplay, and it's not general fluency but grounding. Persona quality improves when the character is built from expert-written profiles plus retrieved memories relevant to that character's psychology (Can LLMs predict character choices from narrative context?), from a structured multi-layer specification combining traits, subtopics, and context (Can synthetic dialogues become realistic through layered diversity?), or from personas treated as evolving intermediaries optimized at test time against feedback (Can personas evolve in real time to match what users actually want?). The honest answer, then: the library has no direct evidence that general chatbot skill predicts adversarial-persona performance, and several lines of work suggest the assistant disposition that defines a strong general chatbot is precisely the gravity an adversarial persona has to escape.
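For reference, the direct test the corpus lacks would be a rank correlation between two per-model scores. The sketch below assumes two hypothetical, equal-length lists (a general-assistant benchmark score and an adversarial-persona fidelity score per model) and computes Spearman's rho with average ranks for ties. No such scores appear in the sources; the function only shows the shape of the measurement.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(general_scores, persona_scores):
    """Spearman rank correlation between two hypothetical score lists."""
    if len(general_scores) != len(persona_scores) or len(general_scores) < 2:
        raise ValueError("need two equal-length lists with at least two models")
    rx, ry = average_ranks(general_scores), average_ranks(persona_scores)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / (var_x * var_y) ** 0.5
```

A rho near +1 would mean general skill tracks adversarial-persona fidelity, while a rho near -1 would match the trade-off the argument above predicts. Given the measurement problem already noted, the persona score would likely need to be computed separately for each failure mode rather than once.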
Sources (8 notes)
Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.
Post-training installs stable dispositional profiles that persist under adversarial pressure, marking them as realized rather than performed. The stickiness of trained personas across conversations distinguishes them from prompt-induced role-play that collapses under jailbreaks.
Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.
By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.
The Partner Modelling Questionnaire reveals that perceived competence dominates user impressions (49% of variance), followed by human-likeness (32%) and communicative flexibility (19%). This three-factor structure reflects how people evaluate dialogue partners against both functional and social standards.
The LIFECHOICE benchmark (1,462 decisions across 388 novels) shows LLMs predict character choices better when given expert-written persona profiles paired with retrieved memories relevant to the character's psychology. This persona-based approach outperforms automated summarization by 5%.
Research shows that realistic synthetic dialogues require three multiplicative layers: subtopic specificity, Big Five persona variation, and 11 contextual characteristics via Chain of Thought reasoning. This structured approach captures 90.48% of in-domain dialogue performance.
PersonaAgent uses structured personas to bridge episodic/semantic memory and personalized actions, optimizing them at test time by simulating recent interactions against textual feedback. Learned personas cluster meaningfully in latent space, suggesting genuine user-specific separation beyond standard post-training drift.