INQUIRING LINE

Can general chatbot skill predict how well models roleplay adversarial personas?

This explores whether a model's general assistant competence is a good proxy for its ability to convincingly play personas that pull *against* its trained defaults — and the corpus suggests the relationship may actually be inverse rather than predictive.


This reads the question as: if a model is a strong general chatbot, does that skill carry over to roleplaying adversarial or oppositional personas? The corpus doesn't measure that correlation head-on, but it reframes the question in a way that's more interesting than a yes/no — because the very thing that makes a model a good general assistant may be what makes adversarial roleplay *harder*, not easier.

The sharpest clue comes from work mapping the geometry of persona space, where the single dominant dimension turns out to be distance from the default Assistant, and post-training keeps models loosely but persistently tethered to that Assistant mode (How stable is the trained Assistant personality in language models?). An adversarial persona is, almost by definition, a request to move far along that axis. Two related accounts of post-training argue that the Assistant isn't a costume the model puts on; it's a *realized* disposition installed during training that stays sticky and resists adversarial pressure, in contrast to prompt-induced role-play that collapses under jailbreaks (Are RLHF personas performed characters or realized dispositions?; Are LLM personas realized or merely simulated through training?). Put those together and you get a counterintuitive prediction: the better and more robustly a model has been tuned into the helpful-assistant groove, the more it may snap back toward that default when asked to sustain a hostile or oppositional character. General skill and adversarial-persona fidelity could trade off rather than track each other.
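To make the geometric picture concrete, here is a toy sketch of what "position along an axis" and "capping along that axis" could mean for a single activation vector. It is an illustration under stated assumptions, not the method of the cited work: the function names, the plain-list vectors, and the convention that the axis points away from the default Assistant are all hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("axis must be non-zero")
    return [a / norm for a in v]

def axis_position(activation, axis):
    """Signed position of an activation along the unit-normalized axis.

    Assumption: the axis points away from the default Assistant,
    so larger values mean 'further from the Assistant'.
    """
    return dot(activation, normalize(axis))

def cap_along_axis(activation, axis, max_position):
    """Clamp the component along `axis` to at most `max_position`.

    Only the excess along the axis is subtracted; the part of the
    activation orthogonal to the axis is left unchanged.
    """
    unit = normalize(axis)
    excess = max(0.0, dot(activation, unit) - max_position)
    return [a - excess * u for a, u in zip(activation, unit)]
```

Under this sketch, an activation of `[3.0, 2.0, -1.0]` capped at `1.5` along the axis `[1.0, 0.0, 0.0]` keeps its other components and loses only the excess along the axis. That is the intuition behind the trade-off: anything that limits movement along the axis also limits how far a persona can get from the Assistant.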

There's also a measurement problem hiding in the question. "How well a model roleplays" isn't one thing. Research on user simulators separates three distinct failure modes (local drift within a turn, global drift across a conversation, and outright factual contradiction) and shows that targeting them with consistency rewards cuts persona drift by over 55% (Can training user simulators reduce persona drift in dialogue?). A model could be excellent at staying in character locally yet quietly drift back to its Assistant register over many turns, exactly where adversarial personas are hardest to hold. So a single "general skill" number wouldn't predict performance, because there isn't a single performance axis to predict.
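The source note for that work names three consistency metrics: prompt-to-line, line-to-line, and Q&A consistency. A minimal sketch of how such scores could be computed and combined into one reward follows. The definitions are one reading of the metric names, not the paper's implementation; the `judge` callable (for example, an LLM grader returning a score in [0, 1]) and the equal weights are assumptions.

```python
from statistics import mean
from typing import Callable, Dict, Sequence

# Hypothetical judge: returns a score in [0, 1], where 1.0 means "consistent".
Judge = Callable[[str, str], float]

def prompt_to_line(persona: str, lines: Sequence[str], judge: Judge) -> float:
    """Mean consistency of each generated line with the persona prompt.

    Raises statistics.StatisticsError if `lines` is empty.
    """
    return mean([judge(persona, line) for line in lines])

def line_to_line(lines: Sequence[str], judge: Judge) -> float:
    """Mean consistency of each line with every earlier line."""
    scores = [judge(lines[i], lines[j])
              for j in range(1, len(lines)) for i in range(j)]
    return mean(scores) if scores else 1.0  # a single line cannot contradict itself

def qa_consistency(reference: Dict[str, str], given: Dict[str, str],
                   judge: Judge) -> float:
    """Mean agreement between persona-derived reference answers and the
    answers the simulator gives in character to the same questions."""
    return mean([judge(reference[q], given[q]) for q in reference])

def consistency_reward(persona, lines, reference, given, judge,
                       weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted sum of the three scores (equal weights are an assumption)."""
    parts = (
        prompt_to_line(persona, lines, judge),
        line_to_line(lines, judge),
        qa_consistency(reference, given, judge),
    )
    return sum(w * p for w, p in zip(weights, parts))
```

Keeping the three scores separate before combining them is the point: a dialogue can score well against the persona prompt while still contradicting its own earlier lines, and a single averaged number would hide that.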

And how humans judge a chatbot is itself dominated by something orthogonal to roleplay craft: studies of how people mentally model dialogue partners find that perceived *competence* accounts for roughly half the variance in user impressions (49%), ahead of human-likeness (32%) and communicative flexibility (19%) (How do users mentally model dialogue agent partners?). That means a model can read as a "good chatbot" mostly by seeming capable, a judgment that says little about whether it can inhabit a persona that's supposed to be incompetent, evasive, or hostile. The skills are being scored on different dimensions.

The corpus does hint at what *would* predict good adversarial roleplay, and it's not general fluency but grounding. Persona quality improves when the character is built from expert-written profiles plus retrieved memories relevant to that character's psychology (Can LLMs predict character choices from narrative context?), from structured multi-layer specification combining traits, subtopics, and context (Can synthetic dialogues become realistic through layered diversity?), or from personas treated as evolving intermediaries optimized at test time against feedback (Can personas evolve in real time to match what users actually want?). The honest answer, then: the library has no direct evidence that general chatbot skill predicts adversarial-persona performance, and several lines of work suggest the assistant disposition that defines a strong general chatbot is precisely the gravity an adversarial persona has to escape.
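For reference, a direct test of the question would need paired scores per model (a general-assistant score and an adversarial-persona fidelity score) and a rank correlation between them. The sketch below computes Spearman's rank correlation with the standard library only. It is a generic statistical illustration, not something the corpus reports, and any scores fed to it would have to come from real evaluations.

```python
import math

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)

def spearman(general_scores, persona_scores):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(general_scores) != len(persona_scores) or len(general_scores) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(ranks(general_scores), ranks(persona_scores))
```

A value near +1 would mean general skill tracks adversarial fidelity, a value near -1 would support the trade-off reading, and a value near 0 would suggest the two are unrelated. Because roleplay quality is not one axis, the persona side would likely need one such correlation per fidelity measure rather than a single pooled score.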


Sources (8 notes)

How stable is the trained Assistant personality in language models?

Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.

Are RLHF personas performed characters or realized dispositions?

Post-training installs stable dispositional profiles that persist under adversarial pressure, marking them as realized rather than performed. The stickiness of trained personas across conversations distinguishes them from prompt-induced role-play that collapses under jailbreaks.

Are LLM personas realized or merely simulated through training?

Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.

Can training user simulators reduce persona drift in dialogue?

By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.

How do users mentally model dialogue agent partners?

The Partner Modelling Questionnaire reveals that perceived competence dominates user impressions (49% of variance), followed by human-likeness (32%) and communicative flexibility (19%). This three-factor structure reflects how people evaluate dialogue partners against both functional and social standards.

Can LLMs predict character choices from narrative context?

The LIFECHOICE benchmark (1,462 decisions across 388 novels) shows LLMs predict character choices better when given expert-written persona profiles paired with retrieved memories relevant to the character's psychology. This persona-based approach outperforms automated summarization by 5%.

Can synthetic dialogues become realistic through layered diversity?

Research shows that realistic synthetic dialogues require three multiplicative layers: subtopic specificity, Big Five persona variation, and 11 contextual characteristics via Chain of Thought reasoning. This structured approach captures 90.48% of in-domain dialogue performance.

Can personas evolve in real time to match what users actually want?

PersonaAgent uses structured personas to bridge episodic/semantic memory and personalized actions, optimizing them at test time by simulating recent interactions against textual feedback. Learned personas cluster meaningfully in latent space, suggesting genuine user-specific separation beyond standard post-training drift.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst probing whether general chatbot competence predicts adversarial-persona fidelity in LLMs—a question that may have shifted as models, training methods, and eval harnesses have evolved since mid-2023.

What a curated library found — and when (dated claims, not current truth): Findings span 2023–2026 and center on a potential *trade-off* rather than correlation:
• The 'Assistant Axis' is the dominant dimension of persona space; post-training keeps models tethered to default-Assistant mode, which may make adversarial personas harder to sustain the stronger the general-assistant tuning (2026; the trade-off is an inference from this finding, not a measured result).
• Multi-turn consistency failures (local drift, global drift, factual contradiction) are distinct; RL-based consistency rewards can cut drift by ~55% (2025).
• Persona quality improves via grounded specification (expert profiles + retrieved memories, multi-layer traits/subtopics/context) rather than general fluency alone (2024–2025).
• Perceived *competence* accounts for ~49% of the variance in user impressions of dialogue partners, a dimension largely orthogonal to roleplay fidelity, suggesting 'good chatbot' skill and 'good adversary' skill are scored on different dimensions (2023–2024).

Anchor papers (verify; mind their dates):
• arXiv:2601.10387 (2026) — The Assistant Axis
• arXiv:2511.00222 (2025) — Multi-Turn RL for Persona Consistency
• arXiv:2506.06254 (2025) — PersonaAgent: Test-Time Personalization
• arXiv:2406.01171 (2024) — Two Tales of Persona in LLMs (survey)

Your task:
(1) RE-TEST THE TRADE-OFF HYPOTHESIS. Has the Assistant Axis finding held under new models (o1, Claude 3.5+, Llama 3.2+)? Do newer preference-optimization or RL methods (DPO, IPO, or online RL) weaken the sticky tether to default mode? Separate the durable claim (that adversarial personas require escaping a learned disposition) from the perishable one (that post-training *prevents* that escape). Where does the constraint still hold, and what training/orchestration has relaxed it?

(2) Surface the strongest recent work (last 6 months) that *contradicts* the trade-off—i.e., evidence that general skill and adversarial fidelity *do* correlate, or that the Assistant Axis is no longer dominant under new architectures, training regimes, or eval metrics.

(3) Propose 2 forward-looking research questions that *assume* the regime may have shifted: (a) one on whether test-time adaptation (e.g., in-context grounding, dynamic memory injection) can decouple adversarial-persona fidelity from general-assistant tuning; (b) one on whether multi-agent orchestration (e.g., adversary + judge + monitor) achieves better sustained roleplay than single-model persona control.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
