Can offline RL scale persona consistency across multi-turn conversations?
This explores whether reinforcement learning trained on logged conversations, rather than on the model's own fresh rollouts or on live interaction, can hold a character steady turn after turn. The corpus has little on 'offline' RL specifically, so it mostly answers the broader question of what actually fixes persona drift.
This reads the question as asking two things at once: does RL help personas stay consistent across a long conversation, and is that a scaling problem you can throw more training at? The corpus has a direct hit on the first part. One line of work inverts the usual setup and uses RL to train the *user simulator* rather than the assistant, rewarding it on three kinds of consistency (prompt-to-line, line-to-line, and Q&A factual agreement), and cuts persona drift by more than 55% (Can training user simulators reduce persona drift in dialogue?). The key move there is that drift isn't one failure but three: local drift inside a turn, global drift across the whole dialogue, and outright factual contradictions. The reward signal has to target each separately. That's the strongest evidence in the corpus that an RL objective shaped around cross-turn coherence does improve persona consistency where ordinary training does not.
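To make the three-part reward concrete, here is a minimal sketch of how the three consistency signals described above might be folded into one scalar. It is an illustration, not the cited work's implementation: the `ConsistencyScores` container, the assumption that each score lies in [0, 1], and the equal default weights are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ConsistencyScores:
    """Three consistency signals, each assumed to lie in [0, 1]."""
    prompt_to_line: float  # does the line agree with the persona prompt?
    line_to_line: float    # does the line agree with earlier lines?
    qa_agreement: float    # do probe answers match the persona's stated facts?


def consistency_reward(scores, weights=(1.0, 1.0, 1.0)):
    """Weighted mean of the three signals (the weights are an assumption)."""
    parts = (scores.prompt_to_line, scores.line_to_line, scores.qa_agreement)
    for p in parts:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each score must lie in [0, 1]")
    return sum(w * p for w, p in zip(weights, parts)) / sum(weights)


# Made-up scores: one steady speaker, one who contradicts known facts.
steady = ConsistencyScores(0.9, 0.9, 0.9)
contradicts_itself = ConsistencyScores(0.9, 0.9, 0.1)
print(consistency_reward(steady))
print(consistency_reward(contradicts_itself))
```

Keeping the three scores separate until the final step means a drop in any one of them stays visible in logs, which fits the point that each failure type needs its own signal.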
The sharper finding is *why* you need a special objective at all. Persona adherence does not ride along with general model capability: Claude 3.5 Sonnet beat GPT-3.5 by only 2.97% on persona consistency despite an enormous capability gap, because standard training optimizes per-turn quality, not coherence across turns (Does model capability translate to better persona consistency?). So 'scale' in the sense of bigger-model-bigger-budget won't buy you consistency; the gains have to come from an objective that explicitly prices in the whole conversation. This is also why prompt-only personas are fragile. Run the same persona prompt repeatedly and the variance across runs rivals the variance across different personas, meaning model uncertainty, not stable character, drives the output (Why do LLM persona prompts produce inconsistent outputs across runs?). And an LLM holds a superposition of plausible characters, resampling a fresh one at each generation rather than committing (Do large language models actually commit to a single character?).
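The run-to-run comparison can be checked with a simple variance split. The sketch below assumes each run has already been reduced to a single numeric score (for example, a rating the simulated persona gave); the function name, the data layout, and the numbers are all made up for illustration.

```python
from statistics import mean, pvariance


def variance_split(scores_by_persona):
    """Return (within_run_variance, between_persona_variance).

    scores_by_persona maps a persona name to the scores obtained from
    repeated runs of the same persona prompt.
    """
    # Average spread across repeated runs of the same persona.
    within = mean(pvariance(runs) for runs in scores_by_persona.values())
    # Spread of the per-persona averages.
    between = pvariance([mean(runs) for runs in scores_by_persona.values()])
    return within, between


# Made-up numbers for illustration only.
toy = {
    "gardener": [0.2, 0.8, 0.5],
    "pilot": [0.4, 0.9, 0.2],
    "chef": [0.6, 0.1, 0.7],
}
within, between = variance_split(toy)
print(within, between)
```

When `within` is as large as or larger than `between`, as in this toy data, the persona prompt explains little of the output and sampling noise explains most of it.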
Here's the part you might not expect: post-training (RLHF) seems to do something prompting can't. A 'realizationist' reading argues RLHF doesn't make the model *perform* a character. Instead it installs a stable disposition that survives adversarial pressure and persists across conversations, unlike prompt-induced role-play that collapses under jailbreaks (Are RLHF personas performed characters or realized dispositions?; Are LLM personas realized or merely simulated through training?). If that's right, RL-style training is doing exactly the thing the question asks about, but at the level of a baked-in default persona, not arbitrary user-specified ones. And there's a geometry to it: persona space is low-dimensional, dominated by an 'Assistant axis,' and emotional or self-reflective turns cause predictable drift along it that you can suppress with activation capping rather than retraining (How stable is the trained Assistant personality in language models?).
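As a rough picture of what capping along an axis means geometrically, the sketch below clamps the component of an activation vector along one direction and leaves everything orthogonal to it untouched. This is an assumption-laden toy: the symmetric bound, the three-dimensional vectors, and the function name are hypothetical, and the cited work may define its cap differently.

```python
import math


def cap_along_axis(activation, axis, max_projection):
    """Clamp the component of `activation` along `axis` to +/- max_projection.

    Components orthogonal to the axis are left unchanged.
    """
    norm = math.sqrt(sum(a * a for a in axis))
    if norm == 0:
        raise ValueError("axis must be non-zero")
    unit = [a / norm for a in axis]
    # Signed length of the activation along the axis.
    projection = sum(x * u for x, u in zip(activation, unit))
    capped = max(-max_projection, min(max_projection, projection))
    # Move the activation along the axis only by the excess amount.
    shift = capped - projection
    return [x + shift * u for x, u in zip(activation, unit)]


axis = [1.0, 0.0, 0.0]      # hypothetical persona direction
drifted = [5.0, 2.0, -1.0]  # hypothetical activation, far along the axis
print(cap_along_axis(drifted, axis, max_projection=3.0))  # [3.0, 2.0, -1.0]
```

An activation already inside the bound passes through unchanged, which is the sense in which such an intervention can leave ordinary behavior alone while limiting drift.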
Two cautions are worth carrying. First, consistency is not free: squeezing for high persona-adherence scores often just rewards copying the character description while ignoring what the user actually asked, so persona and discourse coherence have to be optimized jointly, not separately (Do persona consistency metrics actually measure dialogue quality?). An offline RL reward naively tuned for 'stay in character' could buy you a parrot. Second, there's a live alternative to retraining at all: optimize the persona at *test time*, treating it as an evolving intermediary between memory and action that updates against recent interactions (Can personas evolve in real time to match what users actually want?). That sidesteps the offline-vs-online question by moving adaptation out of the training loop entirely.
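One simple way to see why joint optimization matters is to compare how a combined score treats a parrot. The sketch below uses a harmonic mean, which stays low whenever either input is low. It is not the method of the cited work; the function name, the assumption that both scores lie in [0, 1], and the example numbers are hypothetical.

```python
def joint_reward(persona_fidelity, relevance):
    """Harmonic mean of two scores in [0, 1]; low if either score is low."""
    for score in (persona_fidelity, relevance):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
    if persona_fidelity + relevance == 0:
        return 0.0
    return 2 * persona_fidelity * relevance / (persona_fidelity + relevance)


# Made-up scores: a reply that copies the persona text but ignores the
# question, versus a reply that is moderately good on both counts.
parrot = joint_reward(persona_fidelity=1.0, relevance=0.1)
balanced = joint_reward(persona_fidelity=0.7, relevance=0.7)
print(parrot, balanced)
```

A plain average would still give the parrot 0.55, whereas the harmonic mean drops it well below the balanced reply, so in-character copying cannot compensate for ignoring the user.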
The honest gap: the corpus doesn't contain work labeled 'offline RL' for persona consistency per se. What it does say is that the *idea* behind your question is sound — an RL objective built around cross-turn consistency demonstrably reduces drift — but the binding constraint isn't data or scale, it's reward design: you need rewards that distinguish local from global drift, that don't trade away relevance, and ideally that exploit the low-dimensional structure of persona space. Offline RL's natural advantage — learning from large logs of real multi-turn conversations — fits that need well, but the surrounding evidence says success will hinge entirely on what those logged rewards actually measure.
Sources (9 notes)
Inverting the standard RL setup to train user simulators for consistency, with three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, reduces persona drift by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.
Claude 3.5 Sonnet achieved only a 2.97% improvement over GPT-3.5 on persona consistency despite massive capability gaps, suggesting persona adherence is orthogonal to model scaling. Standard training objectives optimize for per-turn quality, not cross-turn coherence.
When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.
Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, indicating that no fixed commitment exists.
Post-training installs stable dispositional profiles that persist under adversarial pressure, marking them as realized rather than performed. The stickiness of trained personas across conversations distinguishes them from prompt-induced role-play that collapses under jailbreaks.
Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.
Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.
High persona adherence scores often come from copying character descriptions while ignoring query relevance. MUDI jointly optimizes both by using discourse relations and graph-based coherence modeling alongside persona fidelity, showing that persona and context must be optimized together, not separately.
PersonaAgent uses structured personas to bridge episodic/semantic memory and personalized actions, optimizing them at test time by simulating recent interactions against textual feedback. Learned personas cluster meaningfully in latent space, suggesting genuine user-specific separation beyond standard post-training drift.