Can online RL and trainable agents maintain persona consistency better than fixed environments?
This explores whether *training* a persona into an agent — through online RL, offline RL, or memory-based adaptation — keeps it more consistent than fixing the persona in a static prompt or environment, and what the tradeoffs of each route are.
This reads the question as: when you want an agent to *stay in character*, is it better to bake that consistency in through training (online RL, trainable agents) than to pin it down with a fixed prompt or scripted environment? The corpus answers with a fairly clear yes, but it splits sharply on *which kind* of training and shows that 'online RL' isn't one thing.
The strongest case for training over fixed setups comes from the contradiction-punishment line. Plain supervised fine-tuning can't enforce consistency because it only rewards good answers and never penalizes a character for contradicting itself (Why does supervised learning fail to enforce persona consistency?). Reinforcement learning fixes this precisely because it can punish drift: multi-turn RL on user simulators, using prompt-to-line, line-to-line, and Q&A consistency as reward signals, cut persona drift by over 55%. Those signals also caught distinct failure types: local drift inside a turn, global drift across a conversation, and outright factual contradiction (Can training user simulators reduce persona drift in dialogue?). So the consistency advantage isn't magic; it comes from being able to define and penalize the specific ways a persona falls apart.
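To make the three reward signals concrete, here is a minimal sketch of how prompt-to-line, line-to-line, and Q&A consistency could be folded into one scalar reward. It is an illustration under stated assumptions, not the cited work's implementation: the names `consistency_reward` and `token_overlap` are hypothetical, the equal weighting is arbitrary, and the token-overlap scorer is a toy stand-in for whatever judge (for example an LLM or NLI model) a real system would use.

```python
from typing import Callable, List, Sequence

# A scorer maps (reference, candidate) to a consistency value in [0, 1].
Scorer = Callable[[str, str], float]


def token_overlap(a: str, b: str) -> float:
    """Toy stand-in scorer (assumption): Jaccard overlap of lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def _mean(values: Sequence[float], default: float = 1.0) -> float:
    """Average, or `default` when there is nothing to compare."""
    return sum(values) / len(values) if values else default


def consistency_reward(
    persona: str,
    lines: List[str],
    qa_answers: List[str],
    score: Scorer = token_overlap,
    weights: Sequence[float] = (1 / 3, 1 / 3, 1 / 3),
) -> float:
    """Weighted sum of three hypothetical consistency terms."""
    # Prompt-to-line: does each generated line agree with the persona prompt?
    prompt_to_line = _mean([score(persona, line) for line in lines])
    # Line-to-line: does each line agree with every earlier line?
    line_to_line = _mean(
        [score(lines[i], lines[j]) for j in range(len(lines)) for i in range(j)]
    )
    # Q&A: do answers to probe questions agree with the persona prompt?
    qa = _mean([score(persona, answer) for answer in qa_answers])
    w_p, w_l, w_q = weights
    return w_p * prompt_to_line + w_l * line_to_line + w_q * qa
```

Keeping the three terms separate before summing is the design point: a low prompt-to-line term with a high line-to-line term would indicate a conversation that is internally coherent but has left the persona, which a single blended score would hide.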
But here's the twist the question doesn't anticipate: you may not need a dedicated persona-training run at all. The realizationist work argues that *post-training* already installs personas as sticky, substrate-level dispositions that resist jailbreaks and adversarial pressure, unlike prompt-induced role-play, which collapses (Are RLHF personas performed characters or realized dispositions?; Are LLM personas realized or merely simulated through training?). Yet even that 'realized' Assistant persona is only loosely tethered: persona space has a dominant axis, and emotional or meta-reflective conversation drifts the model along it predictably (How stable is the trained Assistant personality in language models?). So fixed-prompt consistency is real but shallow, and trained consistency is deeper but still leaky along known directions.
The more surprising route sits between 'fixed prompt' and 'retrain the model': make the *agent* trainable without touching its weights. PersonaAgent treats the persona as an evolving intermediary between memory and action, optimizing it at test time by simulating recent interactions against feedback, and the learned personas cluster into genuinely user-specific regions of latent space (Can personas evolve in real time to match what users actually want?). AgentFly pushes this further, doing online RL entirely through memory operations in a memory-augmented MDP and adapting continually with the parameters frozen (Can agents learn continuously from experience without updating weights?). This is the real answer to 'trainable vs. fixed': the win isn't weight updates, it's a persistent, updatable state that a static environment lacks. The same insight sits behind externalized skill libraries, which let agents accumulate competence without the catastrophic forgetting that weight updates cause (Can agents learn new skills without forgetting old ones?).
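The idea of adapting through memory while the model stays frozen can be sketched in a few lines. This is a toy illustration under loud assumptions, not AgentFly's or PersonaAgent's actual design: `CaseMemory`, `frozen_policy`, and `act` are hypothetical names, the "frozen model" is a fixed function standing in for an LLM, and states are matched by exact string rather than by learned similarity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CaseMemory:
    """Hypothetical case store: rewards observed per (state, action) pair."""

    cases: Dict[Tuple[str, str], List[float]] = field(default_factory=dict)

    def write(self, state: str, action: str, reward: float) -> None:
        # Learning happens here: only the memory changes, never the policy.
        self.cases.setdefault((state, action), []).append(reward)

    def value(self, state: str, action: str, default: float = 0.0) -> float:
        # Mean observed reward, or `default` for unseen pairs.
        rewards = self.cases.get((state, action))
        return sum(rewards) / len(rewards) if rewards else default


def frozen_policy(state: str) -> List[str]:
    """Stand-in for a frozen model: always proposes the same candidates."""
    return ["formal_reply", "casual_reply"]


def act(state: str, memory: CaseMemory) -> str:
    """Pick the candidate with the best remembered value (ties keep proposal order)."""
    candidates = frozen_policy(state)
    return max(candidates, key=lambda action: memory.value(state, action))


memory = CaseMemory()
first_choice = act("greeting", memory)          # no experience yet
memory.write("greeting", "formal_reply", -1.0)  # user disliked the formal tone
memory.write("greeting", "casual_reply", 1.0)   # user liked the casual tone
second_choice = act("greeting", memory)         # behavior shifts, weights do not
```

The behavior change between `first_choice` and `second_choice` comes entirely from the stored cases, which is the sense in which the agent is trainable while the underlying policy is fixed.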
The corpus adds two cautions for free. First, cost and signal matter. Offline RL on annotated contradictions is pitched explicitly as a cheaper alternative to expensive online RL that keeps the contradiction-punishment benefit (Why does supervised learning fail to enforce persona consistency?). Crude reward shapes also backfire: binary rewards push models toward confident, miscalibrated guessing unless you add a proper scoring rule (Does binary reward training hurt model calibration?). Second, the 'fixed environment' may be flattering your agent. Simulations look impressively consistent when one model secretly controls everyone, but break down under genuine information asymmetry, where the agent has to actually do the grounding work it was skipping (Why do LLMs fail when simulating agents with private information?). The honest test of persona consistency is an environment that *isn't* fixed in your favor.
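The calibration caution above can be made concrete with a small sketch. It assumes the model reports a confidence in [0, 1] alongside each answer, and it writes the combined reward as correctness minus the Brier penalty, which is one common way to add a proper scoring rule as a second term. The function names are hypothetical and the exact weighting used in the cited work may differ.

```python
def binary_reward(correct: bool) -> float:
    """Plain correctness reward: ignores confidence entirely."""
    return 1.0 if correct else 0.0


def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness minus the Brier penalty (confidence - outcome)^2.

    Assumed form for illustration: a confident wrong answer is penalized,
    while an honest low-confidence wrong answer is not.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2
```

Under the binary reward, a wrong answer scores the same at any stated confidence, so nothing discourages confident guessing. Under the combined reward, a wrong answer at confidence 1.0 scores -1.0 while a wrong answer at confidence 0.0 scores 0.0, so overclaiming now costs something.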
Sources (10 notes)
Supervised learning cannot enforce persona consistency because it rewards correct responses but never penalizes contradictions. Offline reinforcement learning combines inexpensive training on existing data with explicit contradiction rewards using human-annotated labels, offering a practical alternative to expensive online RL.
By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.
Post-training installs stable dispositional profiles that persist under adversarial pressure, marking them as realized rather than performed. The stickiness of trained personas across conversations distinguishes them from prompt-induced role-play that collapses under jailbreaks.
Post-training installs robust personas that resist adversarial pressure and persist as substrate-level dispositions, distinguishing realization from pretense. This quasi-realizationist account preserves explanatory power while treating LLMs as possessing genuine quasi-beliefs and quasi-desires.
Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.
PersonaAgent uses structured personas to bridge episodic/semantic memory and personalized actions, optimizing them at test time by simulating recent interactions against textual feedback. Learned personas cluster meaningfully in latent space, suggesting genuine user-specific separation beyond standard post-training drift.
AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.
VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Research shows LLMs perform well when one model controls all interlocutors but fail systematically when agents possess private information. This reveals that apparent social competence relies on grounding work that models skip in omniscient settings.