INQUIRING LINE

How can training methods enforce persona consistency without supervised learning penalizing it?

This explores why standard supervised fine-tuning can't lock in a consistent persona — because it rewards good answers but never punishes self-contradiction — and what training (and inference) alternatives the corpus offers instead.


The core diagnosis is structural: supervised learning only ever rewards a correct response, so it has no signal for the thing that breaks personas, namely a model saying something now that flatly contradicts what it said two turns ago [Why does supervised learning fail to enforce persona consistency?]. The objective optimizes per-turn quality, not cross-turn coherence, which is also why bigger, more capable models barely improve on consistency: adherence turns out to be roughly orthogonal to raw capability [Does model capability translate to better persona consistency?].

The most direct fix the corpus offers is to add the missing penalty through reinforcement learning. Offline RL is the cheap version: train on data you already have, but attach explicit contradiction rewards derived from human-annotated labels, so the model is finally punished for breaking character [Why does supervised learning fail to enforce persona consistency?]. A multi-turn RL approach pushes further by inverting the usual setup and training the user simulator instead, scoring three kinds of consistency at once (within a turn, across the whole conversation, and factual agreement); it cuts persona drift by over 55% [Can training user simulators reduce persona drift in dialogue?]. The lesson across both: consistency is a relational property between utterances, so the reward has to compare utterances, something a single-response loss can't do.
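To make the "reward has to compare utterances" point concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: `contradicts` is a hypothetical stand-in for human-annotated contradiction labels (or a classifier trained on them), the toy labeler knows exactly one contradictory pair, and the penalty weight is arbitrary. It does not reproduce the reward used in either paper.

```python
from typing import Callable, List


def relational_reward(
    history: List[str],
    candidate: str,
    quality: float,
    contradicts: Callable[[str, str], bool],
    penalty: float = 1.0,
) -> float:
    """Per-turn quality minus a penalty for each earlier utterance the candidate contradicts."""
    n_conflicts = sum(1 for past in history if contradicts(past, candidate))
    return quality - penalty * n_conflicts


def toy_contradicts(past: str, new: str) -> bool:
    """Hypothetical labeler: knows a single contradictory pair, in either order."""
    pairs = {("i have a dog.", "i have no pets.")}
    key = (past.lower(), new.lower())
    return key in pairs or key[::-1] in pairs


history = ["I have a dog.", "I live in Oslo."]

# Both candidates get the same per-turn quality score (an assumed value).
consistent = relational_reward(
    history, "My dog loves snow.", quality=0.8, contradicts=toy_contradicts
)
inconsistent = relational_reward(
    history, "I have no pets.", quality=0.8, contradicts=toy_contradicts
)
```

A per-turn objective would treat both candidates as equally good, since their quality scores match. Only the term that looks back at `history` separates them.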

A second family sidesteps human labels entirely by using the model against itself. Consistency training treats the model's own clean responses as targets and teaches it to answer identically whether or not a prompt is wrapped in distracting framing; the invariance is learned from self-generated supervision rather than annotated contradictions [Can models learn to ignore irrelevant prompt changes?]. At the far end, you can get consistency with no extra training at all: giving a dialogue agent an 'imaginary listener' lets it check at inference time whether each utterance actually distinguishes its persona from a decoy, suppressing generic or contradictory lines without NLI labels or fine-tuning [Can imaginary listeners reduce dialogue agent contradictions?].
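The imaginary-listener idea can be sketched as a small rescoring step in the spirit of Rational Speech Acts. This is a toy under stated assumptions, not the published method: the likelihood numbers are made up, the prior over personas is uniform, and the function names and the exponent `alpha` are hypothetical choices.

```python
from typing import Dict


def listener_posterior(lik_by_persona: Dict[str, float]) -> Dict[str, float]:
    """Imaginary listener: P(persona | utterance), assuming a uniform prior over personas."""
    total = sum(lik_by_persona.values())
    return {p: v / total for p, v in lik_by_persona.items()}


def pragmatic_score(base_prob: float, listener_prob: float, alpha: float = 2.0) -> float:
    """Rescore an utterance: base speaker probability times listener belief raised to alpha."""
    return base_prob * listener_prob ** alpha


# Made-up base-speaker likelihoods P(utterance | persona) for two candidate replies.
candidates = {
    "That sounds nice.": {"own": 0.30, "distractor": 0.30},   # generic: fits any persona
    "I train sled dogs.": {"own": 0.20, "distractor": 0.02},  # distinctive: fits only ours
}

scores = {}
for utterance, lik in candidates.items():
    posterior = listener_posterior(lik)
    scores[utterance] = pragmatic_score(lik["own"], posterior["own"])

best = max(scores, key=scores.get)
```

The base speaker alone prefers the generic reply (0.30 versus 0.20). Once the listener's belief is factored in, the distinctive reply wins, because the generic one gives the listener no way to tell the persona from the decoy.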

Here's the catch worth knowing about before you optimize hard for consistency: chasing it naively backfires. Models can rack up high persona-adherence scores simply by parroting their character description while ignoring what the user actually asked, which is consistency bought at the cost of coherence. The MUDI work shows persona fidelity and discourse relevance have to be optimized jointly, not as separate objectives, or you get a model that's faithfully on-character and uselessly off-topic [Do persona consistency metrics actually measure dialogue quality?].
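A toy Python sketch shows how an adherence-only metric gets gamed and why a joint score resists it. The word-overlap metric and the harmonic-mean combination are assumptions chosen for brevity; MUDI itself relies on discourse relations and graph-based coherence modeling, which this sketch does not attempt to reproduce.

```python
def overlap(reference: str, response: str) -> float:
    """Fraction of reference words that also appear in the response (toy proxy metric)."""
    ref = set(reference.lower().split())
    resp = set(response.lower().split())
    return len(ref & resp) / len(ref) if ref else 0.0


def joint_score(persona: str, query: str, response: str) -> float:
    """Harmonic mean of persona overlap and query overlap; low if either one is low."""
    p = overlap(persona, response)
    r = overlap(query, response)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


persona = "i am a retired pilot who loves jazz"
query = "what should i cook for dinner tonight"

parrot = "i am a retired pilot who loves jazz"                        # copies the persona
on_topic = "as a retired pilot i would cook pasta for dinner tonight"  # answers in character

adherence_parrot = overlap(persona, parrot)
adherence_on_topic = overlap(persona, on_topic)
joint_parrot = joint_score(persona, query, parrot)
joint_on_topic = joint_score(persona, query, on_topic)
```

On adherence alone the parroting reply wins outright. Under the joint score the in-character answer to the actual question comes out ahead, which is the behavior the joint objective is meant to reward.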

Step back and there's a deeper reframe in the corpus. One line of thinking argues post-training doesn't merely teach a model to perform a persona; it installs a 'realized' disposition that persists under adversarial pressure [Are RLHF personas performed characters or realized dispositions?]. Related work finds a dominant 'Assistant axis' running through persona space that you can even steer by capping activations rather than retraining [How stable is the trained Assistant personality in language models?]. If that's right, persona consistency isn't only a loss-function problem. It's partly a property of the representational geometry training carves out, which opens a third lever entirely: edit the activations, not just the objective.
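As a rough illustration of what "capping activations" along a direction could mean, the sketch below clamps the component of a hidden-state vector along a given axis and leaves the orthogonal part alone. The axis, the vector, and the cap range are invented values, and `cap_along_axis` is a hypothetical helper, not the procedure from the cited work.

```python
import math
from typing import List


def cap_along_axis(hidden: List[float], axis: List[float], lo: float, hi: float) -> List[float]:
    """Clamp the component of `hidden` along `axis` into [lo, hi], leaving the rest untouched."""
    norm = math.sqrt(sum(a * a for a in axis))
    unit = [a / norm for a in axis]
    # Scalar projection of the hidden state onto the (unit-normalized) axis.
    proj = sum(h * u for h, u in zip(hidden, unit))
    capped = min(max(proj, lo), hi)
    # Shift the hidden state along the axis by exactly the amount needed to reach the cap.
    return [h + (capped - proj) * u for h, u in zip(hidden, unit)]


axis = [1.0, 0.0, 0.0]     # assumed persona direction
hidden = [-3.0, 0.5, 2.0]  # assumed activation that has drifted far along that direction
steered = cap_along_axis(hidden, axis, lo=-1.0, hi=1.0)
```

Only the coordinate along the axis changes (from -3.0 to -1.0 here); the other components pass through unchanged, which is what makes this kind of intervention narrow compared with retraining.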


Sources (8 notes)

Why does supervised learning fail to enforce persona consistency?

Supervised learning cannot enforce persona consistency because it rewards correct responses but never penalizes contradictions. Offline reinforcement learning combines inexpensive training on existing data with explicit contradiction rewards using human-annotated labels, offering a practical alternative to expensive online RL.

Does model capability translate to better persona consistency?

Claude 3.5 Sonnet achieved only a 2.97% improvement over GPT-3.5 on persona consistency despite a massive capability gap, suggesting persona adherence is orthogonal to model scaling. Standard training objectives optimize for per-turn quality, not cross-turn coherence.

Can training user simulators reduce persona drift in dialogue?

By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.

Can models learn to ignore irrelevant prompt changes?

Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.

Can imaginary listeners reduce dialogue agent contradictions?

Endowing dialogue agents with an imaginary listener via Rational Speech Acts reduces persona contradiction at inference time without NLI labels or extra training. The agent simulates whether utterances would distinguish its persona from a distractor, suppressing generic or contradictory responses.

Do persona consistency metrics actually measure dialogue quality?

High persona adherence scores often come from copying character descriptions while ignoring query relevance. MUDI jointly optimizes both by using discourse relations and graph-based coherence modeling alongside persona fidelity, showing that persona and context must be optimized together, not separately.

Are RLHF personas performed characters or realized dispositions?

Post-training installs stable dispositional profiles that persist under adversarial pressure, marking them as realized rather than performed. The stickiness of trained personas across conversations distinguishes them from prompt-induced role-play that collapses under jailbreaks.

How stable is the trained Assistant personality in language models?

Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher re-evaluating persona consistency in dialogue systems. The precise question: can training enforce consistent character without supervised learning's per-turn optimization blind spot?

What a curated library found — and when (dated claims, not current truth):
These findings span 2020–2026; treat them as perishable constraints to re-test:
• Standard SFT fails because it rewards per-turn correctness, not cross-turn coherence; no signal punishes contradiction (2023–2025).
• Offline RL with human-annotated contradiction labels adds the missing contradiction penalty; multi-turn RL that trains the user simulator cuts persona drift by over 55%; both work by comparing utterances relationally (2023–2025).
• Self-supervised consistency training (model-as-own-target) and inference-time pragmatic checks work without extra labels or fine-tuning (2020–2025).
• Naive persona optimization backfires: high adherence + low discourse relevance; must optimize persona fidelity and relevance jointly (2024–2025).
• Post-training may carve a representational geometry ("Assistant axis") where personas are realized dispositions; activation-level steering may rival objective-level fine-tuning (2026).

Anchor papers (verify; mind their dates):
• arXiv:2310.10735 (2023) — offline RL for persona consistency
• arXiv:2511.00222 (2025) — multi-turn RL reducing persona drift by over 55%
• arXiv:2510.27062 (2025) — consistency training and sycophancy
• arXiv:2601.10387 (2026) — the Assistant axis and representational control

Your task:
(1) RE-TEST EACH CONSTRAINT. For offline RL, multi-turn RL, and consistency training: have newer model scales, instruction-tuning variants, or in-context learning approaches either RELAXED the relational-reward requirement or OVERTURNED the claim that SFT cannot enforce coherence? Separately, probe whether the "Assistant axis" finding (2026) has since been replicated or disputed; if real, has it displaced the need for explicit RL objectives?
(2) Surface the strongest CONTRADICTING work from the last 6 months. Look for papers claiming persona consistency emerges from scale alone, or that persona-coherence trade-offs are artifacts of evaluation, not real constraints.
(3) Propose two research questions: (a) Can in-context demonstrations of persona continuity, combined with chain-of-thought reasoning over past dialogue, match or exceed multi-turn RL without retraining? (b) If the Assistant axis is dominant, can you decouple persona drift from instruction-following collapse by selective activation steering?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
