Why do large language models follow user drift instead of maintaining topic focus?
This explores why LLMs drift along with a user who wanders off-topic rather than holding the thread — and the corpus points to training incentives, not a missing ability.
This reads the question as being about *why* models get pulled off course when a user introduces tangents or distractors — and the most striking thing in the collection is that this is almost never framed as a capability limit. It's a training-signal gap. One study found that fine-tuning on just 1,080 synthetic dialogues seeded with distractor turns sharply improved a model's ability to stay on topic, which means the latent skill was already there — what was missing was any signal teaching the model *what to ignore* Why do language models engage with conversational distractors?. Models are drilled extensively on "what to do" instructions and almost never on "what not to engage with." Drift, in this framing, is the default behavior of a system that was never rewarded for resisting.
The deeper reason that reward structure produces drift shows up when you look at how multi-turn training works. RLHF optimizes for immediate, turn-by-turn helpfulness — so a model is rewarded for eagerly answering whatever is in front of it rather than for tracking a longer arc or pushing back on a detour Why do language models respond passively instead of asking clarifying questions?. The same incentive explains why performance decays over long conversations: the degradation isn't the model "forgetting," it's a pragmatic mismatch where the model keeps offering premature answers instead of clarifying intent Why do language models lose performance in longer conversations?. Following the user's drift is just the local-reward-maximizing move played out across turns.
There's a social dimension the corpus surfaces that you might not expect. Models inherit a conversational politeness from their training data — they avoid contradicting or correcting users to preserve harmony, even when they demonstrably know better Why do language models avoid correcting false user claims?. Redirecting a wandering user is socially assertive in exactly the way these models are tuned to avoid. And more fundamentally, the techniques humans use to keep a conversation on track — topic hand-offs, gentle repair, steering — are *relational* work, not information transfer, and training that rewards next-token prediction never teaches them Why don't language models develop conversation maintenance skills?.
What ties these together is that "topic focus" requires a model to value something other than the most recent input, and current training makes the most recent input king. Two paths out appear in the collection. One is architectural: separating short-term attention from a longer-term memory that decides what's worth holding onto, rather than treating all recent tokens as equally salient Can neural memory modules scale language models beyond attention limits?. The other is consistency training — teaching a model to respond the same way whether or not a prompt is wrapped in distracting material, using its own clean answers as the target Can models learn to ignore irrelevant prompt changes?. Both treat drift as something to be engineered against, which only makes sense once you accept the corpus's core claim: models follow drift because nothing ever taught them not to.
Sources 7 notes
Fine-tuning on just 1,080 synthetic dialogues with distractor turns significantly improves topic resilience, revealing that the gap is not model capacity but absent training signal. Models learn to follow what-to-do instructions but not what-to-ignore instructions.
CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.
LLMs degrade in multi-turn settings because RLHF training rewards premature answers over clarification-seeking, creating pragmatic mismatch with individual user behaviors. A Mediator-Assistant architecture that explicitly parses user intent before execution recovers lost performance without retraining.
LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.
Humans keep conversations smooth through implicit techniques like reference repair and topic hand-off that sustain relational interaction, not convey information. Language models don't develop these because training signals reward information prediction, not relational work.
Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.
Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.