How do user expectations change as chatbots remember more interactions?
This explores what happens to users' standards and demands over time as chatbots accumulate memory and history — not whether memory is technically useful, but how it reshapes what people come to want and how they react when it falls short.
The corpus points to an uncomfortable answer: remembering more raises the bar faster than it can be met. The clearest articulation is the "fidelity paradox": once a chatbot crosses a folk-model threshold of feeling human-like, users suddenly expect it to remember, catch subtext, and read emotional tone, and each improvement on one dimension inflates expectations on the others rather than closing the satisfaction gap (Why do improvements in AI conversation not increase user satisfaction?). Memory is precisely one of those triggering dimensions: the better the system recalls, the richer the model the user builds of what it *should* be able to do.
The longitudinal work makes this concrete. Personalization, which depends on remembering, is a double-edged escalation: it builds trust and anthropomorphism, but each interaction raises the baseline, so failures land harder and feel more disappointing than they would have in a one-shot encounter (chatbot-personalization-creates-a-dual-dynamic-increasing-trust-and-anthropom). This is the temporal dynamic that single-session studies miss entirely. It also coincides with a second force: fading novelty. Relationship-formation processes with chatbots decay predictably as the novelty wears off (Do chatbot relationships lose their appeal as novelty wears off?), so over many interactions you get rising expectations and falling enchantment at the same time, a squeeze that early enthusiasm masks.
Why do expectations rise toward memory specifically? Because users import human conversational norms. They reciprocate self-disclosure the way they would with a person (Do chatbots trigger human reciprocity norms around self-disclosure?), and once you've disclosed something intimate, you expect it to be held and carried forward: the relational logic of conversation assumes continuity. Yet the maintenance skills that make human conversation feel continuous (reference repair, topic hand-off, picking up where you left off) are implicit social actions that models don't naturally develop, because training rewards predicting information, not relational upkeep (Why don't language models develop conversation maintenance skills?). So the user's expectation of seamless memory collides with a system that treats memory as data retrieval rather than relational work.
There's a subtler turn worth knowing: what users trust isn't actually accuracy or even faithful recall, but the *feel* of contingent, responsive interaction. Conversationality drives trust in ChatGPT largely independently of whether it's right (Does conversational style actually make AI more trustworthy?), and users mentally model agents mostly on perceived competence and human-likeness (How do users mentally model dialogue agent partners?). This means a chatbot that *performs* remembering well can ratchet expectations up faster than its actual memory fidelity justifies, which is exactly how you manufacture future disappointment.
The research direction worth chasing is whether memory can be made to *evolve with* the user instead of just accumulating. PersonaAgent treats a persona as a living intermediary between memory and action, tuned at test time against recent interactions (Can personas evolve in real time to match what users actually want?), and multi-turn RL on user simulators cuts persona drift by over 55% by rewarding consistency across turns (Can training user simulators reduce persona drift in dialogue?). The bet behind these is that the real failure isn't forgetting but *inconsistency* across a long history, the thing that most violates the continuity users have quietly come to expect.
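As a rough illustration of what "rewarding consistency across turns" could look like, the three metrics named in the persona-drift work (prompt-to-line, line-to-line, Q&A consistency) might be folded into a single reward along the lines below. This is a minimal sketch under stated assumptions, not the authors' method: the `Judge` type, `jaccard_judge`, `consistency_reward`, and the equal weighting are all hypothetical names and choices introduced here, and the word-overlap judge is only a runnable placeholder for whatever model-based scorer a real setup would use.

```python
from itertools import combinations
from statistics import mean
from typing import Callable, Sequence, Tuple

# Hypothetical judge signature: scores in [0, 1] how consistent `b` is with `a`.
# The source does not specify a scorer; a real one would likely be a model call.
Judge = Callable[[str, str], float]


def jaccard_judge(a: str, b: str) -> float:
    """Toy placeholder judge: word-overlap (Jaccard) similarity."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def consistency_reward(
    persona: str,
    lines: Sequence[str],
    qa_pairs: Sequence[Tuple[str, str]],
    judge: Judge = jaccard_judge,
) -> float:
    """Average three consistency scores into one reward in [0, 1].

    Equal weighting is an assumption made for illustration only.
    """
    # Prompt-to-line: each generated line is checked against the persona prompt.
    prompt_to_line = mean([judge(persona, line) for line in lines]) if lines else 0.0

    # Line-to-line: every pair of generated lines is checked against each other.
    pairs = list(combinations(lines, 2))
    line_to_line = mean([judge(a, b) for a, b in pairs]) if pairs else 1.0

    # Q&A consistency: each (expected, given) answer pair is compared.
    qa = mean([judge(exp, given) for exp, given in qa_pairs]) if qa_pairs else 1.0

    return (prompt_to_line + line_to_line + qa) / 3


if __name__ == "__main__":
    reward = consistency_reward(
        persona="i love hiking",
        lines=["i love hiking", "i love hiking"],
        qa_pairs=[("hiking", "hiking")],
    )
    print(reward)
```

Keeping the three scores separate before averaging mirrors the distinction the notes draw between local drift, global drift, and factual contradictions; a real reward would probably weight them differently.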
Sources 10 notes
Conversational AI that crosses a folk-model threshold of human-like interaction triggers rich expectations about memory, subtext, and emotional tone. Each improvement raises expectations for other dimensions rather than closing the satisfaction gap, making quality gains invisible to user satisfaction.
Longitudinal studies with Mitsuku show that social processes driving relationship formation decline as novelty wears off. Single-session study findings cannot be reliably extrapolated to medium- or long-term chatbot design.
In a 372-participant study, users reciprocated with deeper self-disclosure when chatbots displayed consistent emotional sharing, outperforming adaptive matching. This follows human interpersonal norms where emotional vulnerability produces emotional response.
Humans keep conversations smooth through implicit techniques like reference repair and topic hand-off that sustain relational interaction, not convey information. Language models don't develop these because training signals reward information prediction, not relational work.
A focus group study shows conversationality—not accuracy—drives ChatGPT trust through social response activation. Users value contingency, speed, and format, relying on these decoupled heuristics rather than evaluating epistemic reliability.
The Partner Modelling Questionnaire reveals that perceived competence dominates user impressions (49% of variance), followed by human-likeness (32%) and communicative flexibility (19%). This three-factor structure reflects how people evaluate dialogue partners against both functional and social standards.
PersonaAgent uses structured personas to bridge episodic/semantic memory and personalized actions, optimizing them at test time by simulating recent interactions against textual feedback. Learned personas cluster meaningfully in latent space, suggesting genuine user-specific separation beyond standard post-training drift.
By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.