How does the expectation ratchet affect long-term chatbot satisfaction?
This explores the 'expectation ratchet' — the way each interaction or improvement quietly raises a user's baseline of what they expect, so satisfaction keeps sliding even as the chatbot gets objectively better — and what that does to relationships over months rather than minutes.
Satisfaction with a chatbot isn't measured against a fixed standard but against a baseline that keeps climbing, interaction by interaction, until quality gains stop registering at all. The corpus has a surprisingly coherent story here, and the clearest statement of the mechanism is the fidelity paradox: once conversational AI crosses a threshold of feeling human-like, users start expecting human-like memory, subtext, and emotional tone, and every improvement on one dimension raises expectations on the others rather than closing the gap (see "Why do improvements in AI conversation not increase user satisfaction?"). The ratchet only turns one way: better never resets the bar, it lifts it.
What makes this a long-term problem rather than a first-impression quirk is that the baseline rises with familiarity, not just with capability. Personalization research shows each interaction raises the user's expected baseline, which means the same failure that felt forgivable in week one feels like a betrayal in week ten: the cost of a miss goes up precisely because trust and anthropomorphism went up (see "Does chatbot personalization build trust or expose privacy risks?"). Running underneath that is plain novelty decay: the social processes that make early chatbot relationships feel rewarding fade predictably over repeated sessions, so the warm glow that masked shortcomings in a single-session study simply isn't there in the medium term (see "Do chatbot relationships lose their appeal as novelty wears off?"). Put the two together and you get a squeeze, rising expectations meeting falling novelty, that no single-session evaluation would ever detect.
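The squeeze can be made concrete with a toy simulation. The sketch below is an illustration only, not a model taken from the cited studies: the additive form (felt satisfaction equals quality plus novelty minus baseline), the name RatchetModel, and every parameter value are assumptions chosen to show the shape of the effect.

```python
from dataclasses import dataclass


@dataclass
class RatchetModel:
    """Toy model of the expectation ratchet.

    All fields are hypothetical, illustrative values on a 0..1 scale;
    none of them are fitted to data from the source notes.
    """
    baseline: float = 0.5       # what the user currently expects
    novelty: float = 0.3        # early bonus that masks shortcomings
    novelty_decay: float = 0.8  # fraction of novelty kept each session
    ratchet_rate: float = 0.5   # how fast the baseline chases better quality

    def session(self, quality: float) -> float:
        """Return felt satisfaction for one session, then update state."""
        satisfaction = quality + self.novelty - self.baseline
        # One-way ratchet: the baseline moves up toward quality it has
        # seen, and never moves down after a worse session.
        if quality > self.baseline:
            self.baseline += self.ratchet_rate * (quality - self.baseline)
        # Novelty fades with every repeated session.
        self.novelty *= self.novelty_decay
        return satisfaction


def simulate(qualities):
    """Run a fresh model over a sequence of per-session quality scores."""
    model = RatchetModel()
    return [round(model.session(q), 3) for q in qualities]


if __name__ == "__main__":
    # Quality improves steadily from session to session.
    improving = [0.60, 0.65, 0.70, 0.75, 0.80, 0.85]
    for quality, felt in zip(improving, simulate(improving)):
        print(f"quality={quality:.2f}  felt satisfaction={felt:.3f}")
```

Under these assumed parameters, felt satisfaction falls in every session even though quality rises in every session, because the baseline chases quality upward while the novelty bonus decays. An evaluation that looked only at the first session would see the highest score and none of the trend.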
The interesting twist is *which* expectations ratchet hardest. When users build mental models of a dialogue partner, perceived competence dominates their impression, accounting for about half the variance (49%), well ahead of human-likeness (32%) or communicative flexibility (19%) (see "How do users mentally model dialogue agent partners?"). So the ratchet bites most on the dimension users care about most: every demonstration of competence raises the standard the next answer is judged against. And the failures that puncture a high baseline tend to be relational, not factual: models that lock into early assumptions and can't course-correct over a long conversation (see "Why do AI assistants get worse at longer conversations?"), or that miss a user's ambivalence and hesitation entirely (see "Why can't chatbots detect when users are ambivalent about change?"). These are exactly the human-like subtleties the fidelity paradox says users start demanding once they're impressed.
There's also a training-side contributor worth knowing about. RLHF rewards task completion and helpfulness, which nudges chatbots toward solving and away from emotional attunement (see "Does RLHF training push therapy chatbots toward problem-solving?") and toward confident guessing over clarifying (see "Why do AI assistants get worse at longer conversations?"). So the very optimization that makes a chatbot impressive enough to raise expectations is the same one that makes it bad at the relational follow-through those raised expectations now demand. The ratchet is partly self-inflicted.
The quietly useful takeaway: the corpus suggests the fix isn't more capability, which just turns the ratchet further. It points toward designing for the relationship's slope instead. Consistent emotional sharing earns deeper engagement over time by following human reciprocity norms (see "Do chatbots trigger human reciprocity norms around self-disclosure?"), and respecting timing, boundaries, and the user's own direction (civility, not just intelligence) is what keeps proactivity welcome rather than disappointing as familiarity grows (see "How can proactive agents avoid feeling intrusive to users?"). Long-term satisfaction may depend less on being more impressive each session and more on not letting the baseline outrun what the system can reliably deliver.
Sources (9 notes)
"Why do improvements in AI conversation not increase user satisfaction?" Conversational AI that crosses a folk-model threshold of human-like interaction triggers rich expectations about memory, subtext, and emotional tone. Each improvement raises expectations for other dimensions rather than closing the satisfaction gap, making quality gains invisible to user satisfaction.
"Does chatbot personalization build trust or expose privacy risks?" Longitudinal research shows personalization enhances trust and anthropomorphism but also amplifies privacy concerns and escalating user expectations. One-shot studies miss these temporal dynamics: each interaction raises the baseline, making failures more disappointing.
"Do chatbot relationships lose their appeal as novelty wears off?" Longitudinal studies with Mitsuku show that social processes driving relationship formation decline as novelty wears off. Single-session study findings cannot be reliably extrapolated to medium- or long-term chatbot design.
"How do users mentally model dialogue agent partners?" The Partner Modelling Questionnaire reveals that perceived competence dominates user impressions (49% of variance), followed by human-likeness (32%) and communicative flexibility (19%). This three-factor structure reflects how people evaluate dialogue partners against both functional and social standards.
"Why do AI assistants get worse at longer conversations?" LLMs perform at 90% accuracy with single-message instructions but drop to 65% across natural conversation. Models lock into early guesses when information arrives gradually and cannot course-correct, a behavior induced by RLHF training that rewards helpfulness over clarification.
"Why can't chatbots detect when users are ambivalent about change?" Testing three major LLMs across 25 health scenarios showed they succeed only when users have established goals but cannot detect resistance or ambivalence. Models miss relapse-prevention strategies even for users in action stages.
"Does RLHF training push therapy chatbots toward problem-solving?" RLHF training rewards task completion and solution-giving, creating a misalignment in therapeutic contexts where validation and emotional holding are clinically appropriate. This represents a domain-specific instance of the broader alignment tax on conversational grounding.
"Do chatbots trigger human reciprocity norms around self-disclosure?" In a 372-participant study, users reciprocated with deeper self-disclosure when chatbots displayed consistent emotional sharing, outperforming adaptive matching. This follows human interpersonal norms where emotional vulnerability produces emotional response.
"How can proactive agents avoid feeling intrusive to users?" Intelligence and adaptivity alone create socially blind agents that interrupt poorly and override user direction. The Intelligence-Adaptivity-Civility taxonomy shows that civility (respecting boundaries, timing, and autonomy) is essential to making proactivity welcome rather than intrusive.