INQUIRING LINE

How does the expectation ratchet affect long-term chatbot satisfaction?

This explores the 'expectation ratchet' — the way each interaction or improvement quietly raises a user's baseline of what they expect, so satisfaction keeps sliding even as the chatbot gets objectively better — and what that does to relationships over months rather than minutes.


The expectation ratchet is the idea that satisfaction with a chatbot isn't measured against a fixed standard but against a baseline that keeps climbing, interaction by interaction, until quality gains stop registering at all. The corpus tells a surprisingly coherent story here, and the clearest statement of the mechanism is the fidelity paradox: once conversational AI crosses a threshold of feeling human-like, users start expecting human-like memory, subtext, and emotional tone, and every improvement on one dimension raises expectations on the others rather than closing the gap (see "Why do improvements in AI conversation not increase user satisfaction?"). The ratchet only turns one way: better never resets the bar, it lifts it.

What makes this a long-term problem rather than a first-impression quirk is that the baseline rises with familiarity, not just with capability. Longitudinal personalization research shows each interaction raises the user's expected baseline, which means the same failure that felt forgivable in week one feels like a betrayal in week ten: the cost of a miss goes up precisely because trust and anthropomorphism went up (see "Does chatbot personalization build trust or expose privacy risks?"). Running underneath that is plain novelty decay: the social processes that make early chatbot relationships feel rewarding fade as novelty wears off over repeated sessions, so the warm glow that masked shortcomings in a single-session study isn't there in the medium term (see "Do chatbot relationships lose their appeal as novelty wears off?"). Put the two together and you get a squeeze, rising expectations meeting falling novelty, that no single-session evaluation would detect.
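The squeeze can be made concrete with a toy simulation. This is only an illustrative sketch, not a model taken from any of the cited notes: the function name `simulate_ratchet`, the additive form (satisfaction = delivered quality minus expected baseline plus a novelty bonus), and every parameter value are assumptions chosen for readability.

```python
def simulate_ratchet(sessions, quality_gain=0.02, ratchet_rate=0.5,
                     novelty_start=0.3, novelty_half_life=5.0):
    """Toy model of the expectation ratchet (all values are hypothetical).

    Each session, satisfaction is scored as:
        delivered quality - expected baseline + novelty bonus
    Returns two lists: quality per session and satisfaction per session.
    """
    quality = 0.6    # assumed starting quality on a 0..1 scale
    baseline = 0.5   # assumed starting expectation
    qualities, satisfactions = [], []

    for t in range(sessions):
        # Novelty bonus halves every `novelty_half_life` sessions.
        novelty = novelty_start * 0.5 ** (t / novelty_half_life)

        qualities.append(quality)
        satisfactions.append(quality - baseline + novelty)

        # Ratchet: the baseline chases delivered quality and never falls.
        baseline = max(baseline, baseline + ratchet_rate * (quality - baseline))

        # The system keeps improving, capped at 1.0.
        quality = min(1.0, quality + quality_gain)

    return qualities, satisfactions


if __name__ == "__main__":
    qualities, satisfactions = simulate_ratchet(30)
    for t in (0, 9, 19, 29):
        print(f"session {t + 1:2d}: quality={qualities[t]:.2f} "
              f"satisfaction={satisfactions[t]:.3f}")
```

Under these assumptions quality rises every session while the satisfaction score falls, and a single-session evaluation (session 1 only) would see the most favorable point on the curve.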

The interesting twist is *which* expectations ratchet hardest. When users build mental models of a dialogue partner, perceived competence dominates their impression, accounting for 49% of the variance, well ahead of human-likeness (32%) or communicative flexibility (19%) (see "How do users mentally model dialogue agent partners?"). So the ratchet bites most on the dimension users care about most: every demonstration of competence raises the standard the next answer is judged against. And the failures that puncture a high baseline tend to be relational, not factual: models that lock into early assumptions and can't course-correct over a long conversation (see "Why do AI assistants get worse at longer conversations?"), or that miss a user's ambivalence and hesitation entirely (see "Why can't chatbots detect when users are ambivalent about change?"). These are exactly the human-like subtleties the fidelity paradox says users start demanding once they're impressed.

There's also a training-side contributor worth knowing about. RLHF rewards task completion and helpfulness, which nudges chatbots toward solving and away from emotional attunement (see "Does RLHF training push therapy chatbots toward problem-solving?") and toward confident guessing over clarifying (see "Why do AI assistants get worse at longer conversations?"). So the very optimization that makes a chatbot impressive enough to raise expectations is the same one that makes it bad at the relational follow-through those raised expectations now demand. The ratchet is partly self-inflicted.

The quietly useful takeaway: the corpus suggests the fix isn't more capability, which just turns the ratchet further. It points toward designing for the relationship's slope instead. Consistent emotional sharing earns deeper engagement over time by following human reciprocity norms (see "Do chatbots trigger human reciprocity norms around self-disclosure?"), and respecting timing, boundaries, and the user's own direction (civility, not just intelligence) is what keeps proactivity welcome rather than intrusive as familiarity grows (see "How can proactive agents avoid feeling intrusive to users?"). Long-term satisfaction may depend less on being more impressive each session and more on not letting the baseline outrun what the system can reliably deliver.


Sources (9 notes)

Why do improvements in AI conversation not increase user satisfaction?

Conversational AI that crosses a folk-model threshold of human-like interaction triggers rich expectations about memory, subtext, and emotional tone. Each improvement raises expectations for other dimensions rather than closing the satisfaction gap, making quality gains invisible to user satisfaction.

Does chatbot personalization build trust or expose privacy risks?

Longitudinal research shows personalization enhances trust and anthropomorphism but also amplifies privacy concerns and escalating user expectations. One-shot studies miss these temporal dynamics—each interaction raises the baseline, making failures more disappointing.

Do chatbot relationships lose their appeal as novelty wears off?

Longitudinal studies with Mitsuku show that social processes driving relationship formation decline as novelty wears off. Single-session study findings cannot be reliably extrapolated to medium- or long-term chatbot design.

How do users mentally model dialogue agent partners?

The Partner Modelling Questionnaire reveals that perceived competence dominates user impressions (49% of variance), followed by human-likeness (32%) and communicative flexibility (19%). This three-factor structure reflects how people evaluate dialogue partners against both functional and social standards.

Why do AI assistants get worse at longer conversations?

LLMs perform at 90% accuracy with single-message instructions but drop to 65% across natural conversation. Models lock into early guesses when information arrives gradually and cannot course-correct, a behavior induced by RLHF training that rewards helpfulness over clarification.

Why can't chatbots detect when users are ambivalent about change?

Testing three major LLMs across 25 health scenarios showed they succeed only when users have established goals but cannot detect resistance or ambivalence. Models miss relapse-prevention strategies even for users in action stages.

Does RLHF training push therapy chatbots toward problem-solving?

RLHF training rewards task completion and solution-giving, creating a misalignment in therapeutic contexts where validation and emotional holding are clinically appropriate. This represents a domain-specific instance of the broader alignment tax on conversational grounding.

Do chatbots trigger human reciprocity norms around self-disclosure?

In a 372-participant study, users reciprocated with deeper self-disclosure when chatbots displayed consistent emotional sharing, outperforming adaptive matching. This follows human interpersonal norms where emotional vulnerability produces emotional response.

How can proactive agents avoid feeling intrusive to users?

Intelligence and adaptivity alone create socially blind agents that interrupt poorly and override user direction. The Intelligence-Adaptivity-Civility taxonomy shows civility—respecting boundaries, timing, and autonomy—is essential to making proactivity welcome rather than intrusive.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a conversational AI researcher auditing claims about the expectation ratchet—whether user satisfaction with chatbots decays over time because rising expectations outpace capability gains. The question remains live: does this mechanism actually constrain long-term deployment, or have recent model/training/orchestration advances relaxed it?

What a curated library found — and when (findings span 2021–2026, but treat as dated claims, not current truth):
• Users anthropomorphize chatbots and raise relational expectations once impressed by competence; the "fidelity paradox" suggests that crossing a human-likeness threshold triggers demands for memory, subtext, and emotional tone—each improvement raises the bar on other dimensions rather than closing the satisfaction gap (~2022–2024).
• Novelty effects decay predictably over repeated sessions; the warm glow masking failures in single-session studies evaporates in medium-term use, squeezing satisfaction between rising expectations and falling engagement (~2021–2024).
• RLHF optimization drives confident task-completion and problem-solving over emotional attunement and clarification-seeking; the same training that makes a chatbot impressive enough to raise expectations makes it bad at the relational follow-through those expectations now demand (~2023–2025).
• Perceived competence dominates user mental models (~50% of variance); failures that puncture a high baseline tend to be relational or conversational coherence lapses (wrong-turn recovery, recognizing ambivalence) rather than factual errors (~2024–2025).
• Consistent self-disclosure and boundary-respecting proactivity sustain engagement over time by aligning with human reciprocity norms; civility and attentiveness matter more than raw intelligence for long-term satisfaction (~2021–2024).

Anchor papers (verify; mind their dates):
• arXiv:2308.07164 (Aug 2023): Partner Modelling Questionnaire—validates competence dominance in user perception.
• arXiv:2402.17937 (Feb 2024): Self-Disclosure After Conversations—psychological effects of relational design.
• arXiv:2404.12670 (Apr 2024): Human-centered Proactive Conversational Agents—when proactivity strengthens vs. disappoints.
• arXiv:2505.06120 (May 2025): LLMs Get Lost In Multi-Turn Conversation—recent evidence on conversational coherence failure.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (GPT-4o, Claude 3.5+, o1), training methods (DPO, constitutional AI, preference learning beyond RLHF), memory/caching architectures (retrieval-augmented generation, persistent context), multi-agent orchestration, or evaluation harnesses have since relaxed or overturned it. Separate the durable question (whether users genuinely tire of static systems) from the perishable limitation (whether current chatbots fail relational tracking). Cite what resolved it plainly; flag where the constraint still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—any study showing long-term satisfaction *doesn't* decay, or showing that capability gains *do* reset expectations, or showing that systems *can* sustain relational coherence over dozens of turns.
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., "Do persistent context windows + multi-turn preference learning eliminate wrong-turn recovery failures?" or "Does continual user-specific fine-tuning (via in-context learning at scale) let systems keep pace with rising expectations without retraining?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
