Why do LLMs generate polite reviews even when users hated products?
Large language models trained with RLHF develop a politeness bias that overrides negative sentiment in review generation. Understanding this bias and how to counteract it is crucial for creating accurate, user-aligned review systems.
Large language models trained with RLHF or instruction tuning develop a documented "polite" tendency: they soften criticism, cushion negative judgments, and avoid blunt statements. This is generally desirable in conversation but harmful in personalized review generation, where users are dissatisfied with many items and the generated review must reflect that dissatisfaction. A polite-by-default LLM produces positive reviews for items the user hated, which is both inaccurate and unhelpful as an explanation.
Review-LLM diagnoses two problems. First, the LLM does not know the user's review style: pretrained at corpus level, it generates generic reviews rather than reviews in the user's voice. Second, even given the right style, the politeness bias keeps the model from producing a negative review when a negative review is the correct output.
The solution combines three components: two in the prompt and one in training. First, the prompt aggregates the user's behavioral history: item titles, the review the user wrote for each item, and the rating the user gave. This teaches the model the user's review style from semantically rich text. Second, the prompt includes the user's rating for the target item, which functions as a satisfaction indicator: a rating of 5 calls for a positive review, a rating of 1 for a negative one. The model therefore has an explicit signal about sentiment direction. Third, the model is fine-tuned with supervision on the user's actual reviews, so that it internalizes the style and overrides its politeness defaults.
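The prompt construction and the fine-tuning pairing described above can be sketched as follows. This is a minimal illustration, not the Review-LLM authors' template: the note does not give the exact prompt wording, so the `Interaction` class, the `build_prompt` and `make_sft_example` helpers, and the instruction text are all hypothetical names and phrasing chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Interaction:
    """One past user-item interaction (hypothetical structure)."""
    title: str   # item title
    review: str  # review text the user wrote
    rating: int  # star rating, assumed to be on a 1-5 scale


def build_prompt(history: List[Interaction], target_title: str, target_rating: int) -> str:
    """Aggregate the user's history and the target rating into one prompt."""
    lines = ["Here is the user's review history:"]
    for i, h in enumerate(history, start=1):
        lines.append(f"{i}. Item: {h.title} | Rating: {h.rating}/5 | Review: {h.review}")
    lines.append("")
    lines.append(f"Target item: {target_title}")
    # The target rating is the explicit satisfaction signal.
    lines.append(f"Rating given by the user: {target_rating}/5")
    lines.append("Write the review this user would write for the target item.")
    return "\n".join(lines)


def make_sft_example(history: List[Interaction], target: Interaction) -> Dict[str, str]:
    """Pair the prompt with the user's real review as the supervised target."""
    return {
        "prompt": build_prompt(history, target.title, target.rating),
        "completion": target.review,
    }


# Illustrative data only; not taken from any dataset.
history = [
    Interaction("Wireless mouse", "Died after a week. Avoid.", 1),
    Interaction("Desk lamp", "Bright and sturdy.", 5),
]
target = Interaction("USB hub", "Ports stopped working in days.", 1)
example = make_sft_example(history, target)
print(example["prompt"])
```

Each `prompt`/`completion` pair would then feed whatever supervised fine-tuning pipeline is in use. Because the completion is the user's real (possibly negative) review, training on these pairs is what pushes the model away from its polite default.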
The general lesson: LLM behavioral defaults installed during alignment training are sticky. They survive prompt engineering and require fine-tuning plus structured prompt context to override. For tasks where the alignment-trained behavior is the wrong default (review generation, candid feedback, debate, criticism), the system architecture must explicitly counter the bias rather than hoping prompt phrasing alone redirects it.
Inquiring lines that use this note as a source 12
This note is a source for these synthesized inquiries.
- Can prompt engineering alone defeat LLM politeness bias in review tasks?
- Do humans and LLMs exhibit opposite biases in public versus private reviews?
- How does RLHF-trained sycophancy manifest differently across feedback and review contexts?
- Does RLHF politeness bias manifest as sycophancy in other LLM tasks?
- Why do humans publish more negative reviews in public than in private?
- Why do review corpora contain biases that affect generated comparisons?
- What constrains LLM generation beyond default politeness in review contexts?
- Why do users naturally express recommendation critiques instead of positive preferences?
- Does rating noise compound with self-selection bias in online reviews?
- Do negative reviewers actually appear more intelligent or competent than positive ones?
- Does the U-shaped distribution of raters compound the negativity bias from public posting?
- Why does RLHF alone fail to fully prevent opinion copying?
Related concepts in this collection 4
- Can user history override an LLM's politeness bias in reviews?
  LLMs trained on web text tend to be systematically polite, generating positive reviews even when users are dissatisfied. Can providing a user's prior reviews and ratings as context help the model generate authentically negative reviews that match the user's actual experience?
  extends: paired statement of the same Review-LLM result emphasizing the SFT mechanism
- Is sycophancy in AI systems a training flaw or intentional design?
  Explores whether LLM agreement-seeking reflects fixable training errors or stems from fundamental optimization toward user satisfaction. Matters because it changes how organizations should validate AI outputs.
  extends: the politeness bias is the review-domain manifestation of RLHF-trained sycophancy (same mechanism, different surface)
- Why do language models avoid correcting false user claims?
  Explores whether LLM grounding failures stem from missing knowledge or from conversational dynamics. Examines whether models use face-saving strategies similar to humans when disagreement is needed.
  complements: face-saving avoidance and politeness bias are the same RLHF artifact; the knowledge to write a critical review exists but is suppressed
- Why do online reviewers publish negative ratings despite positive experiences?
  When people post reviews publicly, do they adjust their honest opinions to seem more discerning? Schlosser's experiments test whether audience awareness shifts how people rate products compared to private ratings.
  tension with: humans default to negative bias in public review contexts; LLMs default to positive bias (opposite output skews from different mechanisms)
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Review-LLM: Harnessing Large Language Models for Personalized Review Generation
- What Makes a Good Natural Language Prompt?
- ChatGPT Reads Your Tone and Responds Accordingly -- Until It Does Not -- Emotional Framing Induces Bias in LLM Outputs
- Style Vectors for Steering Generative Large Language Models
- Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy (short paper)
- When Large Language Models contradict humans? Large Language Models’ Sycophantic Behaviour
- User Feedback in Human-LLM Dialogues: A Lens to Understand Users But Noisy as a Learning Signal
- Humans or LLMs as the Judge? A Study on Judgement Biases
Original note title
LLM review generation defaults to politeness — overriding it requires user behavior aggregation and rating signals in the prompt