INQUIRING LINE

How do personalization errors differ from general accuracy problems in summaries?

This explores what makes a summary fail *for a particular person* — getting the wrong things or mis-attributing them — versus failing on plain truthfulness, and why the corpus treats these as two different problems with different fixes.


The corpus draws a sharper line than you might expect between personalization errors (a summary that's true but wrong *for you*) and general accuracy errors (a summary that's simply false or ungrounded). A summary can be factually flawless and still fail an individual. In a user study of meeting summaries, the central complaint wasn't that the model lied. It was that the system summarized *global* importance instead of what mattered to the specific reader, and that speaker mis-attributions damaged group trust and accountability even when the underlying facts were right (see "Why do LLM meeting summaries fail to help individuals?"). That's the signature of a personalization error: the content is accurate, but the relevance and the 'who said what' are calibrated to the wrong person.

The most striking difference is the *shape* of the error. General accuracy failures tend to get worse the further the model drifts from its evidence: hallucination rises as sources degrade (see "Can RAG systems refuse to answer without reliable evidence?"), and models override their own context when training priors are strong (see "Why do language models ignore information in their context?"). Personalization errors follow a *different* curve. One study found a U-shaped error pattern where the worst mistakes come not from a totally wrong user profile but from one that's *almost* right: the model confidently applies nearly-matched preferences, an uncanny-valley effect more damaging than obvious mismatch (see "Why do similar user profiles produce worse personalization errors?"). So accuracy errors scale with distance from the truth; personalization errors spike with deceptive *closeness* to the right person.
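One way to look for that near-miss effect in your own system is a profile-replacement probe: swap the true profile for others, and record how far quality drops as a function of how similar the replacement was. The sketch below is only an illustration under stated assumptions. Profiles are represented as sets of preference tags, similarity is Jaccard overlap, and `score_fn` is a hypothetical caller-supplied quality metric; none of these choices come from the cited study.

```python
def jaccard(a, b):
    """Similarity between two profiles represented as sets of preference tags."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def replacement_probe(true_profile, other_profiles, score_fn):
    """Swap in each other profile and record (similarity, score drop).

    score_fn(profile) -> float is supplied by the caller (hypothetical);
    higher means the personalized output was better for the true user.
    """
    baseline = score_fn(true_profile)
    results = []
    for profile in other_profiles:
        similarity = jaccard(true_profile, profile)
        drop = baseline - score_fn(profile)
        results.append((similarity, drop))
    # Ordered from least similar to most similar replacement.
    return sorted(results)
```

Reading the returned pairs from least to most similar shows where the drop is steepest. If the largest drops sit at the high-similarity end, the system is showing the near-miss pattern described above.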

They also live in different layers of the content. Accuracy is about semantic facts; personalization, it turns out, is mostly about style and preference. Profiles built from a user's past *outputs* match or beat full profiles, while profiles built from their *inputs* actively hurt, suggesting personalization rides on how someone writes and what they prefer, not on the topical content of their queries (see "Do user outputs outperform inputs for LLM personalization?"). And abstracted preference summaries beat literal recall of past interactions (see "Does abstract preference knowledge outperform specific interaction recall?"). This means a personalization error isn't a missing fact you could retrieve. It's a misread of taste, which retrieval alone can't fix.
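A minimal sketch of the output/input distinction, assuming a user's history is stored as (input, output) pairs: an output-only profile keeps just what the user wrote, and an input-only profile keeps just the query side. The `Interaction` type, the `build_profile` function, and the sample history are hypothetical and exist only to make the distinction concrete; the cited work does not prescribe this format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    user_input: str   # the query or prompt side of the interaction
    user_output: str  # what the user themselves wrote in response


def build_profile(history, use_inputs=True, use_outputs=True, max_items=3):
    """Collect texts from the most recent interactions to condition a model on."""
    texts = []
    for item in history[-max_items:]:
        if use_inputs:
            texts.append(item.user_input)
        if use_outputs:
            texts.append(item.user_output)
    return texts


# Hypothetical history, for illustration only.
history = [
    Interaction("Review of a budget laptop", "Solid value. Battery is weak."),
    Interaction("Review of a hiking boot", "Comfortable. Runs small."),
]
output_only = build_profile(history, use_inputs=False)  # style and preference
input_only = build_profile(history, use_outputs=False)  # topical content only
```

The output-only profile carries the user's voice (short, verdict-first sentences in this toy history), while the input-only profile carries only topics.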

Because of that, the *fixes* diverge. Accuracy problems are addressed by grounding and refusal: constrain the model to only say what the evidence supports (see "Can RAG systems refuse to answer without reliable evidence?"). Personalization problems get fixed by aligning the summary to a *downstream goal* or a *learned model of the person*: training summarizers against ranking rewards so they emphasize the attributes a user actually acts on (see "Can reinforcement learning align summarization with ranking goals?"), or learning text preference summaries that condition a reward model and capture dimensions a generic summary misses (see "Can text summaries beat embeddings for personalized reward models?"). A grounded-but-generic summary passes the accuracy test and fails the personalization test.
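As a rough illustration of why the two tests are independent, the sketch below runs two separate checks on the same summary: a grounding check (is every sentence supported by the source?) and a relevance check (does it touch what this reader cares about?). The function names, the word-overlap heuristic, the thresholds, and the sample meeting text are all assumptions made for illustration; a real system would use much stronger matching than shared words.

```python
import re


def words(text):
    """Lowercased word set; a crude stand-in for real semantic matching."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_grounded(summary, source, min_overlap=0.8):
    """Accuracy check: each summary sentence must be mostly covered by the source."""
    source_words = words(source)
    for sentence in re.split(r"(?<=[.!?])\s+", summary.strip()):
        sent_words = words(sentence)
        if not sent_words:
            continue
        if len(sent_words & source_words) / len(sent_words) < min_overlap:
            return False
    return True


def is_relevant(summary, reader_interests, min_hits=1):
    """Personalization check: the summary must mention the reader's interests."""
    interests = {w.lower() for w in reader_interests}
    return len(words(summary) & interests) >= min_hits


# Hypothetical meeting notes and reader profile.
source = ("Dana said the launch moves to March. "
          "Lee said the billing migration is blocked on the vendor contract.")
generic_summary = "Dana said the launch moves to March."
reader_interests = {"billing", "migration"}

grounded = is_grounded(generic_summary, source)
relevant = is_relevant(generic_summary, reader_interests)
print(grounded, relevant)  # True False
```

Here the summary is fully supported by the source yet says nothing about the billing migration this reader follows, so it passes the first check and fails the second.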

The thing worth carrying away: personalization errors are *relational and confidence-amplified* in a way accuracy errors aren't. A false fact is usually flagged as a hallucination; a confidently mis-personalized summary feels authoritative precisely because it's well-formed and almost-right. Downstream, people rarely catch it: writers edited AI text only 23% of the time before it reached an audience (see "Do writers actually edit AI-generated text before publishing?"). The accuracy bug announces itself; the personalization bug hides inside a fluent, trustworthy-looking summary aimed at the wrong reader.


Sources (9 notes)

Why do LLM meeting summaries fail to help individuals?

A user study of seven participants found three critical failures: systems summarize global importance rather than individual relevance, mis-attributions damage group trust and accountability, and one format cannot serve both quick scanning and detailed reference needs.

Why do similar user profiles produce worse personalization errors?

PRIME shows a U-shaped error curve where most-similar profile replacements cause steepest performance drops. The model confidently applies wrong preferences when profiles are nearly but not truly matched, an uncanny valley effect more harmful than obvious mismatch.

Can RAG systems refuse to answer without reliable evidence?

A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Do user outputs outperform inputs for LLM personalization?

Research shows that user profiles built from outputs alone match or exceed performance of complete profiles across multiple tasks, while input-only profiles degrade performance. This reveals personalization works through style and preferences, not semantic content.

Does abstract preference knowledge outperform specific interaction recall?

PRIME framework shows semantic memory (preference summaries, parametric encodings) consistently beats episodic memory (retrieved past interactions) across models. Recency-based recall outperforms similarity-based retrieval, and task fine-tuning exceeds preference tuning methods.

Can reinforcement learning align summarization with ranking goals?

ReLSum trains summarizers using downstream relevance scores as RL rewards, producing dense, attribute-focused summaries instead of fluent prose. This alignment to the actual ranking metric improves recall, NDCG, and user engagement in production e-commerce search.

Can text summaries beat embeddings for personalized reward models?

PLUS trains summarizers and reward models jointly, learning that text-based preference summaries capture dimensions zero-shot summaries miss. These summaries transfer to GPT-4 for zero-shot personalization and remain interpretable to users.

Do writers actually edit AI-generated text before publishing?

Writers edited AI-generated paragraphs only 23% of the time, with edits averaging 96% similarity to the original. This means AI's opinionated and distorted voice propagates with minimal human filtering before publication.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a synthesis researcher re-testing claims about personalization vs. accuracy errors in LLM summaries. The question remains open: do personalization failures follow fundamentally different patterns than factual hallucinations, and if so, what fixes actually work?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat these as perishable benchmarks:
• Personalization errors follow a U-shaped curve, spiking when user profiles are *almost* right (~2024–2025), whereas accuracy errors scale monotonically with distance from evidence.
• Speaker mis-attribution and relevance miscalibration (both accurate facts, wrong person) are the signature failure modes, distinct from hallucination (~2023–2024).
• User-output-derived profiles outperform input-derived ones; semantic preference abstraction beats episodic recall (~2024–2025).
• RL-trained query-relevant summaries and learned text-based preference models outperform grounding alone for personalization (~2025–2026).
• Only ~23% edit rates on AI text suggest personalization bugs hide inside fluent, near-correct summaries (~2026).

Anchor papers (verify; mind their dates):
• arXiv:2307.15793 (2023) — meeting summaries, speaker attribution failures
• arXiv:2406.17803 (2024) — user profile role in LLM personalization
• arXiv:2507.13579 (2025) — RL fine-tuned pluralistic preference summaries
• arXiv:2604.22503 (2026) — persona distortions in AI writing assistance

Your task:
(1) RE-TEST THE U-SHAPED CURVE. Has confidence-amplified near-miss personalization indeed remained harder to catch than factual errors in newer models (GPT-4o, Claude 3.5, Llama 3.x)? Or has better calibration, chain-of-thought, or multimodal grounding *compressed* this gap? Where does the accuracy–personalization split still hold?
(2) Surface the strongest *DISAGREEMENT*: Are there recent papers arguing that personalization errors are merely a *subcategory* of hallucination (confident false relevance judgments), not a distinct failure mode? What would reconcile that view with the U-shaped finding?
(3) Propose two research questions that assume the regime may have shifted:
   a) If in-context preference conditioning now outperforms learned profiles, does the edit-rate gap (23%) still predict real-world harm?
   b) Can adversarially near-miss personalization be detected *before deployment* via synthetic preference perturbation?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
