What makes historical user outputs more effective for personalization than semantic similarity?
This explores why a user's past *outputs* (what they wrote or produced) personalize better than retrieving past content by semantic similarity (matching on topic or meaning) — and what that reveals about what personalization is actually keyed on.
The corpus gives a surprisingly consistent answer: personalization is mostly about *style and preference*, not subject matter. The core finding is that profiles built from a user's outputs alone match or beat full profiles, while input-only profiles degrade performance (Do user outputs outperform inputs for LLM personalization?). Outputs win because they carry *how* a person likes things expressed and decided, their voice and their taste, whereas inputs and semantically matched documents carry *what* a topic is about. Topic is the part that needs the least help: the model can already handle content.
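To make the output-only idea concrete, here is a minimal sketch of assembling a profile from a user's history in three modes: outputs only, inputs only, or both. The `Interaction` type, the `build_profile` and `build_prompt` helpers, and the prompt wording are all hypothetical illustrations, not taken from the cited work.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    """One past exchange: what the user was shown and what they wrote back."""
    input_text: str
    output_text: str


def build_profile(history, mode="outputs", max_items=5):
    """Return profile lines from the last `max_items` interactions.

    mode is "outputs", "inputs", or "full" (both). These mode names are
    illustrative assumptions, not terminology from the source.
    """
    if mode not in {"outputs", "inputs", "full"}:
        raise ValueError(f"unknown mode: {mode}")
    if max_items < 1:
        raise ValueError("max_items must be at least 1")
    lines = []
    for item in history[-max_items:]:
        if mode in ("inputs", "full"):
            lines.append(f"Input: {item.input_text}")
        if mode in ("outputs", "full"):
            lines.append(f"User wrote: {item.output_text}")
    return lines


def build_prompt(history, task, mode="outputs"):
    """Prepend the chosen profile to a task description."""
    profile = "\n".join(build_profile(history, mode))
    return f"User profile:\n{profile}\n\nTask: {task}\nMatch this user's style."
```

With `mode="outputs"` the prompt carries only the user's own writing, which is the configuration the finding above favors; switching to `mode="inputs"` gives the variant reported to hurt performance.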
That reframing explains why similarity-based retrieval keeps underperforming. Retrieving the most semantically similar past interaction optimizes for the wrong axis. One striking result: recency-based recall beats similarity-based retrieval, and abstract preference *summaries* beat recalling specific past interactions altogether (Does abstract preference knowledge outperform specific interaction recall?). Pushed further, text-based preference summaries condition a reward model better than embedding vectors do, which suggests the dimensions that matter for taste don't survive being squashed into a similarity space (Can text summaries beat embeddings for personalized reward models?).
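The difference between the two selection strategies can be sketched in a few lines. This is a toy illustration only: the bag-of-words cosine stands in for a real embedding model, and the function names and the `(timestamp, text)` history layout are assumptions made for the example.

```python
import math
from collections import Counter


def bow_cosine(a, b):
    """Cosine similarity over word counts (a toy stand-in for embeddings)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_by_similarity(history, query, k):
    """Pick the k past texts closest to the query in topic."""
    ranked = sorted(history, key=lambda item: bow_cosine(item[1], query), reverse=True)
    return [text for _, text in ranked[:k]]


def select_by_recency(history, k):
    """Pick the k most recent past texts, ignoring the query entirely."""
    ranked = sorted(history, key=lambda item: item[0], reverse=True)
    return [text for _, text in ranked[:k]]


# history entries are (timestamp, text) pairs; values here are made up
history = [
    (1, "notes on python packaging"),
    (2, "short blunt email to landlord"),
    (3, "terse reply declining meeting"),
]
similar = select_by_similarity(history, "python packaging question", k=1)
recent = select_by_recency(history, k=2)
```

Similarity selection returns the topically matching note, while recency selection returns the two latest messages, which say more about the user's current voice than about the query's subject.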
The most counterintuitive piece is that similarity isn't just neutral; it can be actively harmful. There's a U-shaped error curve where replacing a user's profile with that of the *most similar* other user produces the steepest performance drop, worse than an obvious mismatch. The model confidently applies almost-right preferences, an uncanny-valley effect (Why do similar user profiles produce worse personalization errors?). So 'close in semantic space' is precisely the failure zone, because nearness on content masks divergence on preference.
What actually works is a different cut at the user. Some methods infer a compact preference structure: ten adaptive questions can pin down a user's reward coefficients without touching model weights (Can user preferences be learned from just ten questions?). Others find that users aren't a single taste vector at all but a *mix of personas*, weighted by attention to the item being recommended right now, which improves accuracy and yields explanations for free (Can modeling multiple user personas improve recommendation accuracy?). And LLMs reading raw activity can surface persistent 'interest journeys', like 'designing hydroponic systems for small spaces', that pure similarity-based collaborative filtering completely misses (Can language models discover what users actually want from activity logs?).
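The persona-mixture idea can be illustrated with a small sketch: the user is a set of persona vectors, and the candidate item decides how much each persona counts. This assumes plain dot-product attention with a softmax and hand-picked two-dimensional vectors; it is not the architecture of any cited method, and all names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def persona_weights(personas, candidate):
    """Attention weights: how relevant each persona is to this candidate."""
    return softmax([dot(p, candidate) for p in personas])


def score_candidate(personas, candidate):
    """Score a candidate against a user vector built for that candidate.

    Returns (score, weights); the weights double as a simple explanation
    of which persona drove the recommendation.
    """
    weights = persona_weights(personas, candidate)
    user_vec = [
        sum(w * p[i] for w, p in zip(weights, personas))
        for i in range(len(candidate))
    ]
    return dot(user_vec, candidate), weights


# Two made-up personas, e.g. "gardening" and "cooking" directions
personas = [[1.0, 0.0], [0.0, 1.0]]
score, weights = score_candidate(personas, [2.0, 0.0])
```

Because the weights are recomputed per candidate, the same user is represented differently for a gardening item than for a cooking item, and the weights themselves indicate which persona the recommendation leaned on.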
The thread tying these together: similarity retrieves *content like this*, but personalization needs *a person like you*, and a person is better described by what they've produced, summarized into stable preferences, than by a cloud of topically adjacent documents. The seemingly obvious lever (find the closest match) turns out to be the trap; abstraction over the user's own outputs is the lever that works (Why does chain-of-thought reasoning fail for personalization?).
Sources (8 notes)
Do user outputs outperform inputs for LLM personalization? Research shows that user profiles built from outputs alone match or exceed the performance of complete profiles across multiple tasks, while input-only profiles degrade performance. This suggests personalization works through style and preferences, not semantic content.
Does abstract preference knowledge outperform specific interaction recall? The PRIME framework shows semantic memory (preference summaries, parametric encodings) consistently beats episodic memory (retrieved past interactions) across models. Recency-based recall outperforms similarity-based retrieval, and task fine-tuning exceeds preference tuning methods.
Can text summaries beat embeddings for personalized reward models? PLUS trains summarizers and reward models jointly, yielding text-based preference summaries that capture dimensions zero-shot summaries miss. These summaries transfer to GPT-4 for zero-shot personalization and remain interpretable to users.
Why do similar user profiles produce worse personalization errors? PRIME shows a U-shaped error curve where most-similar profile replacements cause the steepest performance drops. The model confidently applies wrong preferences when profiles are nearly but not truly matched, an uncanny-valley effect more harmful than an obvious mismatch.
Can user preferences be learned from just ten questions? PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. Responses can then be personalized via inference-time reward alignment, without weight modification.
Can modeling multiple user personas improve recommendation accuracy? AMP-CF separates the user representation into latent personas weighted by attention to the candidate item. This candidate-conditional approach improves accuracy by adapting the user representation at prediction time and produces inherent explanations for why items were recommended.
Can language models discover what users actually want from activity logs? 66% of users pursue valued interest journeys lasting over a month, described in specific phrases like 'designing hydroponic systems for small spaces.' LLM-powered journey discovery bridges the semantic gap that collaborative filtering cannot reach, operating at user-level granularity with persona-level precision.
Why does chain-of-thought reasoning fail for personalization? Generic chain-of-thought underperforms for personalization because it ignores user context. Fine-tuning destroys reasoning capacity entirely. Self-distillation lets models generate customized thinking traces that maintain both depth and relevance.