INQUIRING LINE

What makes prompts and retrieval insufficient for real personalization?

This explores why the two most common personalization shortcuts — stuffing context into prompts and retrieving a user's past interactions — fall short of capturing who a user actually is, and what the corpus suggests works better.


The short version from the corpus: both treat personalization as a lookup problem (find the right past text, paste it in) when it is really an abstraction problem (learn the durable shape of someone's preferences). Retrieval pulls specific past interactions, but abstract preference summaries consistently beat that episodic recall across models. Within recall itself, recency-based recall beats similarity-based retrieval, which means the very mechanism retrieval relies on (find the most similar past thing) is the weak link (see: Does abstract preference knowledge outperform specific interaction recall?).
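To make the recency-versus-similarity contrast concrete, here is a minimal Python sketch of the two recall strategies run over the same interaction history. Everything in it is an illustrative assumption rather than anything taken from PRIME: the `Interaction` class, the toy two-dimensional embeddings, and the function names are hypothetical, and a real system would get embeddings from an encoder.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Interaction:
    text: str
    timestamp: int          # larger means more recent (hypothetical unit)
    embedding: list[float]  # toy vector standing in for an encoder output


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall_by_recency(history, k):
    """Return the k most recent interactions, ignoring the query entirely."""
    return sorted(history, key=lambda i: i.timestamp, reverse=True)[:k]


def recall_by_similarity(history, query_embedding, k):
    """Return the k interactions whose embeddings are closest to the query."""
    return sorted(
        history,
        key=lambda i: cosine(i.embedding, query_embedding),
        reverse=True,
    )[:k]


history = [
    Interaction("asked about sourdough starters", 1, [1.0, 0.0]),
    Interaction("compared grow lights", 2, [0.0, 1.0]),
    Interaction("asked about nutrient pumps", 3, [0.1, 0.9]),
]
query = [1.0, 0.1]  # superficially closest to the oldest interaction

recent = [i.text for i in recall_by_recency(history, 2)]
similar = [i.text for i in recall_by_similarity(history, query, 2)]
```

In this toy history, similarity retrieval surfaces the stale sourdough interaction because it happens to sit near the query, while recency recall returns the two latest interactions about the user's current interest. The sketch shows only how the two selection rules differ, not which one performs better.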

Worse, retrieval's similarity matching has an active failure mode, not just a ceiling. When the system grabs a profile that is *almost* but not truly the user's, errors are worse than with an obvious mismatch: the error curve is U-shaped, because the model confidently applies nearly-right preferences, an uncanny valley of personalization (see: Why do similar user profiles produce worse personalization errors?). So 'retrieve the most similar user' can be the most dangerous move available.

Prompts have their own walls. A 160-user field study of LLM movie recommenders found that they explain themselves well but barely personalize, and that user-provided context mattered more than any prompt engineering (see: Do LLM movie recommenders actually personalize to individual users?). Prompt techniques do not even transfer across models: rephrasing and background-knowledge prompts boost cheap models, while step-by-step reasoning can *reduce* accuracy in strong ones, so there is no stable 'best prompt' to lean on (see: Do prompt techniques work the same across all LLM tiers?). And generic chain-of-thought reasoning, when simply prompted in, underperforms for personalization because it reasons without the user in mind (see: Why does chain-of-thought reasoning fail for personalization?).

The deeper reason both fall short: what personalizes a response is not the content of what a person typed but their style and preferences. Profiles built from a user's *outputs* match or beat full profiles, while input-only profiles actively degrade performance, which means the queries you would retrieve carry the wrong signal (see: Do user outputs outperform inputs for LLM personalization?). Real preferences also live at a timescale that prompts and snippets cannot see: 66% of users pursue interest 'journeys' lasting over a month, like 'designing hydroponic systems for small spaces,' that collaborative filtering and one-shot retrieval miss (see: Can language models discover what users actually want from activity logs?).

So the corpus points away from text-in-context toward *learned, compressed representations* of a user. User embeddings distilled from interaction history outperform text prompts on long histories while being cheaper (see: Can user embeddings personalize language models more efficiently than prompts?). Jointly trained text summaries condition reward models better than raw embeddings and stay human-readable (see: Can text summaries beat embeddings for personalized reward models?). Preferences can be factored into a handful of coefficients inferred from ten adaptive questions (see: Can user preferences be learned from just ten questions?). And traits can be written into the architecture itself via lightweight adapters that bypass prompt resistance entirely (see: Can we control personality in language models without prompting?). The through-line: prompts and retrieval move text around; real personalization requires the model to *learn an abstraction of the user* and carry it, not look it up each turn.


Sources (11 notes)

Does abstract preference knowledge outperform specific interaction recall?

The PRIME framework shows that semantic memory (preference summaries, parametric encodings) consistently beats episodic memory (retrieved past interactions) across models. Recency-based recall outperforms similarity-based retrieval, and task fine-tuning exceeds preference-tuning methods.

Why do similar user profiles produce worse personalization errors?

PRIME shows a U-shaped error curve in which replacing a profile with the most similar one causes the steepest performance drops. The model confidently applies wrong preferences when profiles are nearly but not truly matched, an uncanny valley effect more harmful than an obvious mismatch.

Do LLM movie recommenders actually personalize to individual users?

A 160-user field study found LLMs deliver strong explainability yet lack personalization, diversity, and user trust. User-provided context matters more than prompt engineering, and LLMs perform better on niche items than mainstream ones.

Do prompt techniques work the same across all LLM tiers?

A 23-prompt benchmark across 12 LLMs shows rephrasing and background-knowledge prompts boost cheap models, while step-by-step reasoning reduces accuracy in high-performance models. Task structure, not generic best practices, determines which prompts help.

Why does chain-of-thought reasoning fail for personalization?

Generic chain-of-thought underperforms for personalization because it ignores user context. Fine-tuning destroys reasoning capacity entirely. Self-distillation lets models generate customized thinking traces that maintain both depth and relevance.

Do user outputs outperform inputs for LLM personalization?

Research shows that user profiles built from outputs alone match or exceed performance of complete profiles across multiple tasks, while input-only profiles degrade performance. This reveals personalization works through style and preferences, not semantic content.

Can language models discover what users actually want from activity logs?

66% of users pursue valued interest journeys lasting over a month, described in specific phrases like 'designing hydroponic systems for small spaces.' LLM-powered journey discovery bridges the semantic gap that collaborative filtering cannot reach, operating at user-level granularity with persona-level precision.

Can user embeddings personalize language models more efficiently than prompts?

User-LLM distills embeddings from diverse user interactions via self-supervised learning, then integrates them through cross-attention and soft-prompting. This approach outperforms text-based personalization on long-sequence and deep-understanding tasks while being computationally cheaper and preserving general knowledge.

Can text summaries beat embeddings for personalized reward models?

PLUS trains summarizers and reward models jointly, learning that text-based preference summaries capture dimensions zero-shot summaries miss. These summaries transfer to GPT-4 for zero-shot personalization and remain interpretable to users.

Can user preferences be learned from just ten questions?

PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. Personalization then happens via inference-time reward alignment, without weight modification.
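The idea of a user as a small coefficient vector over shared base rewards can be sketched as follows. This is a simplified illustration under stated assumptions, not PReF's actual procedure: it assumes a linear reward and a Bradley-Terry style choice model, and it swaps in plain uncertainty sampling (ask about the pair whose predicted outcome is closest to 50/50) for whatever selection criterion the paper uses. All function names and the feature vectors are hypothetical.

```python
from math import exp


def score(w, feats):
    """User-specific reward: a weighted sum of base reward scores (assumed linear)."""
    return sum(wi * fi for wi, fi in zip(w, feats))


def prob_prefers_a(w, a, b):
    """Bradley-Terry style probability that the user prefers response a over b."""
    return 1.0 / (1.0 + exp(-(score(w, a) - score(w, b))))


def pick_question(w, pairs):
    """Uncertainty sampling: choose the pair whose predicted outcome is nearest 0.5."""
    return min(pairs, key=lambda p: abs(prob_prefers_a(w, p[0], p[1]) - 0.5))


def update(w, a, b, chose_a, lr=0.5):
    """One logistic-regression gradient step on the coefficients after an answer."""
    err = (1.0 if chose_a else 0.0) - prob_prefers_a(w, a, b)
    return [wi + lr * err * (ai - bi) for wi, ai, bi in zip(w, a, b)]


# Each response is described by its scores under two hypothetical base rewards.
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),  # outcome already fairly predictable under w_known
    ([0.0, 1.0], [0.0, 2.0]),  # outcome a coin flip under w_known
]
w_known = [1.0, 0.0]
next_question = pick_question(w_known, pairs)

# Starting from no knowledge, one answer shifts the coefficients.
w_start = [0.0, 0.0]
w_after = update(w_start, [1.0, 0.0], [0.0, 1.0], chose_a=True)
```

Repeating the pick-then-update loop for a fixed budget of questions gives a per-user coefficient vector that can rescore candidate responses at inference time while the underlying model's weights stay untouched.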

Can we control personality in language models without prompting?

PsychAdapter modifies every transformer layer with <0.1% additional parameters to achieve 87.3% Big Five accuracy and 96.7% depression/life satisfaction accuracy across GPT-2, Gemma, and Llama 3. This architecture-level approach bypasses prompt resistance entirely.
