Can better prompting techniques overcome weak personalization in recommender systems?
This explores whether smarter prompts can compensate for recommenders that don't actually adapt to the individual user — and the corpus's answer leans toward no: the bottleneck is missing signal, not poorly worded instructions.
The clearest evidence comes from a 160-user field study of LLM movie recommenders, which found that the systems explained their picks beautifully but still failed to personalize, diversify, or earn trust. Crucially, the *context the user supplied mattered more than how the prompt was engineered* (see "Do LLM movie recommenders actually personalize to individual users?"). That reframes the whole question: weak personalization is usually a starvation problem (not enough signal about this particular person), and prompting is a way of phrasing a request, not a way of manufacturing signal that was never collected.
Even where prompting does help, it does not help uniformly. A 23-prompt benchmark across 12 models found that rephrasing and background-knowledge prompts lift cheap models, while step-by-step reasoning actually *reduces* recommendation accuracy in strong models: task structure decides what helps, not generic 'best practices' (see "Do prompt techniques work the same across all LLM tiers?"). So even the prompting wins are conditional and easily reversed. If prompts were the lever for personalization, you'd expect consistent gains; instead you get effects that depend on the model tier and can flip sign.
The approaches that the corpus shows actually moving the needle all add or restructure signal rather than rewording the ask. One line learns a compact personalized reward from as few as ten adaptively chosen questions, aligning to the user at inference time without touching model weights (see "Can user preferences be learned from just ten questions?"). Another finds that storing *abstracted* preference summaries beats retrieving raw past interactions: semantic memory outperforms episodic recall across models (see "Does abstract preference knowledge outperform specific interaction recall?"). A third attacks sparsity directly: when a user's history is thin, retrieval augmentation plus personalized aspect selection supplies the richness that prompting alone can't conjure (see "Can retrieval enhancement fix explainable recommendations for sparse users?"). Notice the pattern: these are interventions on *what the model knows about you*, upstream of any prompt.
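The ten-question idea can be made concrete with a small active-learning loop. The sketch below is an illustration under stated assumptions, not PReF's actual procedure: it assumes the user's reward is a weighted sum of a few base reward scores per item, keeps a Gaussian belief over those weights, and treats each answer as a noisy +1/-1 observation of the score difference (a Bayesian linear-regression update swapped in for simplicity, where a real system would more likely use a logistic preference model). All names (`items`, `pick_question`, `update`, `learn`) and the feature values are hypothetical.

```python
from itertools import combinations

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mat_vec(matrix, vec):
    return [dot(row, vec) for row in matrix]

def pick_question(items, cov, asked):
    """Return the unasked item pair whose preference margin is most uncertain."""
    best, best_var = None, -1.0
    for a, b in combinations(sorted(items), 2):
        if (a, b) in asked:
            continue
        d = [x - y for x, y in zip(items[a], items[b])]
        var = dot(d, mat_vec(cov, d))  # variance of d . w under the current belief
        if var > best_var:
            best, best_var = (a, b), var
    return best

def update(mean, cov, d, answer, noise=1.0):
    """Rank-one Gaussian update; answer is +1 if the first item was preferred, else -1."""
    cov_d = mat_vec(cov, d)
    denom = noise + dot(d, cov_d)
    gain = [c / denom for c in cov_d]
    err = answer - dot(d, mean)
    new_mean = [m + g * err for m, g in zip(mean, gain)]
    n = len(d)
    new_cov = [[cov[i][j] - gain[i] * cov_d[j] for j in range(n)] for i in range(n)]
    return new_mean, new_cov

def learn(items, true_w, n_questions=10):
    """Simulate a user with hidden weights true_w answering adaptively chosen questions."""
    dim = len(true_w)
    mean = [0.0] * dim
    cov = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    asked = set()
    for _ in range(n_questions):
        pair = pick_question(items, cov, asked)
        if pair is None:  # ran out of distinct pairs
            break
        asked.add(pair)
        a, b = pair
        d = [x - y for x, y in zip(items[a], items[b])]
        answer = 1.0 if dot(d, true_w) >= 0 else -1.0
        mean, cov = update(mean, cov, d, answer)
    return mean, cov

# Hypothetical items, each scored by three assumed base reward functions.
items = {
    "A": [1.0, 0.0, 0.0], "B": [0.0, 1.0, 0.0], "C": [0.0, 0.0, 1.0],
    "D": [1.0, 1.0, 0.0], "E": [0.0, 1.0, 1.0], "F": [-1.0, 0.0, 1.0],
}
mean, cov = learn(items, true_w=[1.0, -0.5, 0.2], n_questions=10)
```

With an uninformative prior, the first question is simply the pair whose scores differ most (here D versus F); later questions shift toward whichever directions of the weight vector remain least certain. No prompt wording is involved at any step: the gain comes entirely from collecting signal the system did not have.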
There's an even deeper move worth knowing about: skipping the natural-language interface entirely and training the model on recommendation signal as a reward. Systems trained closed-loop on metrics like NDCG learn to generate effective search queries and recommendations from system feedback alone, without prompt craft or even catalog access (see "Can LLMs recommend products without ever seeing the catalog?" and "Can recommendation metrics train language models directly?"). The representational fixes also live in the architecture, not the prompt: multi-persona user models that trace each pick to a specific taste (see "Can attention mechanisms reveal which user taste explains each recommendation?"), and richer item identifiers that fuse ID, title, and attributes (see "Can item identifiers balance uniqueness and semantic meaning?").
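To show what "a recommendation metric as a reward" means in practice, here is a minimal sketch of NDCG@k computed over a ranked list and returned as a scalar. It uses the standard binary-relevance definition with a log2 rank discount, and it assumes the ranked list contains no duplicate items. The function names and item IDs are hypothetical, and the reinforcement-learning update that would consume the reward is not shown; this is not Rec-R1's code.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1), ranks from 1."""
    return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_items, relevant, k=10):
    """NDCG@k with binary relevance: 1.0 if an item is in `relevant`, else 0.0."""
    gains = [1.0 if item in relevant else 0.0 for item in ranked_items]
    ideal = [1.0] * min(len(relevant), k)  # best case: all relevant items ranked first
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical retrieval result for one query produced by the model being trained.
ranked = ["item_7", "item_2", "item_9"]
relevant = {"item_2"}
reward = ndcg_at_k(ranked, relevant, k=3)  # relevant item at rank 2: 1 / log2(3), about 0.63
```

Because the reward depends only on where the relevant items land in the returned ranking, the model being trained never needs to see the catalog itself, which is the property the closed-loop approach relies on.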
The thing you didn't know you wanted to know: the field study's most useful finding is that LLM recommenders are *better at niche items than mainstream ones*. That inverts the usual cold-start intuition and suggests the real opportunity isn't prompting your way to better mass-market hits — it's pointing these systems at the long tail where their semantic knowledge already has an edge, and feeding them more user context rather than more clever instructions.
Sources (9 notes)
A 160-user field study found LLMs deliver strong explainability yet lack personalization, diversity, and user trust. User-provided context matters more than prompt engineering, and LLMs perform better on niche items than mainstream ones.
A 23-prompt benchmark across 12 LLMs shows rephrasing and background-knowledge prompts boost cheap models, while step-by-step reasoning reduces accuracy in high-performance models. Task structure, not generic best practices, determines which prompts help.
PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. The model can then be personalized to each user via inference-time reward alignment, without weight modification.
PRIME framework shows semantic memory (preference summaries, parametric encodings) consistently beats episodic memory (retrieved past interactions) across models. Recency-based recall outperforms similarity-based retrieval, and task fine-tuning exceeds preference tuning methods.
ERRA combines model-agnostic review retrieval with personalized aspect selection to address data sparsity that embedded methods cannot solve. Retrieval augmentation provides richer signal when user history is sparse, while aspect personalization ensures explanations match user context rather than generic defaults.
Rec-R1 experiments show that LLMs trained via RL with recommender metrics as rewards can generate effective product search queries without catalog access. The model learns query refinement indirectly through system feedback, paralleling how humans search without knowing platform inventory.
Rec-R1 demonstrates that LLMs can be trained directly on rule-based recommendation metrics like NDCG and Recall as RL reward signals, eliminating the need for SFT distillation from proprietary models while remaining model-agnostic across different retriever architectures.
AMP-CF represents each user as multiple latent personas weighted dynamically by candidate item. This makes recommendations both diverse and interpretable—each suggestion traces to the specific persona preference it satisfies—without requiring post-hoc reranking.
TransRec shows that combining numeric IDs, titles, and attributes into structured identifiers solves three problems simultaneously: distinctiveness from IDs, semantics from text, and generation grounding from structural constraints. Neither pure IDs nor pure text alone achieves all three.