INQUIRING LINE

How does pretraining corpus popularity bias affect LLM recommendation behavior?

This explores how an LLM's recommendations skew toward items that were popular in its pretraining text — not items popular in the actual dataset it's deployed on — and what that means for using LLMs as recommenders.


The sharpest finding here is that popularity bias in LLM recommenders doesn't come from the interaction data you feed them; it's baked in during pretraining. GPT-4, for instance, keeps recommending The Shawshank Redemption across wildly different datasets, even ones with completely different popularity distributions, because that title is over-represented in the text it learned from (see "Where does LLM recommendation bias actually come from?"). This is a domain-shift problem: the model is recommending the world's popular items, not your catalog's popular items, and standard debiasing methods built for collaborative filtering don't touch it.
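One way to see the domain shift on your own data is to compare what the model recommends with what your catalog's users actually consume. The sketch below is a minimal illustration, not a method from the cited notes: the function names, the three summary numbers, and the default `k` are all assumptions, and it expects both input lists to be non-empty.

```python
from collections import Counter


def top_k(counts, k):
    """Return the set of the k most frequent items in a Counter."""
    return {item for item, _ in counts.most_common(k)}


def popularity_shift(recommended, interactions, k=3):
    """Compare LLM recommendations with the target dataset's own popularity.

    recommended:  flat list of item ids the LLM suggested across many prompts
    interactions: flat list of item ids from the target dataset's interaction log
    Both lists are assumed to be non-empty (hypothetical input format).
    """
    rec_counts = Counter(recommended)
    cat_counts = Counter(interactions)

    # Jaccard overlap between the model's favourites and the catalog's favourites.
    rec_top, cat_top = top_k(rec_counts, k), top_k(cat_counts, k)
    jaccard = len(rec_top & cat_top) / len(rec_top | cat_top)

    # Share of recommendations pointing at items the catalog has never seen.
    off_catalog = sum(n for item, n in rec_counts.items() if item not in cat_counts)

    return {
        "top_k_jaccard": jaccard,
        "off_catalog_share": off_catalog / len(recommended),
        # How concentrated the model is on its single favourite item.
        "top_item_share": rec_counts.most_common(1)[0][1] / len(recommended),
    }
```

A low `top_k_jaccard` together with a high `top_item_share` is the pattern the Shawshank example describes: the model concentrates on its own favourites while the catalog's favourites are elsewhere.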

That single failure is part of a broader pattern. LLM recommenders inherit a whole family of biases from pretraining (position bias, popularity bias, and fairness bias) that stem from the language model's objective and the demographics of its corpus rather than from any user-interaction signal (see "Where do recommendation biases come from in language models?"). And this isn't unique to recommendation: causal experiments show that cognitive biases in general are planted during pretraining and only nudged by finetuning. Models sharing a pretrained backbone show similar bias fingerprints no matter what instruction data you tune them on (see "Where do cognitive biases in language models come from?"). So if you're hoping to finetune the popularity skew away, the evidence says you're working at the wrong layer.

The interesting turn is what the collected research suggests doing about it. One school of thought says: stop asking the LLM to rank at all. LLMs are great at understanding content but carry this baked-in ranking bias, so using them to enrich item descriptions with paraphrases, summaries, and categories, and feeding that enriched text to a traditional recommender, actually beats letting the LLM recommend directly (see "Does LLM input augmentation beat direct LLM recommendation?"). The LLM's text understanding is the asset; its predictions are the liability.
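To make the division of labour concrete, here is a rough sketch of the augmentation pattern: the LLM only writes extra text, and a separate, non-LLM scorer does the ranking. Everything named here is an assumption for illustration. `generate` is a hypothetical callable (prompt in, text out) standing in for whatever LLM client you use, the prompts are made up, and the bag-of-words cosine scorer is a toy stand-in for a real traditional recommender.

```python
import math
import re
from collections import Counter


def enrich(description, generate):
    """Append LLM-written text to an item description.

    `generate` is a hypothetical callable (prompt -> str), not a real client API.
    """
    prompts = [
        f"Paraphrase this item description: {description}",
        f"List categories for this item: {description}",
    ]
    return " ".join([description] + [generate(p) for p in prompts])


def bag_of_words(text):
    """Lowercased token counts; a deliberately simple text representation."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_items(user_text, catalog, generate):
    """Rank catalog items (id -> description) against a user profile text.

    The LLM never scores or orders anything; it only enriches the item text.
    """
    user_vec = bag_of_words(user_text)
    scores = {
        item_id: cosine(user_vec, bag_of_words(enrich(desc, generate)))
        for item_id, desc in catalog.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

Because ranking happens entirely over the target catalog, the model's pretraining favourites cannot be recommended unless they are actually in `catalog`.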

A second school closes the loop with reinforcement learning. Instead of trusting the model's pretrained priors, you train it directly against recommendation metrics like NDCG and Recall as black-box rewards, which pulls behavior toward the actual target catalog rather than the corpus's celebrity items (see "Can recommendation metrics train language models directly?"). Strikingly, models trained this way learn implicit catalog awareness: they generate effective product queries without ever seeing the inventory, much as a person searches a store without knowing its full stock (see "Can LLMs recommend products without ever seeing the catalog?").
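For reference, the two metrics named above are simple enough to compute by hand. The sketch below uses the standard binary-relevance definitions of Recall@k and NDCG@k. The `reward` function that blends them, and its `weight` parameter, are assumptions for illustration and not the reward used in Rec-R1. It also assumes `ranked` contains no duplicate items.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 gets log2(2)
        for rank, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def reward(ranked, relevant, k=10, weight=0.5):
    """Hypothetical scalar reward: a weighted blend of NDCG@k and Recall@k."""
    return weight * ndcg_at_k(ranked, relevant, k) + (1 - weight) * recall_at_k(ranked, relevant, k)
```

The reward depends only on the retrieved list and the held-out relevant items, which is what makes it usable as a black box: the policy being trained never needs to see the catalog itself, only the score that comes back.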

Worth knowing as a kicker: popularity bias rarely travels alone. The same pretraining-origin story produces persuasion biases (LLMs lean on logical, quantitative framing in nearly every exchange, lending recommendations unearned authority; see "Do LLMs persuade users more often than humans do?") and citation-trust effects (users prefer answers with more citations even when those citations are irrelevant; see "Do users trust citations more when there are simply more of them?"). A popularity-biased recommendation delivered with confident, well-cited prose is doubly hard for a user to push back on, because the bias and the persuasiveness reinforce each other.


Sources (8 notes)

Where does LLM recommendation bias actually come from?

GPT-4 concentrates recommendations on items popular in its pretraining corpus rather than in target datasets. The Shawshank Redemption dominates across different datasets even when they have different popularity distributions, revealing a domain-shift effect that standard debiasing methods cannot address.

Where do recommendation biases come from in language models?

Wu et al. show that LLM-based recommendation systems exhibit position bias, popularity bias, and fairness bias—unique failure modes stemming from the language model's pretraining objective and corpus demographics rather than interaction data. Mitigation requires LLM-specific approaches, not adapted collaborative filtering techniques.

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.

Does LLM input augmentation beat direct LLM recommendation?

Using LLMs to augment item descriptions with paraphrases, summaries, and categories—then feeding enriched text to traditional recommenders—beats asking LLMs to recommend directly. The mechanism: LLMs excel at content understanding but lack specialized ranking bias, so their textual enrichment is more valuable than their predictions.

Can recommendation metrics train language models directly?

Rec-R1 demonstrates that LLMs can be trained directly on rule-based recommendation metrics like NDCG and Recall as RL reward signals, eliminating the need for SFT distillation from proprietary models while remaining model-agnostic across different retriever architectures.

Can LLMs recommend products without ever seeing the catalog?

Rec-R1 experiments show that LLMs trained via RL with recommender metrics as rewards can generate effective product search queries without catalog access. The model learns query refinement indirectly through system feedback, paralleling how humans search without knowing platform inventory.

Do LLMs persuade users more often than humans do?

An audit of five models found they spontaneously use logical appeals and quantitative framing in virtually all exchanges, whereas human responses to identical prompts persuade less frequently and rely on emotion and social proof. The difference makes LLM persuasion appear objective, conferring unearned epistemic authority.

Do users trust citations more when there are simply more of them?

Analysis of 24,000 Search Arena interactions shows irrelevant citations boost user preference (β=0.273) nearly as much as relevant citations (β=0.285), indicating citation count functions as a decoupled trust heuristic.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a recommendation-systems researcher re-testing claims about pretraining-induced popularity bias in LLM recommenders. The question: Does popularity bias in LLM recommendation systems originate primarily from pretraining corpus statistics, and if so, can it be corrected downstream?

What a curated library found — and when (dated claims, not current truth):
Findings span 2021–2026; treat all as perishable until re-verified.
• GPT-4 recommends The Shawshank Redemption across wildly different datasets, suggesting corpus popularity dominates catalog-specific distributions (~2023).
• Popularity, position, and fairness biases in LLM recommenders stem from the language model's pretraining objective and corpus demographics, not user-interaction signals (~2024).
• Cognitive biases are planted during pretraining; finetuning only nudges them; models sharing a pretrained backbone exhibit similar bias fingerprints regardless of instruction tuning (~2025).
• Using LLMs for content enrichment (summaries, paraphrases) fed to traditional recommenders outperforms direct LLM ranking (~2023–2024).
• Closed-loop RL training against recommendation metrics (NDCG, Recall) as black-box rewards pulls LLM behavior toward actual catalogs; models learn implicit catalog awareness without seeing inventory (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2305.19860 (2023-05) — LLM recommendation survey, early systematic view.
• arXiv:2507.07186 (2025-07) — Cognitive bias origins; pretraining vs. finetuning.
• arXiv:2506.05334 (2025-06) — Search-augmented LLMs; context-dependency of retrieval bias.
• arXiv:2604.22109 (2026-04) — Spontaneous persuasion; bias–persuasiveness coupling.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, has the regime (model capability, training methods, evaluation harness, retrieval augmentation, multi-agent orchestration) changed since then in a way that alters the picture? Example: Do newer models (o1, GPT-4.5, Claude 4) still exhibit the Shawshank effect? Has in-context learning or RAG fully decoupled catalog awareness from pretraining? Does prompt engineering or chain-of-thought now override corpus popularity? Separate the durable insight ("pretraining shapes priors") from the perishable failure mode ("finetuning can't fix it").
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Does any paper show that RL training, retrieval augmentation, or prompt design HAS successfully corrected popularity bias? Any evidence finetuning's impact was underestimated?
(3) Propose 2 research questions that ASSUME the regime HAS shifted: (a) Given modern retrieval + RL stacks, is corpus popularity bias a solved layer, or does it re-emerge in edge cases (e.g., cold-start, adversarial prompts, multi-turn)? (b) Does popularity-bias correction conflict with other objectives (e.g., user satisfaction, serendipity, fairness to niche items)?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
