Do pretraining biases and traditional selection bias compound in production recommenders?
This explores whether two distinct bias origins — biases baked in during language-model pretraining, and the older selection bias that comes from training only on items users were actually shown — combine and reinforce each other in a live recommender. The corpus suggests they're not the same problem, and that's exactly why they compound: each has a different source, so debiasing one leaves the other untouched.
Start with the pretraining side. When you build a recommender on top of an LLM, it arrives already opinionated. Wu et al. catalog three biases inherited straight from the pretraining objective and corpus demographics rather than from any interaction data: position, popularity, and fairness bias (see "Where do recommendation biases come from in language models?"). The sharpest illustration is GPT-4 concentrating its picks on items popular in *its pretraining corpus* rather than in your dataset: The Shawshank Redemption surfaces across datasets with totally different popularity distributions (see "Where does LLM recommendation bias actually come from?"). That's a domain-shift bias standard debiasing can't reach, because it doesn't live in your logs.
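One way to check for this kind of mismatch is to compare how often the LLM recommends an item with how often that item shows up in your own interaction logs. The sketch below is illustrative only: the function names, the `ratio` threshold, and the toy data are assumptions made for the example, not anything taken from Wu et al. or from the GPT-4 analysis.

```python
from collections import Counter

def exposure_share(item_lists):
    """Fraction of all recommendation slots that each item occupies."""
    counts = Counter(item for items in item_lists for item in items)
    total = sum(counts.values())
    return {item: n / total for item, n in counts.items()}

def interaction_share(interactions):
    """Fraction of logged interactions that each item accounts for."""
    counts = Counter(interactions)
    total = sum(counts.values())
    return {item: n / total for item, n in counts.items()}

def over_recommended(rec_share, data_share, ratio=3.0, floor=1e-6):
    """Items whose recommendation share exceeds their dataset share by `ratio`.

    `ratio` and `floor` are hypothetical knobs; `floor` avoids division by
    zero for items the dataset has never seen.
    """
    flagged = {}
    for item, share in rec_share.items():
        base = max(data_share.get(item, 0.0), floor)
        if share / base >= ratio:
            flagged[item] = share / base
    return flagged

# Hypothetical toy data: the LLM keeps recommending a title that is rare locally.
llm_recs = [
    ["shawshank", "local_doc_a"],
    ["shawshank", "local_doc_b"],
    ["shawshank", "local_doc_a"],
]
logged_interactions = ["local_doc_a"] * 6 + ["local_doc_b"] * 3 + ["shawshank"] * 1

flagged = over_recommended(
    exposure_share(llm_recs), interaction_share(logged_interactions)
)
print(flagged)
```

An item that gets flagged here is popular with the model but not with your users, which is the signature of a prior imported from somewhere other than your logs.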
Now the classic side. Traditional selection bias is a property of the feedback loop: the model learns from what past versions of the model chose to show. YouTube's multi-objective ranker treats this as a first-class problem, using a shallow position tower specifically to strip selection bias out of training data, and warns that without it, models converge on degenerate equilibria that amplify their own past decisions (see "Why do ranking systems need to model selection bias explicitly?"). There's a quieter structural version too: when embedding dimensions are too small, the system overfits toward popular items to maximize ranking quality, and that unfairness compounds over time as niche items keep getting starved of exposure (see "Does embedding dimensionality secretly drive popularity bias in recommenders?").
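To make the position-correction idea concrete, here is a heavily simplified sketch. It is not YouTube's architecture: it assumes position contributes an additive term to the click logit, replaces both towers with per-item and per-position scalars, and trains on synthetic logs whose ground-truth values are made up for the example. The point it illustrates is that fitting a separate position term during training, then dropping it at serving time, can recover an item ordering that raw click-through rate gets wrong.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Synthetic ground truth, assumed purely for illustration.
TRUE_RELEVANCE = {"A": 0.0, "B": 1.0}  # B is the better item
TRUE_POSITION = {0: 1.5, 1: -1.5}      # the top slot gets a large click boost

def simulate_logs(n_impressions, rng):
    """Return {(item, position): [shows, clicks]} under a biased logging policy."""
    cells = {(i, p): [0, 0] for i in TRUE_RELEVANCE for p in TRUE_POSITION}
    for _ in range(n_impressions):
        # The old policy puts A on top 90% of the time.
        order = ("A", "B") if rng.random() < 0.9 else ("B", "A")
        for position, item in enumerate(order):
            p_click = sigmoid(TRUE_RELEVANCE[item] + TRUE_POSITION[position])
            cells[(item, position)][0] += 1
            cells[(item, position)][1] += rng.random() < p_click
    return cells

def fit_additive(cells, steps=5000, lr=1.0):
    """Fit click logit = item_logit + pos_logit by gradient ascent on log-likelihood."""
    item_logit = {i: 0.0 for i in TRUE_RELEVANCE}
    pos_logit = {p: 0.0 for p in TRUE_POSITION}
    total = sum(shows for shows, _ in cells.values())
    for _ in range(steps):
        g_item = dict.fromkeys(item_logit, 0.0)
        g_pos = dict.fromkeys(pos_logit, 0.0)
        for (item, pos), (shows, clicks) in cells.items():
            residual = clicks - shows * sigmoid(item_logit[item] + pos_logit[pos])
            g_item[item] += residual
            g_pos[pos] += residual
        for item in item_logit:
            item_logit[item] += lr * g_item[item] / total
        for pos in pos_logit:
            pos_logit[pos] += lr * g_pos[pos] / total
    return item_logit, pos_logit

cells = simulate_logs(20000, random.Random(0))
naive_ctr = {
    i: sum(cells[(i, p)][1] for p in TRUE_POSITION)
    / sum(cells[(i, p)][0] for p in TRUE_POSITION)
    for i in TRUE_RELEVANCE
}
item_logit, pos_logit = fit_additive(cells)

# At serving time only item_logit is used; the position term is dropped.
print("naive CTR:", naive_ctr)
print("item logits:", item_logit)
```

In this toy setup raw click-through rate favors A, because the old policy kept showing it in the top slot, while the item term of the additive model ranks B higher. The individual logits are only identified up to a constant shift between the item and position terms, so only their differences are meaningful.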
Here's where they meet. Both biases push in the same direction, toward the popular, the already-seen, the safe, so in production they don't cancel; they align. The pretrained prior seeds a popularity skew before a single user clicks; the selection-bias feedback loop then takes whatever the system shows and feeds it back as "evidence" that those items deserve exposure. The dimensionality result shows the loop tightening over time; the YouTube result shows it hardening into a self-confirming equilibrium. A system that fixes only collaborative-filtering selection bias still carries the pretraining skew untouched, and Wu et al. are explicit that LLM-inherited bias needs LLM-specific mitigation, not adapted collaborative-filtering techniques (see "Where do recommendation biases come from in language models?").
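The interaction described above can be sketched as a small deterministic simulation. Everything here is a modeling assumption rather than a result from the cited notes: items have identical true appeal, the pretrained prior is represented as a starting score, and `sharpness` is a hypothetical parameter standing in for a ranker that favors higher-scored items more than proportionally.

```python
def simulate(prior, appeal, rounds, sharpness=2.0):
    """Track exposure shares as logged clicks are fed back into item scores."""
    scores = list(prior)
    history = []
    for _ in range(rounds):
        # Exposure is allocated in proportion to score ** sharpness.
        weights = [s ** sharpness for s in scores]
        total = sum(weights)
        exposure = [w / total for w in weights]
        history.append(exposure)
        # Expected clicks accrue only to what was shown, then count as evidence.
        scores = [s + e * a for s, e, a in zip(scores, exposure, appeal)]
    return history

equal_appeal = [0.5, 0.5, 0.5, 0.5]

# Item 0 starts with a small head start, standing in for a pretraining prior.
skewed = simulate([1.2, 1.0, 1.0, 1.0], equal_appeal, rounds=50)
# Control run: no prior skew, same feedback loop.
flat = simulate([1.0, 1.0, 1.0, 1.0], equal_appeal, rounds=50)

print("skewed prior, item 0 share:", skewed[0][0], "->", skewed[-1][0])
print("flat prior, item 0 share:", flat[0][0], "->", flat[-1][0])
```

With a flat prior the loop has nothing to amplify and every item keeps an equal share. With a small initial skew, the favored item's share grows every round even though no item is actually better, which is the compounding the paragraph describes.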
The part you didn't know you wanted to know: this compounding doesn't stay inside the model. Recommendation feeds act as persuasion infrastructure where effects "compound through rating contamination and selection biases," shaping not just what users see but what producers make and what populations come to believe (see "How do recommendation feeds shape what people see and believe?"). So two technical biases with separate origins don't just degrade ranking quality; they braid together into a feedback loop that reshapes the catalog and the audience around their shared blind spot.
Sources (5 notes)
Wu et al. show that LLM-based recommendation systems exhibit position bias, popularity bias, and fairness bias—unique failure modes stemming from the language model's pretraining objective and corpus demographics rather than interaction data. Mitigation requires LLM-specific approaches, not adapted collaborative filtering techniques.
GPT-4 concentrates recommendations on items popular in its pretraining corpus rather than in target datasets. The Shawshank Redemption dominates across different datasets even when they have different popularity distributions, revealing a domain-shift effect that standard debiasing methods cannot address.
YouTube's multi-objective ranker uses MMoE for conflicting objectives and a shallow position tower to remove selection bias from training data. Without both mechanisms, models converge on degenerate equilibria that amplify their own past decisions.
Research shows that when user/item embedding dimensions are too small, recommender systems overfit toward popular items to maximize ranking quality. This compounds over time as niche items receive insufficient exposure, and cannot be fixed post-hoc without treating dimensionality as a fairness hyperparameter.
Research shows recommendation systems operate as political actors: feed weights influence producer behavior, network topology drives opinion convergence, and automation enables targeted persuasion at population scale. These effects compound through rating contamination and selection biases.