SYNTHESIS NOTE
Recommender Systems

Why do ranking systems need to model selection bias explicitly?

Explores how training data from current rankers creates feedback loops that reinforce past decisions. Understanding this mechanism helps explain why naive approaches fail in production ranking systems.

Synthesis note · 2026-05-03 · sourced from Recommenders Architectures

Industrial ranking systems face two distinct problems that interact. First, objectives conflict: engagement (clicks, watch time) and satisfaction (ratings, likes, shares) are not the same thing, and naive aggregation collapses them into one signal. YouTube's solution uses Multi-gate Mixture-of-Experts (MMoE): each objective has its own gate that decides how much weight to put on each shared expert, giving soft parameter sharing rather than a fully shared or fully separate model.
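A minimal sketch of the gating idea, assuming tiny randomly initialised linear experts and one linear head per objective. The class name `TinyMMoE`, the layer sizes, and the task names are illustrative assumptions, not the production architecture; the point is only that every objective mixes the same experts with its own softmax gate.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, x):
    """Multiply a (rows x len(x)) weight matrix by the vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def relu(xs):
    return [max(0.0, v) for v in xs]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

class TinyMMoE:
    """Shared experts, plus one softmax gate and one linear head per objective."""

    def __init__(self, n_in, n_experts, n_hidden, tasks, seed=0):
        rng = random.Random(seed)
        self.experts = [random_matrix(n_hidden, n_in, rng) for _ in range(n_experts)]
        self.gates = {t: random_matrix(n_experts, n_in, rng) for t in tasks}
        self.heads = {t: random_matrix(1, n_hidden, rng) for t in tasks}

    def gate_weights(self, task, x):
        # Each objective computes its own mixture weights over the shared experts.
        return softmax(linear(self.gates[task], x))

    def forward(self, x):
        expert_out = [relu(linear(w, x)) for w in self.experts]
        outputs = {}
        for task in self.gates:
            g = self.gate_weights(task, x)
            # Gate-weighted sum of expert outputs, one hidden unit at a time.
            mixed = [sum(g[k] * expert_out[k][j] for k in range(len(expert_out)))
                     for j in range(len(expert_out[0]))]
            outputs[task] = linear(self.heads[task], mixed)[0]
        return outputs

model = TinyMMoE(n_in=4, n_experts=3, n_hidden=5,
                 tasks=["engagement", "satisfaction"])
x = [0.5, -1.0, 2.0, 0.1]  # hypothetical feature vector
scores = model.forward(x)
```

A fully shared model would correspond to every gate producing the same weights; fully separate models would correspond to each gate selecting a disjoint set of experts. The learned gates sit anywhere between those extremes.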

Second, and more insidious: training data comes from logs of the current ranker. A user may have clicked a video because it was placed at position 1, not because they preferred it. Train on that data and you reinforce whatever the ranker did before: a positive feedback loop in which the model keeps learning what it has already taught itself. The Wide & Deep-style extension here adds a shallow tower whose only job is to model position bias, factoring the rank-induced effect out of the engagement signal.
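One way to picture the shallow tower is as an additive term on the click logit. The sketch below assumes a linear main tower and a position tower reduced to one scalar per display slot; the weights, the `position_bias` table, and the convention that the position is treated as missing at serving time are illustrative assumptions here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def main_tower_logit(features, weights):
    """Stand-in for the deep model: the user-item utility, independent of slot."""
    return sum(w * f for w, f in zip(weights, features))

def position_tower_logit(position, position_bias):
    """Shallow tower: a scalar per display position (assumed already learned).

    position=None means the slot is unknown, so the tower contributes nothing.
    """
    if position is None:
        return 0.0
    return position_bias[position]

def click_probability(features, weights, position_bias, position=None):
    logit = main_tower_logit(features, weights)
    logit += position_tower_logit(position, position_bias)
    return sigmoid(logit)

weights = [0.8, -0.3]                        # hypothetical main-tower weights
position_bias = {1: 1.5, 2: 0.7, 3: 0.0}     # hypothetical per-slot bias
video = [1.0, 2.0]                           # hypothetical item features

# Training: the logged position explains part of the click, so the main
# tower is not credited with the boost that came from the slot.
p_logged_top = click_probability(video, weights, position_bias, position=1)
p_logged_low = click_probability(video, weights, position_bias, position=3)

# Serving: the position is not fed in, so ranking uses the main tower only.
p_serve = click_probability(video, weights, position_bias)
```

Because the slot effect lives in its own term, the same item scores higher in the logged data when shown at position 1, while the serving score reflects only the position-free utility.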

Two mechanisms because two failure modes: MMoE for objective conflict, shallow position tower for selection bias. If either is left untreated, the model drifts toward a degenerate equilibrium: one objective crowds out the others, or the ranker keeps promoting whatever it already ranked highly.

RL-side echo: the same multi-objective problem, the same Pareto framing. DVAO confronts the recommender world's first problem (conflicting objectives) inside multi-reward GRPO for LLMs. Accuracy, length, and format all push at once, and naive scalarization either explodes advantage magnitudes (Reward Combination) or ignores cross-objective correlation with static weights (Advantage Combination). Its fix, dynamically weighting each objective by its empirical reward variance within a rollout, is the RL analog of MMoE's soft per-objective parameter sharing: instead of a fixed mixing, let each objective's contribution adapt to where the live learning signal is. Both literatures converge on the same goal, a superior Pareto frontier across objectives rather than a single scalarized peak, and both reject fixed combination weights as the source of degeneration.

The recommender's second problem, selection bias from logging the current policy, has no clean DVAO counterpart, but it rhymes with RLVR's shortcut-amplification: in both, training on data the current model generated reinforces whatever the model already did. DVAO does not address that feedback loop, which marks the boundary of the analogy.
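The variance-weighting idea can be illustrated with a rough sketch. This is not DVAO's published formula: it assumes GRPO-style per-objective advantages (reward minus group mean, divided by group standard deviation) and assumes weights directly proportional to each objective's reward variance within the rollout group. The function names and the reward values are hypothetical.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def group_advantages(rewards, eps=1e-8):
    """GRPO-style normalisation within one group of rollouts."""
    m = mean(rewards)
    s = math.sqrt(variance(rewards))
    return [(r - m) / (s + eps) for r in rewards]

def variance_weights(rewards_by_objective, eps=1e-8):
    """Assumption: weight each objective in proportion to its reward variance."""
    variances = {k: variance(v) for k, v in rewards_by_objective.items()}
    total = sum(variances.values())
    if total <= eps:
        # No objective separates the rollouts; fall back to uniform weights.
        return {k: 1.0 / len(variances) for k in variances}
    return {k: v / total for k, v in variances.items()}

def combined_advantages(rewards_by_objective):
    weights = variance_weights(rewards_by_objective)
    n = len(next(iter(rewards_by_objective.values())))
    combined = [0.0] * n
    for name, rewards in rewards_by_objective.items():
        for i, a in enumerate(group_advantages(rewards)):
            combined[i] += weights[name] * a
    return weights, combined

# Four rollouts for one prompt, scored on three objectives (made-up values).
rollout_rewards = {
    "accuracy": [1.0, 0.0, 1.0, 0.0],
    "length":   [0.5, 0.5, 0.6, 0.4],
    "format":   [1.0, 1.0, 1.0, 1.0],  # saturated: carries no signal here
}
weights, combined = combined_advantages(rollout_rewards)
```

In this toy group the saturated format reward receives zero weight and accuracy dominates, which is the sense in which the mixing follows the live learning signal instead of a fixed schedule.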

Original note title

multi-objective ranking systems must explicitly model selection bias because data generated by the current ranker produces feedback loops