Why does inductive bias outweigh model capacity in recommender systems?
This explores why the right structural assumptions baked into a recommender — the constraints and priors that shape what it can learn — often beat simply making the model bigger or deeper.
The corpus has a surprisingly blunt answer in two closely related results: a shallow linear model can outperform deep neural networks at collaborative filtering on most datasets, but only if you give it the right constraint. In EASE (Can simpler models beat deep networks for recommendation systems?) and its sibling ESLER (Can a linear model beat deep collaborative filtering?), the trick is a single rule: an item is forbidden from predicting itself (the diagonal of the item-item weight matrix is pinned to zero). That one constraint forces every prediction to route through relationships *between* items rather than letting the model cheat by memorizing each item's own signal. The negative weights that emerge, which encode which items actively repel each other, turn out to matter more than any amount of hidden-layer depth. The lesson isn't 'simple is better'; it's that a well-chosen prior tells the model where *not* to look, and that focus is worth more than raw capacity.
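A minimal sketch of the zero-diagonal constraint described above, assuming the standard closed-form ridge solution for EASE-style models: invert the regularized item-item Gram matrix, rescale each column by its diagonal entry, and pin the diagonal to zero. The function name `fit_ease`, the `l2` value, and the toy matrix are illustrative assumptions, not taken from the notes.

```python
import numpy as np

def fit_ease(X, l2=1.0):
    """Fit an item-item weight matrix B with diag(B) = 0.

    X is a (users x items) interaction matrix; l2 is the ridge strength
    (a hypothetical default, tune it per dataset).
    """
    gram = X.T @ X + l2 * np.eye(X.shape[1])  # regularized item-item Gram matrix
    P = np.linalg.inv(gram)
    B = P / (-np.diag(P))       # column j is divided by -P[j, j]
    np.fill_diagonal(B, 0.0)    # an item may not predict itself
    return B

# Toy data (made up): 5 users, 4 items.
X_demo = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 0, 0, 1],
], dtype=float)

B = fit_ease(X_demo, l2=0.5)
scores = X_demo @ B  # each user's score for an item comes only from *other* items
```

Because the diagonal is zero, a user's score for an item never draws on that item's own column of `X_demo`. The learned off-diagonal weights are free to be positive or negative.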
Why does this happen specifically in recommendation? Because the failure modes here aren't about expressiveness; they're about systems collapsing into degenerate, self-reinforcing equilibria. A model that's simply optimized for ranking quality will happily overfit toward whatever's already popular. You can watch this directly: when embedding dimensions are too small, the system overfits to popular items to maximize ranking quality, and the effect compounds into long-term unfairness as niche items starve for exposure (Does embedding dimensionality secretly drive popularity bias in recommenders?). That can't be patched after the fact. What fixes it is a structural intervention made up front: treating dimensionality as a fairness hyperparameter, or building an explicit mechanism that breaks the loop.
That's the pattern across the corpus: the wins come from architectural priors that prevent pathologies, not from bigger function approximators. YouTube's ranker needs a dedicated shallow 'position tower' to subtract selection bias out of its training data; without it, the model converges on amplifying its own past decisions (Why do ranking systems need to model selection bias explicitly?). Accuracy-optimized models systematically crowd out minority interests and need an explicit calibration constraint, applied as a post-processing rerank, to restore proportional representation (Why do accuracy-optimized recommenders crowd out minority interests?). In each case the corrective is a designed bias, a prior about what a good recommendation *should* respect, not more learning capacity.
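The calibration constraint mentioned above can be sketched as a greedy rerank that trades relevance against the KL divergence between the user's historical interest distribution and the distribution of the list built so far. This is a simplified illustration under stated assumptions: each item has exactly one category, and the names `calibrated_rerank`, `lam`, and `alpha` are hypothetical, as is the toy data.

```python
import math

def kl_divergence(target, actual, alpha=0.01):
    """KL(target || smoothed actual); smoothing avoids division by zero."""
    total = 0.0
    for cat, p in target.items():
        if p > 0:
            q = (1 - alpha) * actual.get(cat, 0.0) + alpha * p
            total += p * math.log(p / q)
    return total

def category_distribution(items, item_category):
    """Share of each category among the given items."""
    counts = {}
    for it in items:
        cat = item_category[it]
        counts[cat] = counts.get(cat, 0) + 1
    return {cat: c / len(items) for cat, c in counts.items()}

def calibrated_rerank(scores, item_category, target, k, lam=0.5):
    """Greedily build a top-k list; lam=0 is pure relevance, higher lam favors calibration."""
    chosen, remaining = [], set(scores)
    while remaining and len(chosen) < k:
        def objective(it):
            cand = chosen + [it]
            relevance = sum(scores[i] for i in cand)
            dist = category_distribution(cand, item_category)
            return (1 - lam) * relevance - lam * kl_divergence(target, dist)
        best = max(sorted(remaining), key=objective)  # sorted() keeps ties deterministic
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy data (made up): the user's history is 70% action, 30% drama.
scores = {"a1": 0.9, "a2": 0.85, "a3": 0.8, "a4": 0.75, "d1": 0.6, "d2": 0.55}
item_category = {"a1": "action", "a2": "action", "a3": "action", "a4": "action",
                 "d1": "drama", "d2": "drama"}
target = {"action": 0.7, "drama": 0.3}

by_relevance = calibrated_rerank(scores, item_category, target, k=4, lam=0.0)
calibrated = calibrated_rerank(scores, item_category, target, k=4, lam=0.5)
```

With `lam=0.0` the list is simply the four highest-scoring items, all action. With `lam=0.5` the minority interest gets a slot, and the underlying scoring model is never retrained.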
The same logic shows up when capacity genuinely is the bottleneck: the answer is still usually a better prior, not a deeper net. Cold-start is addressed by injecting graph structure and side information so the model can reason about users and items it has never seen (Can autoencoders solve the cold-start problem in recommendations?). Sparse-user explanations are addressed by retrieval augmentation that brings in outside signal rather than squeezing more from a thin history (Can retrieval enhancement fix explainable recommendations for sparse users?). And representing a user as several attention-weighted personas, rather than one fat latent vector, buys both diversity and interpretability without a separate reranking step (Can attention mechanisms reveal which user taste explains each recommendation?). Each is a smarter assumption about the *shape* of the problem.
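As a rough illustration of the attention-weighted persona idea, the sketch below scores an item against a user who is stored as several persona vectors instead of one. It assumes plain dot-product attention with a softmax; the actual AMP-CF formulation may differ, and `persona_score` and the toy vectors are hypothetical.

```python
import numpy as np

def persona_score(personas, item):
    """Score an item for a user represented by K persona vectors.

    personas: (K, d) array, one row per persona.
    item:     (d,) candidate item embedding.
    Returns the score and the attention weight given to each persona.
    """
    logits = personas @ item                 # affinity of each persona to the item
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    user_vec = weights @ personas            # item-specific user representation
    return float(user_vec @ item), weights

# Toy data (made up): two personas in a 2-d embedding space.
personas = np.array([[1.0, 0.0],
                     [0.0, 1.0]])
score, weights = persona_score(personas, np.array([2.0, 0.0]))
```

The `weights` vector is what makes the recommendation traceable: the persona with the largest weight is the taste that the candidate item is answering.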
The twist worth taking away: even the LLM-era recommenders confirm this. When you point a large language model at recommendation, the bottleneck isn't its enormous capacity; it's the wrong inherited priors. LLM recommenders carry position, popularity, and fairness biases baked in from language pretraining. These failure modes have nothing to do with interaction data and can't be patched with borrowed collaborative-filtering tricks (Where do recommendation biases come from in language models?). So the most capable models available still lose to the inductive-bias problem. Capacity sets the ceiling; the prior decides whether you ever reach it.
Sources (9 notes)
EASE, a shallow linear item-item weight matrix with diagonal constrained to zero, beats deep neural baselines on most datasets. The constraint forces generalization by forbidding self-prediction, while learned negative weights capture item dissimilarity—a structural prior more valuable than model capacity.
ESLER, a single-layer linear autoencoder constrained so items cannot predict themselves, outperforms most deep CF models. The constraint forces prediction through item relationships, and negative weights encoding anti-affinity prove essential—structural bias matters more than model capacity.
Research shows that when user/item embedding dimensions are too small, recommender systems overfit toward popular items to maximize ranking quality. This compounds over time as niche items receive insufficient exposure, and cannot be fixed post-hoc without treating dimensionality as a fairness hyperparameter.
YouTube's multi-objective ranker uses MMoE for conflicting objectives and a shallow position tower to remove selection bias from training data. Without both mechanisms, models converge on degenerate equilibria that amplify their own past decisions.
Accuracy-optimized models systematically miscalibrate by over-weighting dominant user interests. A post-processing reranking algorithm that enforces calibration constraints can restore proportional representation without retraining the underlying model.
GHRS uses graph features and deep autoencoders to integrate rating history with side information, enabling predictions for new users and items by discovering non-linear relationships that linear hybrid methods miss.
ERRA combines model-agnostic review retrieval with personalized aspect selection to address data sparsity that embedded methods cannot solve. Retrieval augmentation provides richer signal when user history is sparse, while aspect personalization ensures explanations match user context rather than generic defaults.
AMP-CF represents each user as multiple latent personas weighted dynamically by candidate item. This makes recommendations both diverse and interpretable—each suggestion traces to the specific persona preference it satisfies—without requiring post-hoc reranking.
Wu et al. show that LLM-based recommendation systems exhibit position bias, popularity bias, and fairness bias—unique failure modes stemming from the language model's pretraining objective and corpus demographics rather than interaction data. Mitigation requires LLM-specific approaches, not adapted collaborative filtering techniques.