Can a linear model beat deep collaborative filtering?
Does a shallow linear autoencoder with a zero-diagonal constraint outperform deeper neural models on collaborative filtering tasks? This challenges the field's assumption that depth and nonlinearity drive performance.
A surprising empirical result: a linear model with no hidden layer outperforms most deep collaborative-filtering models. EASE^R (pronounced "easer") is a single item-item weight matrix B trained as an autoencoder: the input vector is the user's interaction history, and the output reconstructs that same history. The only non-trivial constraint is that the diagonal of B must be zero, so an item cannot use itself to predict itself.
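Written out, the training problem described above can be sketched as follows. The symbols X (the users-by-items interaction matrix) and λ (an L2 regularization strength) do not appear in this note; they follow the formulation in the Embarrassingly Shallow Autoencoders paper listed among the related papers, as best recalled here.

```latex
\min_{B} \; \lVert X - XB \rVert_F^2 + \lambda \lVert B \rVert_F^2
\quad \text{subject to} \quad \operatorname{diag}(B) = 0
```

The first term is the reconstruction error of the autoencoder, the second keeps the weights small, and the constraint is the zero diagonal.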
This constraint does all the work. Without it, the model trivially copies inputs to outputs and learns nothing. With it, predicting whether a user likes item i forces the model to express i in terms of the other items the user interacted with, which is exactly what generalization in collaborative filtering requires. About 60% of the learned weights turn out to be negative, indicating that the model also learns dissimilarities between items, not just similarities. Setting the negative weights to zero degrades performance to roughly the level of SLIM (Sparse Linear Methods), which combines L1 regularization with a non-negativity constraint on the weights. That suggests what makes easer special is not sparsity but the ability to encode anti-affinity.
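The claim that the unconstrained model merely copies its input can be checked on a toy example. The sketch below is illustrative only: the matrix X, the penalty `lam`, and the name `B_free` are assumptions made for this example, and the solver is ordinary ridge regression without the zero-diagonal constraint.

```python
import numpy as np

# Toy user-item matrix (hypothetical data): 5 users x 4 items.
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 0, 0, 1],
], dtype=float)

G = X.T @ X   # item-item co-occurrence (Gram) matrix
lam = 1e-6    # tiny L2 penalty, chosen only for illustration

# Unconstrained ridge solution of min ||X - XB||^2 + lam * ||B||^2,
# obtained from the normal equations (G + lam * I) B = G.
B_free = np.linalg.solve(G + lam * np.eye(G.shape[0]), G)

# With no zero-diagonal constraint, B_free is numerically the identity:
# each item "predicts" itself and nothing is learned about other items.
print(np.round(B_free, 3))
```

With a larger penalty the solution shrinks away from the identity, but most of the weight still sits on the diagonal, which is why the constraint, and not the regularization, is what forces the model to use other items.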
The closed-form training takes a few lines of code and orders of magnitude less time than SLIM. The result challenges the field's assumption that depth and nonlinearity are essential for CF: the right structural constraint matters more than expressive capacity, mirroring the Rendle et al. result that a plain dot product holds up against learned MLP similarity functions.
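A minimal sketch of that closed-form fit, assuming the solution given in the Embarrassingly Shallow Autoencoders paper: with P = (XᵀX + λI)⁻¹, the off-diagonal weights are B_ij = -P_ij / P_jj and the diagonal is zero. The function name `fit_ease`, the default `lam`, and the toy matrix are hypothetical choices for illustration, not part of the note.

```python
import numpy as np

def fit_ease(X, lam=1.0):
    """Closed-form fit of a zero-diagonal item-item weight matrix.

    X   : (num_users, num_items) interaction matrix.
    lam : L2 regularization strength (assumed value; tune per dataset).
    """
    G = X.T @ X + lam * np.eye(X.shape[1])  # regularized Gram matrix
    P = np.linalg.inv(G)                    # dense inverse, items x items
    B = P / (-np.diag(P))                   # column j divided by -P_jj
    np.fill_diagonal(B, 0.0)                # enforce the zero diagonal
    return B

# Toy user-item matrix (hypothetical data): 5 users x 4 items.
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 0, 0, 1],
], dtype=float)

B = fit_ease(X, lam=1.0)

# Score every item for every user, mask items already seen,
# and take the best remaining item as the recommendation.
scores = X @ B
scores[X > 0] = -np.inf
top_item = scores.argmax(axis=1)
```

The only expensive step is one dense matrix inversion over the item dimension, so the cost depends on the number of items and not on the number of users or interactions.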
Inquiring lines that use this note as a source 38
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Why do negative item weights matter more than model depth?
- How does precision matrix structure differ from covariance in recommendations?
- How does the zero-diagonal constraint enable generalization in collaborative filtering?
- Can embedding-based integration preserve both LLM text strength and collaborative filtering signal?
- Why do LLM recommenders underperform item-only collaborative filtering baselines?
- How do structural constraints like zero self-similarity improve collaborative filtering?
- Why does inductive bias outweigh model capacity in recommender systems?
- What structural constraints replace depth in collaborative filtering?
- What happens when multiple recommendation objectives compete without explicit modeling?
- Do embedding collisions explain popularity overfitting in recommendation models?
- Why do embedding-based recommendation models fail with sparse user history?
- Why do linear hybrid models fail to capture user-item relationships?
- What non-linear patterns do autoencoders discover that matrix factorization misses?
- Why do standard supervised models miss high-order connectivity in recommendations?
- Why do dual-encoder embeddings fail to capture task-relevant recommendations despite semantic similarity?
- How does this compare to trained autoencoder approaches for thought sharing?
- Can structural priors outperform raw model capacity in collaborative filtering?
- Why does sparsity per user make probabilistic models more effective?
- Can simpler collaborative filtering models outperform deep architectures?
- How does VAE regularization strength affect sparse implicit feedback data?
- How does per-user sparsity influence likelihood choice for recommendations?
- How does popularity bias emerge from low-dimensional embeddings?
- Why does per-user sparsity make cross-user aggregation essential for recommendations?
- Why do multinomial likelihoods outperform Gaussian models for recommendation?
- Can fractured entangled representations hide undetected by standard analysis methods?
- Why do singular value experts compose better than low-rank adapter subspaces?
- Can hypernetworks generate recommendation parameters more efficiently than retraining full models?
- What is the curse of directionality in aggregation-based recommenders?
- Why does weight sparsity reduce superposition and force disentangled representations?
- Why do cross-product features memorize better than dense embeddings?
- What sparse high-rank patterns does the deep tower fail to capture?
- Can autoencoders act as associative memory systems like Hopfield networks?
- Why should deep learning theory prioritize average-case over worst-case analysis?
- What does a human-parseable framework for deep learning look like?
- Do generic kernel-decay assumptions alone explain coarse-to-fine spectral ordering?
- What makes regularization an implicit factor in embedding geometry?
- Can encoder-only architectures match decoder-based sequential models for recommendation?
- Can attention linearity achieve similar efficiency gains as weight quantization?
Related concepts in this collection 4
- Can simpler models beat deep networks for recommendation systems?
  Does removing hidden layers and constraining self-similarity create a more effective collaborative filtering approach than deep autoencoders? This challenges the assumption that architectural depth drives performance.
  extends: paired restatement of the same EASE/easer result, emphasizing the precision-matrix-vs-covariance distinction
- Why does dot product beat MLP-based similarity in practice?
  Neural Collaborative Filtering theory suggests MLPs should outperform dot products as universal approximators. But what explains the empirical gap, and what role do data scale and deployment constraints play?
  complements: paired anti-deep-CF lesson; the right inductive bias matters more than the universal approximation guarantee
- Can MLPs learn to match dot product similarity in practice?
  Universal approximation theory suggests MLPs should learn any similarity function, including dot product. But does this theoretical promise hold up when training on real, finite datasets with practical constraints?
  complements: capacity-vs-bias point at the similarity layer; easer makes it at the architecture-depth layer
- Why does multinomial likelihood work better for click prediction?
  Explores whether the choice of likelihood function (multinomial versus Gaussian or logistic) affects recommendation performance, and what structural properties make one better suited to modeling user clicks.
  complements: another simpler-with-the-right-prior result; likelihood choice matters more than depth
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Embarrassingly Shallow Autoencoders for Sparse Data
- Variational Autoencoders for Collaborative Filtering
- Collaborative Deep Learning for Recommender Systems
- Neural Collaborative Filtering vs. Matrix Factorization Revisited
- Neural Collaborative Filtering
- GHRS: Graph-based Hybrid Recommendation System with Application to Movie Recommendation
- Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5)
- Wide & Deep Learning for Recommender Systems
Original note title
EASE^R (easer) beats deep models on collaborative filtering by constraining self-similarity to zero, proving model depth is not what mattered