Can MLPs learn to match dot product similarity in practice?
Universal approximation theory says an MLP can represent any similarity function, including the dot product. Does that promise hold up when training on real, finite datasets under practical constraints?
The Neural Collaborative Filtering paper popularized replacing the dot product with a learned MLP for combining user and item embeddings. The justification was theoretical: an MLP is a universal function approximator, so it can in principle learn any similarity function, including the dot product, and presumably better ones. Rendle et al.'s revisit ("Neural Collaborative Filtering vs. Matrix Factorization Revisited") shows that this argument fails both empirically and operationally.
Empirically, once hyperparameters are tuned carefully, a dot product baseline substantially outperforms the MLP. More pointedly, getting an MLP to learn a dot product takes large model capacity and a lot of training data. The universal approximation guarantee says that a suitable network exists, not that training will find it from a finite dataset, so inductive bias matters more than expressiveness. The MLP is too flexible for the task: its inductive bias points away from the simple geometric similarity that actually fits the data.
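One way to probe the capacity claim on synthetic data is sketched below. This is an illustrative setup, not the experiment from Rendle et al.: the embeddings are random Gaussians, the hidden layer is random and fixed, and only the output layer is fit by least squares, so it understates what a fully trained MLP could do. All function names and sizes are assumptions made for the example.

```python
import numpy as np

def make_pairs(n, d, rng):
    """Sample n (user, item) embedding pairs and their dot-product targets."""
    p = rng.normal(size=(n, d))
    q = rng.normal(size=(n, d))
    return p, q, np.sum(p * q, axis=1)

def relu_features(p, q, W, b):
    """One hidden ReLU layer applied to the concatenation [p; q]."""
    x = np.concatenate([p, q], axis=1)
    return np.maximum(x @ W + b, 0.0)

def with_bias(H):
    """Append a constant column so the readout has an intercept."""
    return np.hstack([H, np.ones((H.shape[0], 1))])

def fit_readout(H, y):
    """Least-squares output layer on top of fixed hidden features."""
    w, *_ = np.linalg.lstsq(with_bias(H), y, rcond=None)
    return w

def mse(H, w, y):
    return float(np.mean((with_bias(H) @ w - y) ** 2))

rng = np.random.default_rng(0)
d = 8
p_tr, q_tr, y_tr = make_pairs(2000, d, rng)
p_te, q_te, y_te = make_pairs(2000, d, rng)

# Held-out error of the MLP-style scorer at several hidden widths.
results = {}
for width in (16, 64, 256):
    W = rng.normal(size=(2 * d, width)) / np.sqrt(2 * d)
    b = rng.normal(size=width)
    w = fit_readout(relu_features(p_tr, q_tr, W, b), y_tr)
    results[width] = mse(relu_features(p_te, q_te, W, b), w, y_te)

# The dot product itself has zero error on this task by construction.
baseline = float(np.mean((np.sum(p_te * q_te, axis=1) - y_te) ** 2))

for width, err in results.items():
    print(f"hidden width {width}: held-out MSE {err:.3f}")
print(f"dot product: held-out MSE {baseline:.3f}")
```

The dot product needs no parameters beyond the embeddings to solve this task exactly, while the MLP-style scorer has to assemble each multiplicative term from additive inputs. Varying the width and the number of training pairs shows how much of either it takes to close the gap.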
Operationally, dot products allow maximum inner product search (MIPS) over precomputed item embeddings, which is fast enough for real-time serving over millions of items. An MLP similarity needs a forward pass for every item on every query, so its scores cannot be precomputed or indexed the same way. Even if MLPs were marginally more accurate, they would be too expensive to serve at that scale.
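The serving difference can be sketched with NumPy. The sizes, the random weights, and the helper names below are hypothetical, and the retrieval shown is exact brute force. A production system would typically put an approximate MIPS index over the item matrix, which is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, d, hidden, k = 10_000, 32, 64, 10

# Item embeddings are computed offline and stored once.
item_emb = rng.normal(size=(n_items, d))
user_emb = rng.normal(size=d)

# Hypothetical MLP scorer weights (random here, learned in a real system).
W1 = rng.normal(size=(2 * d, hidden)) / np.sqrt(2 * d)
b1 = np.zeros(hidden)
w2 = rng.normal(size=hidden) / np.sqrt(hidden)

def dot_scores(user, items):
    """One matrix-vector product over the precomputed item matrix."""
    return items @ user

def mlp_scores(user, items):
    """A forward pass per (user, item) pair: the hidden activations
    depend on both inputs, so they cannot be stored per item."""
    x = np.concatenate([np.broadcast_to(user, items.shape), items], axis=1)
    h = np.maximum(x @ W1 + b1, 0.0)
    return h @ w2

def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    top = np.argpartition(-scores, k - 1)[:k]
    return top[np.argsort(-scores[top])]

top_dot = top_k(dot_scores(user_emb, item_emb), k)
top_mlp = top_k(mlp_scores(user_emb, item_emb), k)
```

With the dot product, the query touches the item side only through `items @ user`, which is exactly the operation a MIPS index accelerates. The MLP scorer has no such fixed item-side structure to index, so every query pays for all `n_items` forward passes.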
The takeaway: an inductive bias that matches the geometry of the problem (dot product) wins over an expressive parameterization that has to learn the geometry from scratch.
Inquiring lines that use this note as a source (10)
This note is a source for these synthesized inquiries.
- What makes dot product efficient for real-time retrieval over millions of items?
- How do MIPS algorithms constrain the choice of similarity functions?
- Can universal function approximators be expensive to learn in practice?
- Can neural networks implement genuine algorithms or only statistical pattern matching?
- What attentional bias objectives compete with dot product similarity for associative memory?
- Do substitute networks converge differently than complement networks?
- Why do cross-product features fail to generalize across unseen feature combinations?
- Why is a combinatorial framework better than family resemblance classification?
- Which hyperparameter theories best explain universal behaviors across neural networks?
- How should practitioners measure similarity between embeddings safely?
Related concepts in this collection (4)
- Why does dot product beat MLP-based similarity in practice?
  Neural Collaborative Filtering theory suggests MLPs should outperform dot products as universal approximators. But what explains the empirical gap, and what role do data scale and deployment constraints play?
  extends: paired statement of the same Rendle result emphasizing the inductive-bias-vs-capacity framing
- Can simpler models beat deep networks for recommendation systems?
  Does removing hidden layers and constraining self-similarity create a more effective collaborative filtering approach than deep autoencoders? This challenges the assumption that architectural depth drives performance.
  complements: same anti-deep-CF lesson at architecture level; capacity isn't the bottleneck
- Can a linear model beat deep collaborative filtering?
  Does a shallow linear autoencoder with a zero-diagonal constraint outperform deeper neural models on collaborative filtering tasks? This challenges the field's assumption that depth and nonlinearity drive performance.
  complements: same lesson; inductive bias and structural constraints matter more than depth or non-linearity
- Why does multinomial likelihood work better for ranking recommendations?
  Explores whether the choice of likelihood function in VAE-based collaborative filtering matters for matching training objectives to ranking evaluation metrics. Why items should compete for probability mass.
  complements: another structural-prior-matters-more result: likelihood choice over architectural depth
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Neural Collaborative Filtering vs. Matrix Factorization Revisited
- KAN: Kolmogorov-Arnold Networks
- Curse of “Low” Dimensionality in Recommender Systems
- Flooding Spread of Manipulated Knowledge in LLM-Based Multi-Agent Communities
- Deep Interest Network for Click-Through Rate Prediction
- On the Theoretical Limitations of Embedding-Based Retrieval
- Is Cosine-Similarity of Embeddings Really About Similarity?
- Titans: Learning to Memorize at Test Time
Original note title
MLP similarity does not approximate dot product in practice — universal approximation theorems do not survive contact with finite data