Can MLPs learn to match dot product similarity in practice?

Universal approximation theory suggests MLPs should be able to learn any similarity function, including the dot product. But does this theoretical promise hold up when training on real, finite datasets under practical constraints?

Synthesis note · 2026-05-03 · sourced from Recommenders Architectures

The Neural Collaborative Filtering paper popularized replacing the dot product with a learned MLP for combining user and item embeddings. The justification was theoretical: an MLP is a universal function approximator, so in principle it can learn any similarity function, including the dot product, and presumably better ones. Rendle et al.'s revisit shows that this argument fails both empirically and operationally.
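For reference, the two similarity functions being compared can be written as follows. The notation is an assumption made for illustration and does not come from this note: p is the user embedding, q the item embedding, d the embedding dimension, and f_θ an MLP with parameters θ applied to the concatenated pair.

```latex
\phi^{\text{dot}}(p, q) = \langle p, q \rangle = \sum_{f=1}^{d} p_f \, q_f,
\qquad
\phi^{\text{MLP}}(p, q) = f_{\theta}\big([p ; q]\big),
\qquad p, q \in \mathbb{R}^{d}
```

The dot product has no parameters beyond the embeddings themselves, whereas the MLP has to learn θ so that f_θ reproduces the multiplicative interaction between p and q.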

Empirically, a dot product baseline with carefully tuned hyperparameters substantially outperforms the MLP. More pointedly, getting an MLP to learn a dot product requires large model capacity and a lot of training data. The universal approximation guarantee is asymptotic, and with finite data, inductive bias matters more than expressiveness. The MLP is too flexible for the task: its inductive bias points away from the simple geometric similarity that actually fits the data.
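A rough way to see why capacity matters is to fit a small network to the dot product directly. The sketch below is a simplified stand-in, not the experiment from Rendle et al.: it uses a one-hidden-layer ReLU network whose hidden weights are frozen at random values, with only the output layer fit by least squares. All sizes, the seed, and the function names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_train, n_test = 8, 2000, 2000  # assumed sizes, for illustration only


def make_pairs(n):
    """Sample n (user, item) embedding pairs and their exact dot products."""
    p = rng.normal(size=(n, dim))
    q = rng.normal(size=(n, dim))
    return np.hstack([p, q]), np.sum(p * q, axis=1)


x_train, y_train = make_pairs(n_train)
x_test, y_test = make_pairs(n_test)

# Frozen random hidden layer. Narrower networks reuse the first columns,
# so the feature sets are nested across widths.
max_width = 512
W = rng.normal(size=(2 * dim, max_width)) / np.sqrt(2 * dim)
b = rng.normal(size=max_width)


def hidden(x, width):
    """ReLU activations of the first `width` hidden units."""
    return np.maximum(0.0, x @ W[:, :width] + b[:width])


def fit_and_eval(width):
    """Fit the output layer by least squares; return (train RMSE, test RMSE)."""
    h_train = hidden(x_train, width)
    w_out, *_ = np.linalg.lstsq(h_train, y_train, rcond=None)
    train_rmse = np.sqrt(np.mean((h_train @ w_out - y_train) ** 2))
    test_rmse = np.sqrt(np.mean((hidden(x_test, width) @ w_out - y_test) ** 2))
    return train_rmse, test_rmse


# The dot product itself has zero error on this target by construction.
results = {width: fit_and_eval(width) for width in (8, 64, 512)}
for width, (train_rmse, test_rmse) in results.items():
    print(f"width={width:4d}  train RMSE={train_rmse:.3f}  test RMSE={test_rmse:.3f}")
```

Because the feature sets are nested, the training error cannot increase as the width grows. How quickly it falls, and how the test error behaves, depends on the seed and the sizes chosen here, so no specific numbers are claimed.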

Operationally, dot products allow maximum inner product search (MIPS) over precomputed item embeddings, which is fast enough for real-time serving over millions of items. An MLP similarity requires a forward pass per item per query: its score depends jointly on the user and the item, so it cannot be precomputed. So even if MLPs were marginally more accurate, they would be unaffordable in production.
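The serving difference can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the embeddings and MLP weights are random stand-ins for a trained model, the catalog size and dimensions are arbitrary, and the dot product path uses an exhaustive scan rather than an approximate MIPS index.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, dim, hidden_units, k = 10_000, 32, 64, 5  # assumed sizes

# Item embeddings are fixed after training, so they can be stored offline.
item_emb = rng.normal(size=(n_items, dim))
user_emb = rng.normal(size=dim)


def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    candidates = np.argpartition(-scores, k - 1)[:k]
    return candidates[np.argsort(-scores[candidates])]


def dot_scores(user, items):
    # One matrix-vector product over the precomputed item matrix.
    return items @ user


# Hypothetical MLP scorer weights (random here, not a trained model).
W1 = rng.normal(size=(2 * dim, hidden_units)) / np.sqrt(2 * dim)
b1 = np.zeros(hidden_units)
w2 = rng.normal(size=hidden_units) / np.sqrt(hidden_units)


def mlp_scores(user, items):
    # The nonlinearity is applied to the joint (user, item) input, so the
    # hidden layer has to be evaluated for every item on every query.
    pairs = np.hstack([np.broadcast_to(user, items.shape), items])
    return np.maximum(0.0, pairs @ W1 + b1) @ w2


dot_top = top_k(dot_scores(user_emb, item_emb), k)
mlp_top = top_k(mlp_scores(user_emb, item_emb), k)
print("dot product top-k:", dot_top)
print("MLP top-k:        ", mlp_top)
```

In a production setting the exhaustive `items @ user` scan would typically be replaced by a nearest-neighbor index built over the stored item embeddings. No comparable index applies to the MLP scorer, because its ranking is not determined by an inner product with a fixed item vector.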

The takeaway: an inductive bias that matches the geometry of the problem (dot product) wins over an expressive parameterization that has to learn the geometry from scratch.

Inquiring lines that read this note 10

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Why do semantic similarity and task relevance diverge in vector embeddings?

How does example difficulty affect learning efficiency in language models?

Can universal function approximators be expensive to learn in practice?

Does recurrence enable reasoning capabilities that fixed-depth transformers cannot achieve?

Can neural networks implement genuine algorithms or only statistical pattern matching?

How do transformer attention mechanisms implement memory and algorithmic functions?

What attentional bias objectives compete with dot product similarity for associative memory?

How does reasoning graph topology affect breakthrough insights and generalization?

Do substitute networks converge differently than complement networks?

Can graph structure and relationships fundamentally improve recommendation systems?

Why do cross-product features fail to generalize across unseen feature combinations?

What limits mechanistic interpretability's ability to characterize models?

Which hyperparameter theories best explain universal behaviors across neural networks?

Related concepts in this collection 4


Why does dot product beat MLP-based similarity in practice? Neural Collaborative Filtering theory suggests MLPs should outperform dot products as universal approximators. But what explains the empirical gap, and what role do data scale and deployment constraints play?
extends: paired statement of the same Rendle result emphasizing the inductive-bias-vs-capacity framing
Can simpler models beat deep networks for recommendation systems? Does removing hidden layers and constraining self-similarity create a more effective collaborative filtering approach than deep autoencoders? This challenges the assumption that architectural depth drives performance.
complements: same anti-deep-CF lesson at architecture level — capacity isn't the bottleneck
Can a linear model beat deep collaborative filtering? Does a shallow linear autoencoder with a zero-diagonal constraint outperform deeper neural models on collaborative filtering tasks? This challenges the field's assumption that depth and nonlinearity drive performance.
complements: same lesson — inductive bias and structural constraints matter more than depth or non-linearity
Why does multinomial likelihood work better for ranking recommendations? Explores whether the choice of likelihood function in VAE-based collaborative filtering matters for matching training objectives to ranking evaluation metrics. Why items should compete for probability mass.
complements: another structural-prior-matters-more result — likelihood choice over architectural depth
