Why does dot product beat MLP-based similarity in practice?

Neural Collaborative Filtering argues that MLPs, as universal approximators, should outperform dot products. What explains the empirical gap in the opposite direction, and what role do data scale and deployment constraints play?

Synthesis note · 2026-05-03 · sourced from Recommenders Architectures

Neural Collaborative Filtering popularized replacing the dot product between user and item embeddings with a learned MLP, on the theory that an MLP — a universal function approximator — should subsume the dot product as a special case. Rendle and colleagues revisit the experiments and show two non-obvious results.

First, with proper hyperparameter tuning, the simple dot product substantially outperforms the MLP-based similarity. The original NCF gain came from undertuning the dot-product baseline, not from MLP expressiveness. Second, even though an MLP can in theory approximate any function, learning a dot product with an MLP requires both a large model and a large training set — the inductive bias of MLPs makes the dot-product structure expensive to recover from data.
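
One way to see the inductive-bias point is to write the two similarity functions side by side. The notation is an assumption made for illustration (user embedding p, item embedding q, embedding dimension d, an MLP with L layers and activation σ); the note itself does not fix any symbols.

```latex
\phi^{\text{dot}}(p, q) = \langle p, q \rangle = \sum_{f=1}^{d} p_f \, q_f
\qquad
\phi^{\text{MLP}}(p, q) = f_{W_L, b_L}\bigl(\cdots f_{W_1, b_1}([p, q]) \cdots\bigr),
\quad f_{W, b}(x) = \sigma(Wx + b)
```

The dot product multiplies matching coordinates directly. Each MLP layer only forms weighted sums of the concatenated input [p, q] followed by a nonlinearity, so the products p_f q_f are not built in and have to be approximated from training data.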

The practical bite is in inference. Dot products admit Maximum Inner Product Search (MIPS) algorithms, typically approximate, that retrieve the top-K items in sublinear time over millions of items. An MLP similarity requires a forward pass per (user, item) pair, which is intractable at production scale. The paper concludes that MLPs as embedding combiners should be "used with care". The point is reinforced by the fact that the DNN architectures most competitive in NLP (transformers) and vision (ResNets) all use dot products in their output layers. Universal approximation does not make an operator a universally good choice; its inductive bias interacts with data scale and serving constraints.
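
The serving difference can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's setup: the embeddings are random, the sizes are arbitrary, and `mlp_score` with its weights `w1`, `b1`, `w2` is a hypothetical one-hidden-layer scorer. Both paths below score every item exhaustively; the difference is that the dot-product path is exactly the operation a MIPS index can replace, while the MLP path has no such shortcut.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, dim, hidden, k = 10_000, 32, 64, 5  # illustrative sizes (assumed)

item_emb = rng.standard_normal((n_items, dim)).astype(np.float32)
user_emb = rng.standard_normal(dim).astype(np.float32)

def top_k(scores, k):
    """Return indices of the k highest scores, best first."""
    cand = np.argpartition(-scores, k - 1)[:k]   # unordered top-k candidates
    return cand[np.argsort(-scores[cand])]       # order the k candidates

def dot_scores(user, items):
    """Dot-product similarity: one matrix-vector product over the item table."""
    return items @ user

# Hypothetical MLP scorer weights (random, untrained; for shape only).
w1 = rng.standard_normal((2 * dim, hidden)).astype(np.float32)
b1 = np.zeros(hidden, dtype=np.float32)
w2 = rng.standard_normal(hidden).astype(np.float32)

def mlp_score(user, items):
    """MLP similarity: a forward pass on [user, item] for every single item."""
    x = np.concatenate([np.broadcast_to(user, items.shape), items], axis=1)
    h = np.maximum(x @ w1 + b1, 0.0)             # ReLU hidden layer
    return h @ w2

top_dot = top_k(dot_scores(user_emb, item_emb), k)
top_mlp = top_k(mlp_score(user_emb, item_emb), k)
print("dot-product top-k:", top_dot)
print("MLP top-k:        ", top_mlp)
```

In a production system the `dot_scores` plus `top_k` step would typically be swapped for an approximate MIPS index built once over `item_emb`, because the score depends on the item only through its fixed embedding. The MLP score depends on the user and item jointly through the hidden layer, so the candidate set has to be scored pair by pair.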

Inquiring lines that read this note 10

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

What structural factors drive popularity bias in recommendation systems?

How does the zero-diagonal constraint enable generalization in collaborative filtering?

Why do semantic similarity and task relevance diverge in vector embeddings?

How can LLM recommenders match or exceed collaborative filtering performance?

Can simpler collaborative filtering models outperform deep architectures?

How do transformer attention mechanisms implement memory and algorithmic functions?

What attentional bias objectives compete with dot product similarity for associative memory?

Can graph structure and relationships fundamentally improve recommendation systems?

Related concepts in this collection 4


Can MLPs learn to match dot product similarity in practice?
Universal approximation theory suggests MLPs should learn any similarity function, including dot product. But does this theoretical promise hold up when training on real, finite datasets with practical constraints?
extends: paired statement of the same Rendle result, emphasizing the practical infeasibility of efficient retrieval

Can simpler models beat deep networks for recommendation systems?
Does removing hidden layers and constraining self-similarity create a more effective collaborative filtering approach than deep autoencoders? This challenges the assumption that architectural depth drives performance.
complements: same lesson at the architecture level; the right structural constraint beats depth

Can a linear model beat deep collaborative filtering?
Does a shallow linear autoencoder with a zero-diagonal constraint outperform deeper neural models on collaborative filtering tasks? This challenges the field's assumption that depth and nonlinearity drive performance.
complements: same anti-depth lesson; anti-affinity and dot-product priors both outperform learned alternatives

Can one model memorize and generalize better than two?
Does training memorization and generalization components jointly in a single model outperform training them separately and combining their predictions? This matters for building efficient recommendation systems that handle both rare and common user behaviors.
complements: industrial systems use simple structural priors (wide cross-product) for memorization rather than relying on MLP universality
