How do MIPS algorithms constrain the choice of similarity functions?
This explores why production-scale retrieval systems can't pick any similarity function they want — the MIPS (Maximum Inner Product Search) algorithms that make billion-item retrieval fast only work for one geometric shape, and that constraint quietly dictates how models score matches.
This explores why the choice of similarity function in large retrieval systems isn't free: MIPS — Maximum Inner Product Search — is the family of algorithms that lets you find the best match among billions of items in milliseconds, and it only works when 'similarity' is a dot product. The constraint runs backwards from infrastructure to model design. You'd think the smart move is to learn the richest possible scoring function, but the corpus shows the opposite pressure: a function you can't index at scale is a function you can't use in production, no matter how expressive it is.
The cleanest evidence sits in the collaborative-filtering work where Rendle and colleagues pit learned MLP similarities against a plainly-tuned dot product Why does dot product beat MLP-based similarity in practice?. An MLP is a universal function approximator — in theory it can represent any similarity, including the dot product itself. In practice it loses, and badly: it needs far more capacity and data just to match simple geometric similarity Can MLPs learn to match dot product similarity in practice?. But the deeper point for MIPS is the second failure: even if an MLP *did* learn a great similarity, you couldn't retrieve against it efficiently. There's no indexing trick that finds the top match of an arbitrary neural network over a billion candidates. The dot product wins twice — once on accuracy from inductive bias, and once because MIPS algorithms can only accelerate inner-product geometry. That's the constraint in one sentence: pick a non-dot-product similarity and you give up sub-linear retrieval.
What's interesting is that the field keeps trying to *escape* this constraint by splitting scoring into two stages — use cheap dot-product MIPS to recall candidates, then apply an expensive, expressive function only to the survivors. You can see this exact move in retrieval verification: a pooled-cosine (dot-product-style) recall pass pulls candidates fast, then a small Transformer verifier operating on full token-token interaction patterns rejects the structural near-misses that the compressed vectors waved through Can verification separate structural near-misses from topical matches?. The expensive function never has to run at billion-scale because MIPS already narrowed the field. The constraint isn't defeated — it's quarantined to the first stage, which is the only stage that has to be MIPS-compatible.
The same architectural logic — keep the part that must scale geometrically simple, push richness elsewhere — shows up in how memory and computation get split. Engram pairs O(1) lookup with learned routing rather than asking one mechanism to do both, finding that balanced allocation beats either alone Can lookup memory and computation work together better than either alone?. The recurring lesson across these notes is that retrievability is a first-class design constraint, not an afterthought: the geometry that makes search tractable (inner products, indexable vectors) is decided *before* you decide how clever your scoring can be.
If you want the punchline a curious reader might not expect: MIPS doesn't just prefer the dot product, it effectively forbids alternatives at scale, and that's why a 'worse' but indexable similarity beats a 'better' but un-indexable one. Expressiveness you can't retrieve against is expressiveness you can't deploy — so the algorithm that does the searching ends up dictating the math the model is allowed to learn.
Sources 4 notes
Rendle et al. show properly-tuned dot products substantially beat MLP-based similarity despite MLP universality. Learning a dot product with an MLP requires large models and datasets; dot products also enable efficient retrieval at production scale through MIPS algorithms.
Rendle et al. show that carefully tuned dot products substantially outperform learned MLP similarities in collaborative filtering. MLPs require excessive capacity and data to match simple geometric similarity, and they cannot be efficiently retrieved at scale—proving inductive bias matters more than expressiveness.
A two-stage pipeline—pooled-cosine recall followed by a small Transformer verifier operating on token-token similarity maps—reliably rejects structural near-misses that MaxSim-style late interaction cannot. The verifier succeeds because it operates on full token interaction patterns rather than compressed vectors.
Engram combines O(1) N-gram lookup with Mixture-of-Experts routing, revealing a U-shaped scaling law where balanced allocation to both mechanisms outperforms either alone. Gains appear largest in reasoning and code rather than pure retrieval.