SYNTHESIS NOTE

Does cosine similarity actually measure embedding similarity?

Cosine similarity is ubiquitous for comparing learned embeddings, but does it reliably capture semantic closeness? This work investigates whether regularization during training makes cosine scores arbitrary and unstable.

Synthesis note · 2026-06-03 · sourced from Flaws

Cosine similarity is the default tool for quantifying semantic similarity between learned embeddings, on the intuition that direction matters more than norm. This paper shows that intuition is unsafe. Using regularized linear (matrix-factorization) models where closed-form solutions allow analysis, it derives that cosine similarities can be arbitrary and therefore meaningless: for some models they are not even unique, and for others they are implicitly controlled by the regularization applied during training. Since deep models combine multiple regularizations with implicit and unintended effects, taking cosine similarities of their embeddings can render results opaque and possibly arbitrary.
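The non-uniqueness claim can be illustrated with a minimal sketch. It assumes a factorization whose predictions are the product A Bᵀ; the matrices, their sizes, and the rescaling vector `d` below are illustrative choices, not values from the paper. Any training objective that depends only on the product A Bᵀ cannot distinguish (A, B) from (A·diag(d), B·diag(1/d)), yet the two solutions give different cosine similarities.

```python
import numpy as np

def cosine_matrix(E):
    """Pairwise cosine similarities between the rows of E."""
    unit = E / np.linalg.norm(E, axis=1, keepdims=True)
    return unit @ unit.T

rng = np.random.default_rng(0)

# Hypothetical factorization X_hat = A @ B.T with k latent dimensions.
n_users, n_items, k = 6, 5, 3
A = rng.normal(size=(n_users, k))   # user embeddings (illustrative)
B = rng.normal(size=(n_items, k))   # item embeddings (illustrative)

# Arbitrary positive rescaling of each latent dimension.
d = np.array([0.1, 1.0, 10.0])
A_scaled = A * d        # equivalent to A @ diag(d)
B_scaled = B / d        # equivalent to B @ diag(1/d)

# The model's predictions are identical for both solutions ...
same_predictions = np.allclose(A @ B.T, A_scaled @ B_scaled.T)

# ... but the item-item cosine similarities are not.
cos_before = cosine_matrix(B)
cos_after = cosine_matrix(B_scaled)
max_cosine_shift = np.abs(cos_before - cos_after).max()

print("predictions unchanged:", same_predictions)
print("largest change in an item-item cosine:", round(float(max_cosine_shift), 3))
```

Both solutions fit the data equally well, so nothing in the fit itself selects one set of cosine scores over the other. In this sketch, only a regularizer that acts on A and B separately would break the tie, which is the sense in which the regularization ends up controlling the similarities.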

The keeper is a methodological caution with teeth: equally valid embeddings of the same data can produce different "similarities" depending on regularization the practitioner never chose with similarity in mind, so a cosine score is not a stable, model-independent measure of semantic closeness. The paper outlines alternatives and urges against using cosine similarity blindly.

This sharpens the vault's embedding-geometry caveats. It is the regularization-dependence complement to Why can't cosine space retrievers distinguish word order? (geometry-dependence), and it underwrites the production-RAG warning in Do vector embeddings actually measure task relevance?: cosine over learned embeddings is doubly unreliable, because it aims at the wrong target (association) and is an unstable measure (regularization-controlled).

Inquiring lines that read this note 3

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

Why do semantic similarity and task relevance diverge in vector embeddings?

What limits mechanistic interpretability's ability to characterize models?

What makes regularization an implicit factor in embedding geometry?

Related concepts in this collection 3

Why can't cosine space retrievers distinguish word order? Dense retrievers using unit-sphere cosine spaces struggle to capture non-commutative linguistic structures like negation and role reversal. Understanding this geometric constraint explains why training fixes have limited reach in compositional retrieval.
geometry-dependence; this adds regularization-dependence
Do vector embeddings actually measure task relevance? Vector embeddings rank semantic similarity, but RAG systems need topical relevance. When these diverge—as with king/queen versus king/ruler—does similarity-based retrieval fail in production?
cosine over embeddings is wrong target and unstable measure
Why does dot product beat MLP-based similarity in practice? Neural Collaborative Filtering theory suggests MLPs should outperform dot products as universal approximators. But what explains the empirical gap, and what role do data scale and deployment constraints play?
adjacent caution on naive similarity functions over embeddings
