Is Cosine-Similarity of Embeddings Really About Similarity?
Cosine-similarity is the cosine of the angle between two vectors, or equivalently the dot product between their normalizations. A popular application is to quantify semantic similarity between high-dimensional objects by applying cosine-similarity to a learned low-dimensional feature embedding. This can work better but sometimes also worse than the unnormalized dot-product between embedded vectors in practice. To gain insight into this empirical observation, we study embeddings derived from regularized linear models, where closed-form solutions facilitate analytical insights. We derive analytically how cosine-similarity can yield arbitrary and therefore meaningless ‘similarities.’ For some linear models the similarities are not even unique, while for others they are implicitly controlled by the regularization. We discuss implications beyond linear models: a combination of different regularizations are employed when learning deep models; these have implicit and unintended effects when taking cosinesimilarities of the resulting embeddings, rendering results opaque and possibly arbitrary. Based on these insights, we caution against blindly using cosine-similarity and outline alternatives.
Introduction. Discrete entities are often embedded via a learned mapping to dense real-valued vectors in a variety of domains. For instance, words are embedded based on their surrounding context in a large language model (LLM), while recommender systems often learn an embedding of items (and users) based on how they are consumed by users. The benefits of such embeddings are manifold. In particular, they can be used directly as (frozen or fine-tuned) inputs to other models, and/or they can provide a data-driven notion of (semantic) similarity between entities that were previously atomic and discrete. While similarity in ’cosine similarity’ refers to the fact that larger values (as opposed to smaller values in distance metrics) indicate closer proximity, it has, however, also become a very popular measure of semantic similarity between the entities of interest, the motivation being that the norm of the learned embedding-vectors is not as important as the directional alignment between the embedding-vectors.
Discussion / Conclusion. It is common practice to use cosine-similarity between learned user and/or item embeddings as a measure of semantic similarity between these entities. We study cosine similarities in the context of linear matrix factorization models, which allow for analytical derivations, and show that cosine similarities are heavily dependent on the method and regularization technique, and in some cases can be rendered even meaningless. Our analytical derivations are complemented experimentally by qualitatively examining the output of these models applied simulated data where we have ground truth item-item similarity. Based on these insights, we caution against blindly using cosine-similarity, and proposed a couple of approaches to mitigate this issue.