Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini
We introduce Gemini Embedding 2, a native multimodal embedding model that embeds video, audio, image, and text inputs in a unified representation space. We leverage the multimodal capabilities of Gemini to produce embeddings for arbitrary combinations of interleaved inputs across all these modalities, and these embeddings generalize well across a wide variety of tasks. Applying large-scale contrastive learning in a multi-task, multi-stage training setup, we achieve state-of-the-art performance on key embedding benchmarks covering unimodal, cross-modal, and multimodal retrieval. Our model scores 62.9 R@1 on MSCOCO, 68.8 NDCG@10 on Vatex, 69.9 on MTEB multilingual, and 84.0 on MTEB Code, surpassing the performance of specialized models. These unified capabilities make Gemini Embedding 2 a promising candidate for downstream use cases such as retrieval-augmented generation (RAG), recommendation, and search. Furthermore, its robust zero-shot performance across distinct fields, from astronomy and bioscience to fine arts and the culinary arts, establishes it as a reliable out-of-the-box representation even for specialized domains.
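The abstract reports results as R@1 and NDCG@10. As a reminder of what these metrics measure, the following is a minimal sketch of their standard per-query definitions; it is not the evaluation code behind the reported numbers. The function names and the toy document IDs are hypothetical, and benchmark scores are typically the mean over all queries, multiplied by 100.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k=1):
    """Return 1.0 if any relevant item appears in the top-k results, else 0.0."""
    return 1.0 if any(doc in relevant_ids for doc in ranked_ids[:k]) else 0.0

def ndcg_at_k(ranked_ids, relevance, k=10):
    """NDCG@k for one query; `relevance` maps doc_id -> graded gain."""
    # Discounted cumulative gain of the system ranking (ranks start at 1).
    dcg = sum(relevance.get(doc, 0.0) / math.log2(rank + 2)
              for rank, doc in enumerate(ranked_ids[:k]))
    # Ideal DCG: the same gains sorted from highest to lowest.
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 2)
               for rank, gain in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0

# Toy query: the only relevant document, "d1", is ranked second.
ranking = ["d3", "d1", "d2"]
r_at_1 = recall_at_k(ranking, {"d1"}, k=1)
ndcg_10 = ndcg_at_k(ranking, {"d1": 1.0}, k=10)
```

R@1 credits a query only when the top-ranked result is correct, while NDCG@10 gives partial credit that decays logarithmically with rank, which is why the two metrics are not directly comparable across benchmarks.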
Introduction. Embedding models provide dense vector representations that capture semantic information crucial for a wide range of downstream tasks. As foundation models become natively multimodal and rapidly more capable, it is important that embedding models capture semantic information within and across all modalities in a coherent manner. Such general-purpose embedding models can also improve applications like video recommendation and document search, whose content is rich in information across modalities; because the modalities they contain are not inherently homogeneous, these applications benefit from representations that draw on semantic information from all of them. Existing multimodal embedding models such as CLIP [1], ALIGN [2], SigLIP 2 [3], and CoCa [4] embed heterogeneous modalities by training modality-specific encoders on paired cross-modal data so that their outputs share a unified vector space.
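To make the paired-data setup concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) objective over a batch of paired embeddings. It illustrates the general technique only: it is not the training objective of Gemini Embedding 2, which this section does not specify, and other models in this family use different losses (SigLIP, for example, uses a sigmoid loss). The function names and the temperature value of 0.07 are assumptions chosen for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(a_embs, b_embs, temperature=0.07):
    """a_embs[i] and b_embs[i] are a matched pair (e.g. image i, caption i)."""
    a = [l2_normalize(v) for v in a_embs]
    b = [l2_normalize(v) for v in b_embs]
    # Pairwise similarity matrix, sharpened by the (assumed) temperature.
    logits = [[dot(x, y) / temperature for y in b] for x in a]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the a->b and b->a retrieval directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy batch: correctly matched pairs versus the same pairs swapped.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = symmetric_contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing this loss pulls each matched pair together and pushes it away from the other items in the batch, which is how separately trained encoders end up sharing one vector space. In the toy batch, the aligned pairs yield a much lower loss than the swapped ones.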
Discussion / Conclusion. Gemini Embedding 2 represents a substantial step forward in general-purpose representation, delivering a state-of-the-art multimodal successor to our text-only Gemini Embedding model. It produces embeddings for arbitrary combinations of interleaved inputs across all modalities, including text, image, audio, and video, and these embeddings generalize well across a wide variety of tasks. By leveraging Gemini’s core multimodal, multilingual, and code-centric foundations, Gemini Embedding 2 achieves state-of-the-art performance on well-known embedding benchmarks such as MSCOCO, Vatex, and MMTEB, with a particularly significant gain in code retrieval. Our findings highlight its versatility, showing that it performs strongly not only on general tasks but also in specialized domains such as microscopy, astronomy, and the culinary arts.