Do multi-vector or cross-encoder models escape these dimensional constraints?
This reads as a follow-up to the proof that a single fixed-dimension embedding caps how many top-k document combinations it can ever return — and asks whether richer scoring schemes (multi-vector late interaction, cross-encoders) sidestep that ceiling.
This explores whether the mathematical retrieval limit — that any embedding of dimension d can only represent a bounded number of top-k document subsets — is a property of *embeddings* or specifically of *single-vector dot-product scoring*. The corpus makes clear it's the latter. The limit in Do embedding dimensions fundamentally limit retrievable document combinations? comes from communication-complexity theory applied to one geometry: query and document each collapse to a single point in d-space, and relevance is their inner product. The constraint binds even when the embeddings are optimized directly on the test set — so no amount of training rescues a single vector. What changes the result is changing the interaction, not the training.
That's exactly the escape multi-vector and cross-encoder models take. A cross-encoder never reduces a document to a precomputable point; it scores each query-document pair jointly, so its expressivity isn't capped by a fixed d. Multi-vector (late-interaction) models keep many vectors per item and match them token-to-token, trading the clean polynomial bound for a much larger effective representational budget. The corpus doesn't carry a ColBERT or cross-encoder paper directly, but it carries the cleanest analog of the same move in recommendation: How can user vectors capture diverse interests without exploding in size?. The Deep Interest Network argues a single fixed-length user vector is a lossy bottleneck for diverse interests, and fixes it with *candidate-conditional* attention — the user representation is recomputed against each candidate item rather than frozen ahead of time. That is the late-interaction principle under a different name: relevance becomes a function of the pair, not of two independently-pooled points.
The catch is the same one that makes the embedding limit attractive in the first place. Single-vector retrieval is cheap because you precompute every document once and run nearest-neighbor search over a static index. The moment scoring depends jointly on query and candidate — cross-encoder or candidate-conditional attention — you lose precomputation and must score combinatorially at query time. So these models don't repeal the dimensional ceiling so much as pay to leave the regime where it applies. In practice the field stacks them: a cheap single-vector retriever (subject to the limit) for first-stage recall, a cross-encoder reranker (not subject to it) for the final ordering.
Two adjacent notes are worth pulling in before you conclude that more expressive geometry is strictly better. How do language models encode syntactic relations geometrically? and Do embedding eigenvectors organize taxonomy from coarse to fine? show single embedding spaces already encode far more structure than a flat distance reading suggests — polar angle carries syntactic direction, leading eigenvectors recover a coarse-to-fine taxonomy — so 'add more vectors' isn't the only way to recover lost expressivity within a fixed budget. And Can models be smart without organized internal structure? is the cautionary doorway: a model can hit the same retrieval metric while its internal organization is fractured and brittle, so escaping the dimensional bound on a benchmark doesn't guarantee the richer scorer generalizes under distribution shift. The honest summary: yes, cross-encoders and multi-vector models escape the bound — because the bound was never about retrieval in general, only about the single-vector shortcut that makes retrieval fast.
Sources 5 notes
Communication complexity theory proves that for any embedding dimension d, there exists a maximum number of top-k document combinations that can be returned as results. Even embeddings optimized directly on test data hit this polynomial limit, demonstrated on trivially simple retrieval tasks.
Deep Interest Network weights historical behaviors against each candidate ad, activating only relevant interests dynamically. This preserves dimension efficiency while expressing diverse tastes without lossy compression.
The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.
Leading eigenvectors of embedding Gram matrices separate broad taxonomic branches first, then progressively finer sub-branches—a coarse-to-fine spectral order that tracks the WordNet hypernym tree level by level, confirming predictions from co-occurrence statistics.
Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.