INQUIRING LINE

Can single-vector embeddings capture non-commutative relationships like word order?

This explores whether a single fixed-length vector can encode order-dependent meaning — the difference between 'dog bit man' and 'man bit dog' — or whether that's a geometric impossibility baked into how embeddings work.


The corpus has a sharp, mostly discouraging answer, with an interesting caveat. The cleanest result is geometric: unit-sphere cosine spaces force concepts into linear superposition, and superposition is *commutative*. Adding 'dog' + 'bit' + 'man' lands in the same place no matter the order (see: Why can't cosine space retrievers distinguish word order?). So a pooled vector that lives on a sphere and is compared by cosine similarity is, by construction, blind to the very thing word order encodes. This isn't a training failure you can fix with more data. It's structural, and escaping it requires architectural moves such as token-level interaction (think late-interaction retrieval) or a downstream verification step, rather than a single pooled vector.
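The commutativity argument can be checked with a toy example. The sketch below assumes the simplest pooling scheme, a normalized sum of per-word vectors. The three-dimensional word vectors and the helper names `pool` and `cosine` are invented for illustration and do not come from any real model.

```python
import math

# Hypothetical toy word vectors (illustrative values, not from a real model).
WORD_VECTORS = {
    "dog": [0.9, 0.1, 0.0],
    "bit": [0.1, 0.8, 0.2],
    "man": [0.0, 0.2, 0.9],
}

def pool(sentence):
    """Sum the word vectors, then project the result onto the unit sphere."""
    total = [0.0, 0.0, 0.0]
    for word in sentence.split():
        for i, value in enumerate(WORD_VECTORS[word]):
            total[i] += value
    norm = math.sqrt(sum(x * x for x in total))
    return [x / norm for x in total]

def cosine(u, v):
    """Cosine similarity; pooled vectors are unit length, so a dot product suffices."""
    return sum(a * b for a, b in zip(u, v))

a = pool("dog bit man")
b = pool("man bit dog")

# Addition is commutative, so both sentences pool to the same point
# (up to floating-point rounding) and cosine cannot separate them.
similarity = cosine(a, b)
```

Under this assumption the two sentences are indistinguishable to any cosine comparison, whatever values the word vectors take.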

That connects to a quieter problem with what embeddings actually measure. They capture *semantic association*, meaning what co-occurs, not roles or relevance (see: Do vector embeddings actually measure task relevance?). 'Dog,' 'bit,' and 'man' are all strongly associated regardless of who did the biting, so an association-based vector has no native handle on subject versus object. A similar weakness shows up in LLMs on syntax: top-tier models such as Llama3-70b misidentify embedded clauses, verb phrases, and complex nominals, and the errors get *worse* as syntactic depth increases, which is what you'd expect if the system learned surface co-occurrence rather than grammatical structure (see: Why do large language models fail at complex linguistic tasks?).

The caveat, and the part you didn't know you wanted, is that the contextualized activations *inside* a transformer do better than the static pooled embedding at the door. The Polar Probe shows that models encode syntactic relations not just by distance but by *angle*: a polar-coordinate geometry in which direction carries the type and orientation of a relation, nearly doubling accuracy over distance-only methods (see: How do language models encode syntactic relations geometrically?). Direction is the trick that lets geometry become non-commutative: A→B is not B→A. So the limitation is less 'neural nets can't represent order' and more 'a single cosine-compared vector throws the directional information away when it collapses a sequence into one point.'
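A toy contrast shows why direction matters. The sketch below is not the Polar Probe itself. It only assumes that a relation between two token vectors can be read either as a distance or as the angle of the vector pointing from one token to the other. The two-dimensional vectors and the function names are hypothetical.

```python
import math

# Hypothetical 2-D token activations (illustrative values only).
dog = [1.0, 2.0]
bit = [3.0, 1.0]

def distance(u, v):
    """Euclidean distance: symmetric, so it cannot tell A->B from B->A."""
    return math.dist(u, v)

def direction(src, dst):
    """Angle in radians of the vector pointing from src to dst."""
    return math.atan2(dst[1] - src[1], dst[0] - src[0])

# Distance gives the same answer in both directions.
same_distance = math.isclose(distance(dog, bit), distance(bit, dog))

# The angle flips by half a turn when the roles are swapped,
# so the orientation of the relation is recoverable.
angle_forward = direction(dog, bit)
angle_backward = direction(bit, dog)
angle_gap = abs(angle_forward - angle_backward)
```

Here the distance reading treats both orderings alike, while the angular reading separates them by π radians.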

There's a representational-substrate angle too. Even static embeddings, before attention runs, are richer than 'just a lookup': they carry valence, concreteness, and other lexical content (see: Do transformer static embeddings actually encode semantic meaning?). Networks can also spontaneously carve compositional tasks into modular subnetworks that handle pieces independently (see: Do neural networks naturally learn modular compositional structure?). That suggests the machinery to represent structured, order-sensitive composition exists; it's the final step of pooling into one vector and comparing by cosine that destroys it.

The practical upshot for anyone building retrieval or search: if your queries hinge on order, negation, or who-did-what-to-whom, a single dense embedding will quietly conflate opposites, and no amount of fine-tuning fixes the geometry. The corpus points you toward multi-vector / token-level interaction or a verification pass on top — the same conclusion the cosine-space note reaches from first principles.
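One way to picture the verification pass is a two-stage ranker. The sketch below assumes bag-of-words count vectors as a stand-in for a pooled dense embedding, and an ordered-bigram overlap as a stand-in verifier. Both are illustrative substitutes chosen for brevity, not methods described in the notes, and all function names are hypothetical.

```python
import math
from collections import Counter

def pooled_score(query, doc):
    """Stage 1: cosine between bag-of-words count vectors (order-blind)."""
    q, d = Counter(query.split()), Counter(doc.split())
    dot = sum(q[w] * d[w] for w in q)
    q_norm = math.sqrt(sum(c * c for c in q.values()))
    d_norm = math.sqrt(sum(c * c for c in d.values()))
    return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0

def bigrams(text):
    """Ordered adjacent word pairs, which do depend on word order."""
    words = text.split()
    return set(zip(words, words[1:]))

def verify_score(query, doc):
    """Stage 2: Jaccard overlap of ordered bigrams (a stand-in verifier)."""
    q, d = bigrams(query), bigrams(doc)
    union = q | d
    return len(q & d) / len(union) if union else 0.0

query = "dog bit man"
docs = ["man bit dog", "dog bit man"]

# Stage 1 cannot separate the candidates: both contain the same words.
stage1 = [pooled_score(query, doc) for doc in docs]

# Stage 2 reranks with an order-sensitive check.
reranked = sorted(docs, key=lambda doc: verify_score(query, doc), reverse=True)
```

In a real system the second stage would more plausibly be a cross-encoder or a token-level scorer over contextualized vectors; the bigram check just makes the order sensitivity visible.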


Sources (6 notes)

Why can't cosine space retrievers distinguish word order?

Unit-sphere cosine spaces force concepts into linear superposition, a commutative structure that cannot robustly represent non-commutative distinctions like "dog bit man" versus "man bit dog." This geometric constraint persists regardless of training procedure and requires architectural alternatives like token-level interaction or downstream verification.

Do vector embeddings actually measure task relevance?

Embeddings encode co-occurrence patterns, making semantically close but role-distinct concepts highly similar. This works in simple demos but fails in production where underspecified queries have many wrong-but-associated candidates.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

How do language models encode syntactic relations geometrically?

The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.

Do transformer static embeddings actually encode semantic meaning?

Clustering analysis of RoBERTa embeddings reveals sensitivity to five psycholinguistic measures including valence, concreteness, iconicity, and taboo. This demonstrates that static embeddings function as genuine lexical entries containing semantic content before self-attention operates.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tasked with re-evaluating whether single-vector embeddings can capture non-commutative relationships like word order — a question that remains fundamentally open despite recent progress.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat these as perishable constraints to be re-tested:
• Unit-sphere cosine spaces are geometrically hostile to non-commutative structure: superposition is commutative, so 'dog bit man' and 'man bit dog' map to the same vector (geometric necessity, not training failure).
• Vector embeddings measure semantic association, not task-relevant roles or syntax; errors on embedded clauses worsen predictably with structural depth, implying surface co-occurrence learning (~2025).
• Transformer activations *inside* models encode syntax via polar-coordinate geometry—direction (not just distance) carries relation type and orientation, nearly doubling probe accuracy (~2024–2025).
• Static embeddings carry rich lexical content (valence, concreteness); networks spontaneously modularize compositional tasks (~2023). Final pooling + cosine comparison destroys directional info.
• Retrieval performance degrades under compositional shifts; dense embeddings conflate opposites on negation and role-reversal; multi-vector or verification passes are structural fixes (~2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2301.10884 (2023): Structural compositionality in neural networks.
• arXiv:2412.05571 (2024): Polar coordinate system for syntax in LLMs.
• arXiv:2503.19260 (2025): Linguistic blind spots of large language models.
• arXiv:2508.21038 (2025): Theoretical limitations of embedding-based retrieval.

Your task:
(1) RE-TEST EACH CONSTRAINT. For the cosine-space geometry claim, polar encoding, and association-vs-syntax gap: has newer model scaling, architectural variants (e.g., learned pooling, multi-head retrieval, continuous attention), or training regimes (contrastive objectives targeting syntax) since late 2025 relaxed or overturned any of these? Separate the durable question (order-sensitivity in retrieval/search) from perishable limitations (e.g., 'standard pooling fails'—but learned pooling may not).
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Any papers showing single vectors *can* capture order-dependent meaning at scale, or showing polar geometry is insufficient?
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) Under what training objective or architectural constraint do single embeddings *acquire* order-sensitivity? (b) Do foundation models trained post-2025 exhibit different cosine-space limitations than the 2023–2025 cohort?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
