INQUIRING LINE


Most AI text embeddings treat 'dog bites man' and 'man bites dog' as nearly identical — the geometry is why.

Why do unit-sphere spaces fail at distinguishing word order and negation?

This explores why embedding models that squeeze meaning onto a unit sphere (where similarity is just the angle between vectors) struggle with things like word order and negation — and what the corpus says about the geometry behind that failure.

Concretely, such models can't reliably tell "dog bit man" from "man bit dog," or "is" from "is not." The short version from the corpus: the geometry itself is the problem, not the training. Cosine space forces concepts into a *linear superposition*: you essentially add the pieces of meaning together. But addition is commutative (A+B equals B+A), while word order and negation are not. Swapping subject and object, or inserting a 'not,' should flip the meaning, yet the sphere has no clean way to represent an operation whose result depends on order. So the distinction gets smeared out (see "Why can't cosine space retrievers distinguish word order?").
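The commutativity argument can be made concrete with a minimal sketch. It assumes the simplest possible encoder, one that sums word vectors and normalizes the result onto the unit sphere. That is a deliberate simplification: real embedding models use contextual encoders that are not strictly order-invariant, so this shows the limiting case the note describes rather than any particular model. The word vectors and the `embed` and `cosine` helpers are made up for illustration.

```python
import math

# Toy word vectors; the values are invented for illustration only.
WORD_VECTORS = {
    "dog": [0.9, 0.1, 0.0, 0.2],
    "bit": [0.1, 0.8, 0.3, 0.0],
    "man": [0.2, 0.1, 0.9, 0.1],
    "not": [0.0, 0.2, 0.1, 0.3],
}


def embed(sentence):
    """Sum the word vectors, then project the sum onto the unit sphere."""
    total = [0.0] * 4
    for word in sentence.split():
        total = [t + w for t, w in zip(total, WORD_VECTORS[word])]
    norm = math.sqrt(sum(t * t for t in total))
    return [t / norm for t in total]


def cosine(u, v):
    """Cosine similarity; for unit vectors this is just the dot product."""
    return sum(a * b for a, b in zip(u, v))


# Swapping subject and object only reorders the terms of the sum.
order_sim = cosine(embed("dog bit man"), embed("man bit dog"))

# Inserting 'not' adds one more term, which nudges the angle slightly.
negation_sim = cosine(embed("dog bit man"), embed("dog not bit man"))

print(f"swap subject/object: {order_sim:.4f}")
print(f"insert 'not':        {negation_sim:.4f}")
```

Under these assumptions the swapped sentences land on the same point of the sphere (up to floating-point rounding), and the negated sentence stays within a small angle of the original. Both meaning changes are nearly invisible to the similarity score.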

What makes this more than a complaint is the contrast with what richer geometries *can* do. The Polar Probe work shows that inside an LLM's activations, syntactic relations are encoded using *both* distance and angular position, capturing type and direction at once, and that adding the angular dimension nearly doubles accuracy over distance-only methods (see "How do language models encode syntactic relations geometrically?"). That's the tell: when you give the representation a way to encode *direction*, asymmetric relations like 'who did what to whom' become expressible. A flat unit-sphere similarity score throws exactly that away, which is why the cosine-space note prescribes architectural fixes (token-level interaction or downstream verification) rather than more training.
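A toy sketch shows why distance alone cannot carry direction. This is not the Polar Probe's actual method. It only illustrates the underlying asymmetry: the distance between two points is the same in both directions, while the offset between them flips sign when the roles are swapped. The 2-D points, the probe `axis`, and the helper names are hypothetical.

```python
import math


def distance(u, v):
    """Euclidean distance; symmetric in its two arguments."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def signed_direction(head, dependent, axis):
    """Cosine between the (dependent - head) offset and a probe axis.

    Swapping head and dependent negates the offset, so the sign flips.
    """
    offset = [d - h for h, d in zip(head, dependent)]
    dot = sum(o * a for o, a in zip(offset, axis))
    norm_offset = math.sqrt(sum(o * o for o in offset))
    norm_axis = math.sqrt(sum(a * a for a in axis))
    return dot / (norm_offset * norm_axis)


# Hypothetical 2-D activations for two words, plus a hypothetical probe axis.
bit = [0.0, 0.0]
dog = [1.0, 2.0]
axis = [0.0, 1.0]

d_forward = distance(bit, dog)
d_backward = distance(dog, bit)
a_forward = signed_direction(bit, dog, axis)
a_backward = signed_direction(dog, bit, axis)

print(f"distance: {d_forward:.3f} vs {d_backward:.3f}")
print(f"direction: {a_forward:+.3f} vs {a_backward:+.3f}")
```

The two distances match, so a distance-only readout cannot say which word governs which. The signed direction differs between the two orderings, which is the kind of extra degree of freedom the angular component provides.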

The corpus also suggests the failure isn't confined to embeddings; it echoes a broader pattern where models handle surface statistics but not structure. LLMs systematically misparse embedded clauses and complex nominals, and the errors get worse in a predictable way as syntactic depth grows (see "Why do large language models fail at complex linguistic tasks?"). The same fault line shows up with negation-adjacent reasoning: models accept false presuppositions even when they demonstrably know the right answer, accommodating a buried wrong assumption rather than rejecting it (see "Why do language models accept false assumptions they know are wrong?"). Negation is precisely the kind of structural operator a commutative bag-of-meaning glosses over.

There's a deeper why underneath all of this. Several notes argue LLMs reason by *semantic association* rather than symbolic manipulation: when you strip the familiar semantics out of a task, performance collapses even with the correct rules sitting in context (see "Do large language models reason symbolically or semantically?"). Word order and negation are structural, symbolic operations, so a system leaning on association over composition is poorly equipped for them. You can even predict where this breaks: framing the model as an autoregressive probability machine correctly anticipates that tasks with low-probability targets will be hard even when they are structurally simple, such as reversing a sequence or counting (see "Can we predict where language models will fail?"). Reversal is the order problem in miniature.

The thing you didn't know you wanted to know: the fix people reach for isn't 'train harder' but 'change the shape of the space.' Whether it's polar coordinates that carry direction (see "How do language models encode syntactic relations geometrically?") or reasoning lifted up to the sentence level in a structured embedding space (see "Can reasoning happen at the sentence level instead of tokens?"), the move is the same: give meaning more geometric room than a single angle on a sphere, so that non-commutative distinctions have somewhere to live.

Sources 7 notes

Why can't cosine space retrievers distinguish word order?

Unit-sphere cosine spaces force concepts into linear superposition, a commutative structure that cannot robustly represent non-commutative distinctions like "dog bit man" versus "man bit dog." This geometric constraint persists regardless of training procedure and requires architectural alternatives like token-level interaction or downstream verification.

How do language models encode syntactic relations geometrically?

The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Why do language models accept false assumptions they know are wrong?

The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.


Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Can reasoning happen at the sentence level instead of tokens?

Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey (4.27 match)
Large Language Models are In-Context Semantic Reasoners rather than Symbolic Reasoners (1.73 match)
Probing Structured Semantics Understanding and Generation of Language Models via Question Answering (1.73 match)
Large Language Model Reasoning Failures (1.71 match)
A polar coordinate system represents syntax in large language models (1.67 match)
Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds (1.67 match)
Hierarchical Concept Geometry in Language Models Emerges from Word Co-occurrence (1.65 match)
Semantic Structure in Large Language Model Embeddings (1.58 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: Why do unit-sphere embedding spaces fail to distinguish word order and negation? A curated library of LLM/embedding research (2023–2026) found the following, with approximate dates (these are dated claims, not current truth):

• Cosine similarity on unit spheres forces linear superposition (commutative addition), making asymmetric relations like word order and negation geometrically inexpressible (~2023–2024).
• Inside LLM activations, a polar coordinate system encoding both distance AND angular direction nearly doubles accuracy on syntactic relations over distance-alone methods (~2024).
• LLMs systematically misparse embedded clauses and negation-adjacent reasoning, errors worsening predictably with syntactic depth; models fail to reject false presuppositions even when knowledge is present (~2025).
• Models reason by semantic association rather than symbolic manipulation; performance collapses when familiar semantics are stripped, even with correct rules in context (~2023).
• Structured embeddings at sentence level or reasoning in continuous latent spaces show promise as fixes; the pattern is: richer geometry → richer expressiveness (~2024–2025).

Anchor papers (verify; mind their dates): arXiv:2305.14825 (May 2023, semantic vs. symbolic), arXiv:2412.05571 (Dec 2024, polar coordinates in activations), arXiv:2503.19260 (Mar 2025, linguistic blind spots), arXiv:2602.06176 (Feb 2026, reasoning failures).

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, judge whether newer models (o1, o3, Claude 3.5+, Llama 4), architectural innovations (mixture-of-experts, structured latent spaces), training methods (constitutional AI, mechanistic alignment), or evaluation suites have since RELAXED or OVERTURNED it. Separate the durable question (likely still open) from the perishable limitation (possibly resolved); cite what resolved it, plainly flag where it still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — especially any showing that unit-sphere or cosine-only methods now handle word order / negation better than the library suggests.
(3) Propose 2 research questions that ASSUME the representational regime may have shifted (e.g., via scaling, new training objectives, or deployment in agentic loops).

Cite arXiv IDs; flag anything you cannot ground in a real paper.
