SYNTHESIS NOTE

Do language models use the hierarchical geometry they inherit?

Word2vec and Gemma share the same hierarchical spectral signature despite vastly different architectures and purposes. This suggests shared statistical origins, but leaves open whether the LLM actually recruits this structure for reasoning or simply inherits unused geometry.

Synthesis note · 2026-05-28 · sourced from MechInterp

The decisive move in the co-occurrence account of concept geometry is a cross-architecture comparison. The hierarchical splitting geometry is first derived and confirmed for word2vec embeddings across many WordNet subtrees. Then the same coarse-to-fine spectral signature is shown to extend "strikingly well" to Gemma 2B unembeddings. Two systems with entirely different objectives and training regimes — a shallow predict-context embedding and a large autoregressive transformer's output matrix — carry the same hierarchical fingerprint. If the structure were a functional artifact of how an LLM reasons, it should not appear, in the same form, in a model that does not reason at all.
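One way to make a comparison like this concrete is to work with each model's word-by-word Gram matrix, which has the same shape whatever the embedding dimension. The sketch below is only an illustration: the two tiny "models", their vectors, and the four-word vocabulary are hypothetical, and power iteration in plain Python stands in for a proper eigensolver. It is not the procedure used in the source.

```python
import math

def centered_gram(vectors):
    """Word-by-word Gram matrix after subtracting the mean vector."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[k] for v in vectors) / n for k in range(dim)]
    centered = [[v[k] - mean[k] for k in range(dim)] for v in vectors]
    return [[sum(a * b for a, b in zip(u, v)) for v in centered] for u in centered]

def top_eigenvector(matrix, steps=100):
    """Power iteration from a fixed start vector, so the sketch is deterministic."""
    n = len(matrix)
    vec = [1.0] + [0.0] * (n - 1)
    for _ in range(steps):
        vec = [sum(matrix[r][c] * vec[c] for c in range(n)) for r in range(n)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
    return vec

def coarse_split(vectors):
    """Partition of word indices by the sign of the leading eigenvector."""
    vec = top_eigenvector(centered_gram(vectors))
    positive = frozenset(i for i, x in enumerate(vec) if x > 0)
    negative = frozenset(i for i, x in enumerate(vec) if x <= 0)
    return {positive, negative}  # a set, so the arbitrary global sign drops out

# Hypothetical vocabulary and toy vectors (not real word2vec or Gemma values).
words = ["dog", "cat", "hammer", "saw"]
model_a = [[2.0, 1.0], [2.0, -1.0], [-2.0, 1.0], [-2.0, -1.0]]   # 2-dimensional
model_b = [[1.0, 3.0, 0.5], [1.0, 3.0, -0.5],
           [-1.0, -3.0, 0.5], [-1.0, -3.0, -0.5]]                # 3-dimensional

same_signature = coarse_split(model_a) == coarse_split(model_b)
```

In this toy setup both models separate the animals from the tools on their leading eigenvector, even though their embedding spaces have different dimensions. A real analysis would use actual embedding matrices and a linear algebra library.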

This is the strongest available argument that the geometry is statistical, not functional: a shared signature across architectures points to a shared cause upstream of both — the co-occurrence statistics of the training text — rather than convergent functional design. Each word is characterized by discrete, continuous, and hierarchical attributes; words with similar attributes co-occur more often; and that alone gives rise to the geometric organization. Both models inherit it because both are, in different ways, fitting the same pairwise statistics.
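The claim that shared attributes alone can produce a coarse-to-fine spectrum can be checked on a toy case. The sketch below assumes eight words at the leaves of a balanced binary taxonomy and uses the number of shared taxonomy nodes as a stand-in for pairwise co-occurrence strength. Both choices are illustrative assumptions, not the authors' model. It then verifies exactly that the split vectors are eigenvectors and reads off their eigenvalues.

```python
# Toy model (an assumption for illustration): 8 words at the leaves of a
# balanced binary taxonomy of depth 3, indexed 0..7 by their 3-bit path.
DEPTH = 3
N = 2 ** DEPTH

def shared_attributes(i, j):
    """Count taxonomy nodes (root included) that contain both leaf i and leaf j."""
    count = 1  # the root contains every word
    for level in range(1, DEPTH + 1):
        shift = DEPTH - level
        if (i >> shift) == (j >> shift):  # same ancestor at this depth
            count += 1
    return count

# Stand-in for pairwise co-occurrence strength between words.
S = [[shared_attributes(i, j) for j in range(N)] for i in range(N)]

def split_vector(level):
    """+1 on the left half, -1 on the right half of the first subtree at this depth."""
    size = N >> level      # number of leaves under a node at this depth
    half = size // 2
    return [1 if k < half else -1 if k < size else 0 for k in range(N)]

def eigenvalue_if_eigenvector(M, v):
    """Return lam if M v == lam v holds exactly, otherwise None."""
    Mv = [sum(M[r][c] * v[c] for c in range(N)) for r in range(N)]
    pivot = next(k for k in range(N) if v[k] != 0)
    lam = Mv[pivot] / v[pivot]
    return lam if all(Mv[k] == lam * v[k] for k in range(N)) else None

spectrum = {"all words": eigenvalue_if_eigenvector(S, [1] * N)}
for level in range(DEPTH):
    spectrum[f"split at depth {level}"] = eigenvalue_if_eigenvector(S, split_vector(level))
```

The eigenvalues come out as 15, 7, 3, and 1: the constant direction first, then the coarsest split, then progressively finer splits. Nothing in the construction refers to an architecture or a training objective, only to which attributes words share.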

Why it leaves a question open: the authors are explicit that such organization may be useful for function but is not driven by it, which leaves unresolved whether and where the LLM actually uses the hierarchical geometry it inherits. Shared structure is evidence of a common statistical origin; it does not show that the structure is inert in the transformer. Disentangling inherited-but-unused geometry from inherited-and-recruited geometry is the open problem this result sharpens rather than settles.
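The difference between geometry that is present and geometry that is used can be illustrated with a direction-ablation check: remove the component of a representation along a direction and see whether a downstream readout changes. This is a generic interpretability technique offered as a sketch, not a method from the source. The direction, the representation, and both sets of readout weights below are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_out(x, direction):
    """Remove the component of x that lies along the given direction."""
    coeff = dot(x, direction) / dot(direction, direction)
    return [xi - coeff * di for xi, di in zip(x, direction)]

def ablation_effect(weights, x, direction):
    """Absolute change in a linear readout when the direction is ablated."""
    return abs(dot(weights, x) - dot(weights, project_out(x, direction)))

# Hypothetical values: a "hierarchy" direction and one representation
# that has a clear component along it.
hierarchy_dir = [1.0, 0.0, 0.0]
x = [2.0, 0.5, -1.0]

# Two hypothetical downstream readouts over the same representation.
w_ignores = [0.0, 1.0, 3.0]  # orthogonal to the direction: geometry is unused
w_uses = [2.0, 1.0, 0.0]     # reads along the direction: geometry is recruited

component = dot(x, hierarchy_dir)                       # detectable in both cases
unused_effect = ablation_effect(w_ignores, x, hierarchy_dir)
recruited_effect = ablation_effect(w_uses, x, hierarchy_dir)
```

The component along the direction is the same in both cases, so a probe would detect the structure either way. Only the second readout changes when the direction is removed, which is the distinction a spectral match alone cannot settle.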

Inquiring lines that read this note 4

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Why do semantic similarity and task relevance diverge in vector embeddings?

What factors beyond surface content determine how readers extract meaning differently?

What spectral signatures distinguish hierarchy-driven geometry from corpus-driven geometry?

Does recurrence enable reasoning capabilities that fixed-depth transformers cannot achieve?

Does Gemma's transformer explicitly exploit the inherited hierarchical geometry?

Related concepts in this collection 3

Where does hierarchical structure in language models come from? Do LLMs build hierarchical concept geometry through dedicated mechanisms, or does it emerge naturally from word co-occurrence patterns in training data? Understanding the source matters for interpreting what representations actually reveal about model computation.
the cross-architecture match is the evidence for the structure-without-mechanism claim
Do embedding eigenvectors organize taxonomy from coarse to fine? Can we predict how embeddings encode taxonomic hierarchies by examining their spectral structure? This tests whether word co-occurrence statistics alone produce the observed hierarchical geometry in language models.
the specific signature shown to be shared between word2vec and Gemma
Do standard analysis methods hide nonlinear features in neural networks? Current representation analysis tools like PCA and linear probing may systematically miss complex nonlinear computations while over-reporting simple linear features. This raises questions about whether our interpretability methods are actually capturing what networks compute.
sharpens the open question — detected structure need not be the structure the model computes with

Original note title

word2vec and gemma unembeddings share the same hierarchical signature so structure is statistical not functional
