Do language models use the hierarchical geometry they inherit?
Word2vec and Gemma share the same hierarchical spectral signature despite vastly different architectures and purposes. This suggests shared statistical origins, but leaves open whether the LLM actually recruits this structure for reasoning or simply inherits unused geometry.
The decisive move in the co-occurrence account of concept geometry is a cross-architecture comparison. The hierarchical splitting geometry is first derived and confirmed for word2vec embeddings across many WordNet subtrees. The same coarse-to-fine spectral signature, in which leading eigenvectors separate the broadest branches of a taxonomy and later ones separate progressively finer branches, is then shown to extend "strikingly well" to Gemma 2B unembeddings. Two systems with entirely different objectives and training regimes (a shallow embedding trained to predict context words and the output matrix of a large autoregressive transformer) carry the same hierarchical fingerprint. If the structure were a functional artifact of how an LLM reasons, it should not appear, in the same form, in a model that does not reason at all.
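To make "coarse-to-fine spectral signature" concrete, here is a minimal sketch on a toy hierarchy. It is an illustration under stated assumptions, not the authors' pipeline: the eight "words" are leaves of a balanced binary tree, and the similarity matrix (a hypothetical stand-in for co-occurrence-derived similarity) adds one unit of weight per ancestor level two leaves share. The names `hierarchical_similarity`, `depth`, and `weights` are invented for this example.

```python
import numpy as np

def hierarchical_similarity(depth=3, weights=None):
    """Similarity matrix for the leaves of a balanced binary tree.

    S[i, j] sums one weight per ancestor level shared by leaves i and j,
    so leaves under the same fine-grained parent are the most similar.
    This is a toy assumption, not real co-occurrence data.
    """
    n = 2 ** depth
    if weights is None:
        weights = np.ones(depth + 1)
    S = np.zeros((n, n))
    for level in range(depth + 1):
        block = n // (2 ** level)        # leaves per subtree at this level
        groups = np.arange(n) // block   # subtree id of each leaf
        S += weights[level] * (groups[:, None] == groups[None, :])
    return S

S = hierarchical_similarity(depth=3)

# eigh returns eigenvalues in ascending order; flip to largest-first.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print("eigenvalues:", np.round(eigvals, 6))
# The second eigenvector separates the two top-level branches.
print("top split:  ", np.sign(np.round(eigvecs[:, 1], 6)).astype(int))
```

With unit weights the eigenvalues are 15, 7, 3, 3, 1, 1, 1, 1. The largest belongs to the constant vector, the next to the split between the two halves of the tree, the repeated 3 to splits inside each half, and the repeated 1 to splits between sibling leaves. Larger eigenvalues therefore correspond to coarser distinctions, which is the ordering the note refers to.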
This is the strongest available argument that the geometry is statistical, not functional: a shared signature across architectures points to a shared cause upstream of both — the co-occurrence statistics of the training text — rather than convergent functional design. Each word is characterized by discrete, continuous, and hierarchical attributes; words with similar attributes co-occur more often; and that alone gives rise to the geometric organization. Both models inherit it because both are, in different ways, fitting the same pairwise statistics.
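The "shared cause upstream" argument can be sketched in the same toy setting. The example below assumes two hypothetical "models" that are just different factorizations of one similarity matrix: a rank-4 spectral factorization and a full-rank Cholesky factor in a randomly rotated basis. Neither is word2vec or Gemma; the point is only that embeddings fit to the same pairwise statistics share leading Gram-matrix eigenvectors even when their coordinates and dimensions differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pairwise statistics: 8 words on a balanced binary tree, one unit of
# similarity per shared ancestor level (hypothetical stand-in for co-occurrence).
n = 8
S = sum(np.kron(np.eye(n // b), np.ones((b, b))) for b in (8, 4, 2, 1))

def top_directions(E, k=2):
    """Leading k eigenvectors of the Gram matrix E @ E.T, largest first."""
    vals, vecs = np.linalg.eigh(E @ E.T)
    return vecs[:, ::-1][:, :k]

# "Model A": rank-4 spectral factorization of S (4-dimensional embeddings).
vals, vecs = np.linalg.eigh(S)
A = vecs[:, -4:] * np.sqrt(vals[-4:])

# "Model B": full-rank Cholesky factor of S in a randomly rotated basis
# (8-dimensional embeddings whose coordinates look nothing like A's).
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
B = np.linalg.cholesky(S) @ Q

# Absolute cosine between matching leading directions (sign is arbitrary).
overlap = np.abs(np.sum(top_directions(A) * top_directions(B), axis=0))
print(np.round(overlap, 6))
```

Both overlaps should come out as 1 up to floating-point error. Only the top two directions are compared because the next eigenvalue is repeated, so individual eigenvectors within that eigenspace are not uniquely defined.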
Why it leaves a question open: the authors are explicit that such organization may be useful for function but is not driven by it. That leaves unresolved whether, and where, the LLM actually uses the hierarchical geometry it inherits. Shared structure is evidence of a common statistical origin; it does not show that the structure is inert in the transformer. Disentangling inherited-but-unused geometry from inherited-and-recruited geometry is the open problem this result sharpens rather than settles.
Inquiring lines that use this note as a source
- Why do leading embedding eigenvectors align with WordNet taxonomy structure?
- What spectral signatures distinguish hierarchy-driven geometry from corpus-driven geometry?
- Does the same spectral signature appear across different embedding models?
- Does Gemma's transformer explicitly exploit the inherited hierarchical geometry?
Related concepts in this collection
- Where does hierarchical structure in language models come from?
  Do LLMs build hierarchical concept geometry through dedicated mechanisms, or does it emerge naturally from word co-occurrence patterns in training data? Understanding the source matters for interpreting what representations actually reveal about model computation.
  the cross-architecture match is the evidence for the structure-without-mechanism claim
- Do embedding eigenvectors organize taxonomy from coarse to fine?
  Can we predict how embeddings encode taxonomic hierarchies by examining their spectral structure? This tests whether word co-occurrence statistics alone produce the observed hierarchical geometry in language models.
  the specific signature shown to be shared between word2vec and Gemma
- Do standard analysis methods hide nonlinear features in neural networks?
  Current representation analysis tools like PCA and linear probing may systematically miss complex nonlinear computations while over-reporting simple linear features. This raises questions about whether our interpretability methods are actually capturing what networks compute.
  sharpens the open question: detected structure need not be the structure the model computes with
Related papers in this collection
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Hierarchical Concept Geometry in Language Models Emerges from Word Co-occurrence
- LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
- Break It Down: Evidence for Structural Compositionality in Neural Networks
- DataComp-LM: In search of the next generation of training sets for language models
- Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
- From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities
- Pixels, Patterns, but No Poetry: To See The World like Humans
- Semantic Structure in Large Language Model Embeddings
Original note title
word2vec and gemma unembeddings share the same hierarchical signature so structure is statistical not functional