What spectral signatures distinguish hierarchy-driven geometry from corpus-driven geometry?
Can you tell, from the spectral fingerprint of an embedding space, whether its hierarchy was built in by a dedicated mechanism or simply fell out of raw text statistics? The notes collected here give a surprising answer: the distinction mostly collapses. The spectral signature that *looks* hierarchy-driven turns out to be corpus-driven all the way down.
The clearest tell is the coarse-to-fine eigenvector order. When you take the Gram matrix of a set of embeddings and look at its leading eigenvectors, they split the vocabulary along broad taxonomic branches first, then along progressively finer sub-branches, tracking the WordNet hypernym tree level by level (see: Do embedding eigenvectors organize taxonomy from coarse to fine?). That layered spectral order is exactly what co-occurrence statistics alone predict, with no hierarchy-specific machinery required. So the signature isn't evidence of a dedicated hierarchy mechanism; it's evidence of how often words show up near each other (see: Where does hierarchical structure in language models come from?).
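The coarse-to-fine order can be seen in a toy setting. The sketch below is not taken from the cited notes: the eight-word vocabulary, the two-level tree, the `scales` weights, and the helper names `toy_embeddings` and `leading_eigenvectors` are all assumptions made for illustration. It builds embeddings in which words share a component with their broad branch and a smaller one with their sub-branch, then eigendecomposes the centered Gram matrix.

```python
import numpy as np

# Hypothetical 8-word vocabulary: animals vs. plants, then four sub-branches.
WORDS = ["dog", "cat", "eagle", "sparrow", "oak", "pine", "rose", "tulip"]

def toy_embeddings(scales=(1.0, 0.7, 0.4)):
    """Each word is the sum of one-hot codes for its branch, sub-branch and itself.

    `scales` is an assumption: broader levels get larger weights, standing in
    for broader categories sharing more of their co-occurrence statistics.
    """
    n = len(WORDS)
    X = np.zeros((n, 2 + 4 + n))
    for i in range(n):
        X[i, i // 4] = scales[0]        # broad branch: animal / plant
        X[i, 2 + i // 2] = scales[1]    # sub-branch: mammal / bird / tree / flower
        X[i, 6 + i] = scales[2]         # word-specific component
    return X

def leading_eigenvectors(X, k):
    """Top-k eigenpairs of the centered word-by-word Gram matrix."""
    Xc = X - X.mean(axis=0)                  # remove the mean embedding
    vals, vecs = np.linalg.eigh(Xc @ Xc.T)   # eigh returns ascending eigenvalues
    order = np.argsort(vals)[::-1][:k]       # reorder to descending, keep k
    return vals[order], vecs[:, order]

if __name__ == "__main__":
    vals, vecs = leading_eigenvectors(toy_embeddings(), k=3)
    for j in range(3):
        row = [f"{w}:{v:+.2f}" for w, v in zip(WORDS, vecs[:, j])]
        print(f"eigenvalue {vals[j]:.2f}", row)
```

In this toy, the leading eigenvector takes one sign on the animal words and the opposite sign on the plant words, and the next two are constant within each sub-branch pair. Those two share an eigenvalue, so the exact basis returned for them is arbitrary; only the subspace they span is meaningful.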
The knockout argument comes from comparing models that should have nothing in common. Word2vec embeddings and Gemma 2B unembeddings, trained with entirely different objectives, carry *identical* coarse-to-fine spectral signatures across WordNet taxonomies (see: Do language models use the hierarchical geometry they inherit?). Because their objectives differ, so do their functional needs; if the geometry were driven by such needs, two systems this different would not be expected to converge on the same eigenstructure. The shared fingerprint is best explained by the one thing they share: the statistics of training text. In other words, when you spectrally decompose either model, you're reading the corpus, not the architecture.
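One way to make "same spectral signature" measurable is to compare the leading eigenvectors of each model's word-by-word Gram matrix, since those vectors are indexed by vocabulary and so are comparable even when embedding dimensions differ. The sketch below is an assumption-laden illustration, not the procedure used in the cited notes: `toy_model` is a hypothetical stand-in for a trained model, and `subspace_overlap` is simply the mean squared cosine of the principal angles between two subspaces.

```python
import numpy as np

def word_gram_eigvecs(X, k):
    """Top-k eigenvectors of the centered word-by-word Gram matrix of X (words x dims)."""
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(Xc @ Xc.T)
    return vecs[:, np.argsort(vals)[::-1][:k]]

def subspace_overlap(U, V):
    """Mean squared cosine of principal angles between two k-dim subspaces (1.0 = identical)."""
    return float(np.linalg.norm(U.T @ V, "fro") ** 2 / U.shape[1])

def toy_model(scales, dim, seed):
    """Hypothetical 'model': a two-level tree code for 8 words, rotated into
    `dim` dimensions with a little noise. Not a real trained embedding."""
    rng = np.random.default_rng(seed)
    base = np.zeros((8, 14))
    for i in range(8):
        base[i, i // 4] = scales[0]        # broad branch
        base[i, 2 + i // 2] = scales[1]    # sub-branch
        base[i, 6 + i] = scales[2]         # word-specific component
    Q, _ = np.linalg.qr(rng.standard_normal((dim, 14)))  # dim x 14, orthonormal columns
    return base @ Q.T + 1e-3 * rng.standard_normal((8, dim))

# Two "models" with different dimensions, level weights and random rotations,
# but the same underlying tree statistics.
model_a = toy_model((1.0, 0.7, 0.4), dim=32, seed=0)
model_b = toy_model((2.0, 1.0, 0.5), dim=64, seed=1)
overlap = subspace_overlap(word_gram_eigvecs(model_a, 3), word_gram_eigvecs(model_b, 3))
print(f"top-3 subspace overlap: {overlap:.3f}")
```

Comparing subspaces rather than individual eigenvectors avoids spurious mismatches from sign flips and from arbitrary bases inside repeated eigenvalues. With real models, the two vocabularies would first need to be restricted to a shared word list.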
What does a genuinely *different* geometric channel look like, for contrast? The Polar Probe work shows that syntactic relations live in a separate, angular code: the type and direction of grammatical relations are encoded through both distance and angular position, not through the radial coarse-to-fine nesting that taxonomy uses (see: How do language models encode syntactic relations geometrically?). That's the useful lateral move: these notes describe more than one kind of structured geometry, and the kinds have distinguishable signatures (radial-nested for taxonomy, polar-angular for syntax). But the meaningful axis of difference is *taxonomy vs. syntax*, not *hierarchy-mechanism vs. corpus*.
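To see why angle carries information that distance alone cannot, consider a deliberately tiny illustration. This is not the Polar Probe itself: the two-dimensional space, the relation axes in `RELATION_AXES`, and the helper names are all hypothetical. Two dependents sit at the same distance from a head word, so a distance-only readout cannot tell their relations apart, while the direction of the offset can.

```python
import numpy as np

# Hypothetical unit-length relation axes in a 2-D probe space (illustrative only).
RELATION_AXES = {"nsubj": np.array([1.0, 0.0]), "obj": np.array([0.0, 1.0])}

def distance(a, b):
    """Euclidean distance between two points in the probe space."""
    return float(np.linalg.norm(b - a))

def offset_cosine(src, dst, relation):
    """Cosine between the src->dst offset and a relation axis.

    The sign flips when src and dst are swapped, so it also encodes direction.
    """
    offset = dst - src
    return float(offset @ RELATION_AXES[relation] / np.linalg.norm(offset))

def classify_by_angle(head, dep):
    """Pick the relation whose axis best aligns with the head->dependent offset."""
    return max(RELATION_AXES, key=lambda r: offset_cosine(head, dep, r))

head = np.array([0.0, 0.0])
subject = head + RELATION_AXES["nsubj"]   # same distance from head as `obj`
obj = head + RELATION_AXES["obj"]

print(distance(head, subject), distance(head, obj))             # equal distances
print(classify_by_angle(head, subject), classify_by_angle(head, obj))
```

A real probe would learn the projection and the axes from annotated parses; here they are fixed by hand purely to separate the roles of distance and angle.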
One caveat: a clean spectral signature doesn't guarantee that the model actually *uses* the structure well. Models can carry all the linearly decodable features for a task while their internal organization is fractured and fragile (see: Can models be smart without organized internal structure?), and grammatical competence degrades predictably as structural depth and recursion increase (see: Does LLM grammatical performance decline with structural complexity?). So the spectrum tells you that the geometry inherited from the corpus is *there*; it doesn't tell you that the model reliably reasons over it.
Sources (6 notes)
Leading eigenvectors of embedding Gram matrices separate broad taxonomic branches first, then progressively finer sub-branches—a coarse-to-fine spectral order that tracks the WordNet hypernym tree level by level, confirming predictions from co-occurrence statistics.
LLM hierarchical representations arise as a direct mathematical consequence of corpus statistics, not from hierarchy-specific mechanisms. Spectral analysis of word co-occurrence matrices predicts and reproduces the same nested geometry found in trained embeddings and word2vec models.
Word2vec embeddings and Gemma 2B unembeddings share identical coarse-to-fine spectral signatures across WordNet taxonomies. Since these models have entirely different objectives, the shared structure must originate from training text statistics rather than convergent functional needs.
The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.
Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.
LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.