How do co-occurrence statistics alone produce hierarchical concept organization?
This explores how the plain statistics of which words appear near which other words — with no built-in tree-builder — can give rise to the nested, general-to-specific organization of concepts inside language models.
The short version these notes offer: the hierarchy isn't added, it falls out of the math. When you take the matrix of how often words co-occur and look at its spectral structure (its eigenvalues and eigenvectors), the nested geometry of concepts is already there as a direct consequence of corpus statistics, not something the model learned a special trick to represent (see "Where does hierarchical structure in language models come from?").
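To make "the matrix of how often words co-occur" concrete, here is a minimal sketch using only the Python standard library. The six-sentence corpus, the four target words, and the choice to treat each sentence as one co-occurrence window are all illustrative assumptions, not the setup used in the cited notes.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus; each sentence is treated as one co-occurrence window.
corpus = [
    "the dog chased the cat",
    "the cat watched the dog",
    "a dog and a cat are animals",
    "the chair stood by the table",
    "a table and a chair are furniture",
    "the table matched the chair",
]
targets = ["dog", "cat", "chair", "table"]


def cooccurrence_counts(sentences, vocab):
    """Count how often each unordered pair of vocab words shares a sentence."""
    vocab_set = set(vocab)
    counts = Counter()
    for sentence in sentences:
        present = sorted(set(sentence.split()) & vocab_set)
        for a, b in combinations(present, 2):
            counts[(a, b)] += 1
    return counts


counts = cooccurrence_counts(corpus, targets)


def cooc(a, b):
    """Symmetric lookup; pairs are stored in sorted order."""
    return counts[tuple(sorted((a, b)))]


# Symmetric co-occurrence matrix with a zero diagonal.
matrix = [[0 if a == b else cooc(a, b) for b in targets] for a in targets]

for word, row in zip(targets, matrix):
    print(f"{word:>6}", row)
```

On this toy input the matrix is block-structured: dog and cat co-occur three times, chair and table co-occur three times, and the two pairs never meet. That block pattern is the raw material the spectral analysis works on, and it exists before any model is trained.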
What makes this concrete is the *order* in which that structure appears. The leading eigenvectors of an embedding's Gram matrix carve the concept space coarsely first, into broad branches like animal-vs-object, and then into progressively finer ones, level by level, in a way that closely tracks WordNet's hand-built hypernym tree (see "Do embedding eigenvectors organize taxonomy from coarse to fine?"). So 'hierarchy' here isn't a stored tree; it's a spectral ordering, where the strongest statistical signal happens to be the most general distinction and weaker signals are the finer ones.
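The coarse-to-fine ordering can be seen on a tiny synthetic case. The sketch below builds hypothetical tree-structured embeddings (one shared root feature, one feature per branch, one per leaf), forms their Gram matrix, and extracts eigenpairs with plain power iteration plus deflation. The concepts, feature layout, and weights are assumptions chosen for illustration, and power iteration stands in for whatever eigensolver the cited work used.

```python
import math

concepts = ["dog", "cat", "chair", "table"]

# Hypothetical embeddings. Columns: root, animal, object, dog, cat, chair, table.
# Weights are illustrative; unequal values only serve to avoid tied eigenvalues.
embeddings = [
    [1.0, 1.5, 0.0, 1.0, 0.0, 0.0, 0.0],  # dog
    [1.0, 1.5, 0.0, 0.0, 1.0, 0.0, 0.0],  # cat
    [1.0, 0.0, 1.2, 0.0, 0.0, 0.7, 0.0],  # chair
    [1.0, 0.0, 1.2, 0.0, 0.0, 0.0, 0.7],  # table
]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gram_matrix(rows):
    """Gram matrix G[i][j] = <row_i, row_j>."""
    return [[dot(r, s) for s in rows] for r in rows]


def matvec(m, v):
    return [dot(row, v) for row in m]


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def top_eigenpairs(m, k, iters=500):
    """Power iteration with deflation for a symmetric positive semidefinite matrix."""
    m = [row[:] for row in m]  # work on a copy
    n = len(m)
    pairs = []
    for _ in range(k):
        # Arbitrary start vector with no special symmetry.
        v = normalize([1.0 / (i + 1) for i in range(n)])
        for _ in range(iters):
            v = normalize(matvec(m, v))
        value = dot(v, matvec(m, v))  # Rayleigh quotient
        pairs.append((value, v))
        # Deflate: subtract the component just found.
        for i in range(n):
            for j in range(n):
                m[i][j] -= value * v[i] * v[j]
    return pairs


pairs = top_eigenpairs(gram_matrix(embeddings), k=4)

for value, vector in pairs:
    loadings = ", ".join(f"{c}={x:+.2f}" for c, x in zip(concepts, vector))
    print(f"eigenvalue {value:5.2f}: {loadings}")
```

With these weights the eigenvalues come out at roughly 8.70, 4.17, 1.00, and 0.49. The first eigenvector loads on all four concepts with the same sign (the shared root), the second separates dog and cat from chair and table, and only the last two split the leaves within each branch. The coarse split wins simply because a branch feature is shared by more items than a leaf feature, so it contributes more to the Gram matrix.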
There's a frequency story underneath this that explains *why* the coarse stuff dominates. General words (hypernyms like 'animal') simply occur far more often than specific ones (hyponyms like 'beagle'), because every specific instance is also an instance of the general category (see "Does word frequency correlate with semantic abstraction?"). High frequency means strong, stable co-occurrence patterns, which means those distinctions land in the dominant directions of the statistics, i.e., near the top of the emergent tree. Abstraction rides on frequency, and frequency rides on counting.
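The counting logic behind "every specific instance is also an instance of the general category" can be sketched by rolling mention counts up a taxonomy. The taxonomy and the counts below are made up for illustration. Note that this counts category *instances*, which is the intuition behind the claim; the observation that hypernym *words* are more frequent is an empirical finding from the WordNet analysis, not something this sketch demonstrates.

```python
from collections import Counter

# Hypothetical taxonomy (child -> parent) and made-up direct mention counts.
parent = {"beagle": "dog", "poodle": "dog", "dog": "animal", "cat": "animal"}
direct_mentions = Counter(
    {"beagle": 3, "poodle": 2, "dog": 10, "cat": 8, "animal": 5}
)


def rolled_up_counts(direct, parent_of):
    """Credit each mention to the concept itself and to every ancestor."""
    totals = Counter()
    for concept, n in direct.items():
        node = concept
        while node is not None:
            totals[node] += n
            node = parent_of.get(node)  # None once we pass the root
    return totals


totals = rolled_up_counts(direct_mentions, parent)

for concept in ["animal", "dog", "cat", "beagle", "poodle"]:
    print(f"{concept:>7}: {totals[concept]}")
```

Here 'animal' ends up with 28 instances, 'dog' with 15, and 'beagle' with 3. Whatever the direct counts are, a parent's total can never fall below any child's, so mass accumulates toward the general end of the tree.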
Widen the lens and this fits a broader theme across these notes: meaning in these models is purely *relational*. LLMs reconstruct culturally situated structure by compressing the relations between words alone, with no grounding in the outside world required, which is essentially Saussure's idea of language as a system of differences operationalized by matrix algebra (see "Can language models learn meaning without engaging the world?"). Hierarchical organization is one shape that relational compression naturally takes; another is the way many semantic features collapse onto a few entangled low-dimensional axes that mirror human evaluation dimensions (see "Do LLM semantic features organize along human evaluation dimensions?").
The interesting twist worth carrying away: 'emerges from statistics' isn't a downgrade. Circuit tracing inside actual trained models finds features genuinely arranged in tiers (tokens, then abstract concepts, then operations, then outputs), with bigger models growing *richer* abstract layers rather than just memorizing more (see "How do language models organize features across processing layers?"). The same co-occurrence pressure that builds the hierarchy also has a cost, though: it pushes models to compress aggressively toward broad categories, losing the fine-grained distinctions humans keep for situated use (see "Do LLMs compress concepts more aggressively than humans do?"). So co-occurrence statistics give you the skeleton of conceptual organization for free, but the same drive that builds the tree is what blurs its smallest branches.
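Why a compressor keeps broad categories and drops fine ones can be illustrated with a toy quantization example. This is a centroid-codebook sketch, not the Rate-Distortion analysis from the cited note; the 2-D points and the groupings are hypothetical, chosen so that two branches are far apart while the leaves inside each branch sit close together.

```python
# Hypothetical 2-D points: two well-separated branches, each with two nearby leaves.
points = {
    "dog": (4.0, 1.0),
    "cat": (4.0, -1.0),
    "chair": (-4.0, 0.5),
    "table": (-4.0, -0.5),
}

# Codebooks of increasing size: every group shares a single code (its centroid).
codebooks = {
    1: [["dog", "cat", "chair", "table"]],
    2: [["dog", "cat"], ["chair", "table"]],
    4: [["dog"], ["cat"], ["chair"], ["table"]],
}


def centroid(names):
    """Coordinate-wise mean of the named points."""
    coords = [points[n] for n in names]
    return tuple(sum(axis) / len(coords) for axis in zip(*coords))


def mean_distortion(groups):
    """Mean squared distance from each point to its group's centroid."""
    total = 0.0
    for group in groups:
        c = centroid(group)
        for name in group:
            total += sum((p - q) ** 2 for p, q in zip(points[name], c))
    return total / len(points)


distortion = {k: mean_distortion(groups) for k, groups in codebooks.items()}

for k, d in distortion.items():
    print(f"{k} code(s): mean squared distortion = {d:.3f}")
```

Going from one code to two (animal vs. object) drops the distortion from 16.625 to 0.625, while doubling the codebook again to separate the leaves only recovers the remaining 0.625. A system optimizing for compression efficiency has little incentive to pay for that last step, which is one way to picture how the smallest branches get blurred.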
Sources (7 notes)
LLM hierarchical representations arise as a direct mathematical consequence of corpus statistics, not from hierarchy-specific mechanisms. Spectral analysis of word co-occurrence matrices predicts and reproduces the same nested geometry found in trained embeddings and word2vec models.
Leading eigenvectors of embedding Gram matrices separate broad taxonomic branches first, then progressively finer sub-branches—a coarse-to-fine spectral order that tracks the WordNet hypernym tree level by level, confirming predictions from co-occurrence statistics.
WordNet analysis shows hypernyms (general concepts) occur more frequently than hyponyms (specific ones). Combined with LLMs' frequency bias, this means preferring common paraphrases systematically drifts toward abstraction, erasing expert-level specificity.
Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.
Twenty-eight semantic axes in LLM embeddings reduce to three principal components matching human EPA structure. Intervening on one feature predictably shifts aligned features proportionally, creating unavoidable off-target effects that reflect how meaning is fundamentally organized.
Circuit tracing in Claude models reveals features progress from token-level inputs to abstract concepts to functional operations to outputs. Larger models develop richer abstract features, suggesting scaling enables higher-level conceptual reasoning rather than pattern memorization.
Using Rate-Distortion Theory on cognitive datasets, LLMs capture broad category structure but lose fine-grained distinctions humans preserve. LLMs maximize compression efficiency; humans trade compression for contextual meaning that enables situated action.