Why do frequent words rank higher in taxonomic abstraction hierarchies?
This explores why the most common words tend to sit near the top of meaning hierarchies, and what that frequency-abstraction link means for how language models drift toward the generic over the specific.
The short version from the corpus is that the link is neither a coincidence nor a deep design choice: it is a statistical consequence of how language is structured and how models absorb it. The clearest piece is the observation that general concepts (hypernyms like 'animal') occur more often than the specific ones nested under them (hyponyms like 'pygmy marmoset'). There are fewer abstract categories, and each is reused across far more contexts, so abstraction and frequency run along the same gradient: climb the taxonomy and word frequency tends to rise with you (see: Does word frequency correlate with semantic abstraction?).
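The "reused across far more contexts" argument can be sketched in a few lines. This is a toy illustration, not the WordNet analysis the note describes: the taxonomy, the leaf counts, and the helper names `depth` and `applicable_contexts` are all made up, and the one modelling assumption is that a hypernym can be used wherever any of its hyponyms can.

```python
from collections import defaultdict

# Hypothetical toy taxonomy (child -> parent); not taken from WordNet.
PARENT = {
    "mammal": "animal",
    "bird": "animal",
    "primate": "mammal",
    "dog": "mammal",
    "pygmy marmoset": "primate",
    "gorilla": "primate",
    "sparrow": "bird",
}

# Made-up counts of contexts that are specifically about each leaf concept.
LEAF_CONTEXTS = {"pygmy marmoset": 2, "gorilla": 10, "dog": 50, "sparrow": 15}


def depth(word):
    """Number of hypernym steps from `word` up to the root."""
    d = 0
    while word in PARENT:
        word = PARENT[word]
        d += 1
    return d


def applicable_contexts(leaf_contexts):
    """Count, for every word, the contexts in which it could be used.

    Assumption: a hypernym is usable wherever any of its hyponyms is,
    so each leaf context is credited to the leaf and all its ancestors.
    """
    totals = defaultdict(int)
    for leaf, n in leaf_contexts.items():
        word = leaf
        totals[word] += n
        while word in PARENT:
            word = PARENT[word]
            totals[word] += n
    return dict(totals)


totals = applicable_contexts(LEAF_CONTEXTS)
# List words from the most abstract (depth 0) to the most specific.
for word in sorted(totals, key=lambda w: (depth(w), -totals[w])):
    print(f"depth={depth(word)}  {word:15s} {totals[word]}")
```

Under that assumption a parent can never have fewer applicable contexts than any of its children, which is the gradient the paragraph describes. Real usage counts are noisier, since speakers do not always back off to the hypernym.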
What makes this more than a vocabulary curiosity is where the hierarchy itself comes from. You might assume a model needs some dedicated machinery to build tree-like concept structure, but the geometry falls out directly from word co-occurrence statistics, with no hierarchy-specific mechanism required (see: Where does hierarchical structure in language models come from?). The same spectral structure shows its hand in the ordering: the leading eigenvectors of embedding Gram matrices split the broadest taxonomic branches first, then progressively finer ones, tracking the WordNet hypernym tree level by level (see: Do embedding eigenvectors organize taxonomy from coarse to fine?). So being frequent, being abstract, and appearing early in the spectral ordering are three faces of one underlying co-occurrence regularity.
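A minimal sketch of the coarse-to-fine spectral ordering, under stated assumptions: the four words, the branch labels, and the three similarity levels are hypothetical, and the Gram matrix is built by hand from tree distance rather than from trained embeddings. To stay dependency-free it finds eigenvectors with power iteration plus deflation, which is a stand-in for whatever decomposition the cited notes use.

```python
import math

# Hypothetical leaves of a two-level toy taxonomy: two mammals, two birds.
WORDS = ["dog", "cat", "sparrow", "eagle"]
BRANCH = {"dog": "mammal", "cat": "mammal", "sparrow": "bird", "eagle": "bird"}

# Assumed similarity levels: self > same branch > different branch.
SELF, SAME_BRANCH, CROSS_BRANCH = 1.0, 0.6, 0.2


def gram_matrix(words):
    """Build a Gram matrix whose entries depend only on tree distance."""
    rows = []
    for a in words:
        row = []
        for b in words:
            if a == b:
                row.append(SELF)
            elif BRANCH[a] == BRANCH[b]:
                row.append(SAME_BRANCH)
            else:
                row.append(CROSS_BRANCH)
        rows.append(row)
    return rows


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def top_eigenpair(m, iters=500):
    """Power iteration: dominant eigenvalue and unit eigenvector of m."""
    v = [1.0 / (i + 1) for i in range(len(m))]  # generic, asymmetric start
    for _ in range(iters):
        w = matvec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    value = sum(x * y for x, y in zip(v, matvec(m, v)))  # Rayleigh quotient
    return value, v


def deflate(m, value, v):
    """Subtract a found eigenpair so the next one becomes dominant."""
    n = len(m)
    return [[m[i][j] - value * v[i] * v[j] for j in range(n)] for i in range(n)]


G = gram_matrix(WORDS)
lam1, v1 = top_eigenpair(G)
G1 = deflate(G, lam1, v1)
lam2, v2 = top_eigenpair(G1)
G2 = deflate(G1, lam2, v2)
lam3, v3 = top_eigenpair(G2)

for lam, vec in [(lam1, v1), (lam2, v2), (lam3, v3)]:
    signs = "".join("+" if x > 0 else "-" for x in vec)
    print(f"eigenvalue={lam:.3f}  signs over {WORDS}: {signs}")
```

For a matrix of this form the eigenvalues are SELF + SAME_BRANCH + 2·CROSS_BRANCH, SELF + SAME_BRANCH − 2·CROSS_BRANCH, and SELF − SAME_BRANCH, so with the values assumed here they are 2.0, 1.2, and 0.4. The first direction has a uniform sign, the second separates mammals from birds, and only the third separates words inside a branch: the broad split arrives before the fine one.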
The consequence is the part worth knowing: because models carry a frequency bias, this gradient quietly pulls their output toward generality. Preferring the common paraphrase systematically drifts meaning upward toward abstraction, eroding the expert-level specificity that lives in rarer terms (see: Does word frequency correlate with semantic abstraction?). That dovetails with evidence that LLMs compress concepts far more aggressively than humans do: they capture broad category structure but shed the fine-grained distinctions humans hold onto for situated, contextual meaning (see: Do LLMs compress concepts more aggressively than humans do?).
If you want the flip side, namely what this implies for training, the corpus points to interventions that fight the frequency pull. One reverses the usual easy-to-hard curriculum by feeding rare data first, treating rarity not as conceptual difficulty but as a signal of where the model's distribution is weakest (see: Does ordering training data by rarity actually improve language models?). Another sidesteps raw volume entirely by organizing knowledge into explicit domain taxonomies, so the model learns where a concept sits in a structure rather than just how often its words appear (see: Can organizing knowledge structures beat raw training data volume?). Both are, in effect, ways of paying attention to the rare-and-specific that frequency-driven abstraction tends to wash out.
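The rare-first idea can be sketched as a simple reordering step. This is not CTFT's actual scoring, which the notes do not spell out: here rarity is approximated, as an assumption, by the mean negative log-probability of an example's tokens under an add-one-smoothed unigram model of a reference corpus. The function names and sample texts are hypothetical.

```python
import math
from collections import Counter


def rarity_score(text, ref_counts, total):
    """Mean negative log-probability of the tokens in `text`.

    Assumption: an add-one-smoothed unigram model over the reference
    corpus stands in for "distance from the pre-training distribution".
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    vocab = len(ref_counts) + 1  # one extra bucket for unseen tokens
    return sum(
        -math.log((ref_counts[t] + 1) / (total + vocab)) for t in tokens
    ) / len(tokens)


def rare_first(examples, reference_texts):
    """Order fine-tuning examples from rarest to most common."""
    ref_counts = Counter(
        tok for doc in reference_texts for tok in doc.lower().split()
    )
    total = sum(ref_counts.values())
    return sorted(
        examples,
        key=lambda ex: rarity_score(ex, ref_counts, total),
        reverse=True,
    )


# Made-up reference corpus and fine-tuning examples.
reference = ["the animal ate the food", "the animal slept", "the dog is an animal"]
examples = ["the animal ate", "the pygmy marmoset ate gum"]

ordered = rare_first(examples, reference)
for ex in ordered:
    print(ex)
```

The example built from rare, specific terms sorts ahead of the one built from common, general terms, so the specific material is seen first. A real pipeline would more plausibly score rarity with the model's own likelihoods than with unigram counts.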
Sources (6 notes)
WordNet analysis shows hypernyms (general concepts) occur more frequently than hyponyms (specific ones). Combined with LLMs' frequency bias, this means preferring common paraphrases systematically drifts toward abstraction, erasing expert-level specificity.
LLM hierarchical representations arise as a direct mathematical consequence of corpus statistics, not from hierarchy-specific mechanisms. Spectral analysis of word co-occurrence matrices predicts and reproduces the same nested geometry found in trained embeddings and word2vec models.
Leading eigenvectors of embedding Gram matrices separate broad taxonomic branches first, then progressively finer sub-branches—a coarse-to-fine spectral order that tracks the WordNet hypernym tree level by level, confirming predictions from co-occurrence statistics.
Using Rate-Distortion Theory on cognitive datasets, LLMs capture broad category structure but lose fine-grained distinctions humans preserve. LLMs maximize compression efficiency; humans trade compression for contextual meaning that enables situated action.
CTFT fine-tunes LLMs on rare data first because rarity signals distributional weakness, not conceptual difficulty. This reframes curriculum learning as managing distance from pre-training distribution rather than pedagogical scaffolding.
StructTuning achieves 50% of full-corpus performance using only 0.3% of training data by organizing chunks into auto-generated domain taxonomies. The model learns knowledge position within conceptual structures rather than raw text patterns, matching how students learn from textbooks.