Do embedding eigenvectors organize taxonomy from coarse to fine?
Can we predict how embeddings encode taxonomic hierarchies by examining their spectral structure? This tests whether word co-occurrence statistics alone produce the observed hierarchical geometry in language models.
The hierarchical geometry of concept embeddings is not just present but ordered in a specific way. When you take the embedding Gram matrix and read off its eigenvectors in order of decreasing eigenvalue, the first ones separate the broadest taxonomic branches; later eigenvectors split progressively finer sub-branches. The spectral organization is coarse-to-fine, and it tracks the WordNet hypernym tree level by level. This is a stronger claim than "the representation has hierarchical structure": it specifies where in the spectrum each level of the taxonomy lives.
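The procedure in this paragraph can be sketched on a toy example. Everything here is an assumption made for illustration: the eight-leaf taxonomy, the one-hot ancestor encoding standing in for real embeddings, and the variable names. It is not the note's data or method, only a minimal instance of "build the Gram matrix, sort eigenvectors by eigenvalue, inspect the leading one".

```python
import numpy as np

# Hypothetical toy taxonomy: 8 leaves under a balanced binary tree.
leaves = ["terrier", "spaniel", "sparrow", "finch",
          "hammer", "wrench", "car", "truck"]
branch = np.array([0, 0, 0, 0, 1, 1, 1, 1])      # animal vs. artifact
sub_branch = np.array([0, 0, 1, 1, 2, 2, 3, 3])  # dog, bird, tool, vehicle
leaf_id = np.arange(8)                           # each leaf is its own group


def one_hot(labels):
    """Return one row per item with a 1 in the column of its group."""
    return np.eye(labels.max() + 1)[labels]


# Stand-in embeddings (assumption): concatenate one-hot codes of each
# leaf's branch, sub-branch, and identity, so items that share more
# ancestors have a larger inner product.
X = np.hstack([one_hot(branch), one_hot(sub_branch), one_hot(leaf_id)])

# Center the embeddings, then form the Gram matrix of inner products.
Xc = X - X.mean(axis=0)
gram = Xc @ Xc.T

# eigh returns eigenvalues in ascending order; flip to descending.
eigvals, eigvecs = np.linalg.eigh(gram)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# The sign pattern of the leading eigenvector shows which items it separates.
leading = eigvecs[:, 0]
print("eigenvalues:", np.round(eigvals, 3))
for name, value in zip(leaves, leading):
    print(f"{name:8s} {value:+.3f}")
```

In this toy setup the leading eigenvector takes one sign on the four animal leaves and the opposite sign on the four artifact leaves. Real embeddings are noisier, so in practice the sign pattern would be compared against the tree statistically rather than read off exactly.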
The pattern is what makes the underlying co-occurrence theory falsifiable rather than merely suggestive. A purely descriptive observation that embeddings cluster by category could be explained many ways; a derived prediction that the principal components encode the taxonomy from coarse to fine, confirmed across many sampled WordNet subtrees, is a tight fit between a statistical mechanism and an observed geometry. The eigenvalue ordering is the fingerprint: dominant variance carries the broad ontological cuts (animal vs. artifact), residual variance carries the fine ones (terrier vs. spaniel).
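A toy calculation shows why eigenvalue order would track tree depth. It is a sketch under assumptions that are not in the note: a balanced binary tree of depth $D$ (depth 1 is the coarsest split, depth $D$ is the leaves), a hypothetical positive weight $w_m$ per depth, and leaf embeddings formed by summing one orthonormal vector per node on the root-to-leaf path. Write $n_m(i)$ for the depth-$m$ node on the path to leaf $i$.

```latex
% Toy model (assumed for illustration, not the note's derivation).
% Gram matrix: leaves are more similar the more ancestors they share.
K_{ij} = \sum_{m=1}^{D} w_m^{2}\, \mathbb{1}\!\left[\, n_m(i) = n_m(j) \,\right]

% The eigenvector that splits one depth-(\ell-1) node into its two children
% (+1 on one child's leaves, -1 on the other's, 0 elsewhere) has eigenvalue
\lambda_\ell = \sum_{m=\ell}^{D} 2^{\,D-m}\, w_m^{2},
\qquad \text{multiplicity } 2^{\,\ell-1}

% Consecutive levels are separated by a strictly positive gap:
\lambda_\ell - \lambda_{\ell+1} = 2^{\,D-\ell}\, w_\ell^{2} > 0
```

In this toy model the gap is positive for any positive weights, so coarser splits always receive larger eigenvalues once centering has removed the constant all-ones direction. For $D = 3$ with all weights equal to 1, the nonzero eigenvalues are 7, then 3 (twice), then 1 (four times).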
Why it matters: this gives interpretability a concrete, model-agnostic probe. If you want to test whether a representation space encodes a taxonomy in the way co-occurrence statistics predict, you check the spectral ordering against the tree depth, and the same probe applies to any embedding determined by co-occurrence, not just transformer internals. The counterpoint is that coarse-to-fine spectral order is exactly what generic kernel-decay assumptions produce, so finding it is evidence for the statistical account, not for a bespoke hierarchical computation.
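One possible way to implement the check of spectral ordering against tree depth is sketched here. The function names, the 0.5 threshold, and the toy data are all assumptions made for illustration; a real probe would take embeddings from a model and depth labels from WordNet. Each eigenvector is assigned the shallowest tree level whose grouping explains most of its squared norm, and the probe passes if those depths never decrease as the eigenvalue falls.

```python
import numpy as np


def between_group_share(v, labels):
    """Fraction of a zero-mean vector's squared norm carried by group means."""
    share = 0.0
    for g in np.unique(labels):
        members = v[labels == g]
        share += members.size * members.mean() ** 2
    return share / np.sum(v ** 2)


def spectral_depth_profile(X, levels, threshold=0.5):
    """Assign a tree depth to each eigenvector of the centered Gram matrix.

    `levels` is a list of label arrays ordered coarse to fine. Eigenvectors
    are visited from largest to smallest eigenvalue; near-zero eigenvalues
    are skipped. The depth is the first level (1-based) whose grouping
    explains at least `threshold` of the eigenvector, or len(levels) + 1
    if no supplied level does.
    """
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(Xc @ Xc.T)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1e-9 * eigvals[0]

    depths = []
    for v in eigvecs[:, keep].T:
        depth = len(levels) + 1
        for d, labels in enumerate(levels, start=1):
            if between_group_share(v, labels) >= threshold:
                depth = d
                break
        depths.append(depth)
    return depths


# Toy data (assumption): 8 leaves, one-hot codes for branch, sub-branch, leaf.
branch = np.array([0, 0, 0, 0, 1, 1, 1, 1])
sub_branch = np.array([0, 0, 1, 1, 2, 2, 3, 3])
leaf_id = np.arange(8)
X = np.hstack([np.eye(k.max() + 1)[k] for k in (branch, sub_branch, leaf_id)])

depths = spectral_depth_profile(X, [branch, sub_branch, leaf_id])
is_coarse_to_fine = bool(np.all(np.diff(depths) >= 0))
print(depths, is_coarse_to_fine)
```

Scoring each eigenvector by group-mean share rather than by sign pattern keeps the check stable when several eigenvectors share an eigenvalue, since any mixture of same-level splits still scores at that level. On noisy embeddings a rank correlation between eigenvalue rank and assigned depth may be a more forgiving summary than strict monotonicity.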
Inquiring lines that use this note as a source (42)
This note is a source for these synthesized inquiries.
- How do low-dimensional representation structures entangle multiple cultures together?
- How do embedding dimension limits constrain what concept models can represent?
- What compression explains why syntax fits in low-dimensional subspaces?
- Do grokking phases correspond to transitions between nesting levels?
- Can contrastive learning fix the semantic association problem in embeddings?
- What mathematical limits constrain embedding-based retrieval systems?
- How does embedding dimension affect which documents can rank together?
- How does discretization make item representations more distinguishable?
- Why does text encoding create different subspaces across domains?
- Why do embeddings measure semantic association instead of task relevance?
- Do multi-vector or cross-encoder models escape these dimensional constraints?
- What semantic classifier design avoids lexical variation without genuine conceptual distinctness?
- Can lower embedding dimensions alone solve the diversity problem without attention mechanisms?
- Can steering vectors prove that representations are genuinely organized?
- What fine-grained distinctions matter most for human situated action in categories?
- Why do embedding-based retrieval systems fail on vocabulary mismatch?
- Can multi-facet item identifiers preserve both uniqueness and semantic meaning?
- How do different feed-weighting schemes construct distinct network topologies at population scale?
- Does the linear representation hypothesis reflect networks or reflect our analysis tools?
- What sampling strategies prevent nonsensical combinations when composing taxonomy nodes?
- Can representation engineering cleanly isolate single features in entangled semantic space?
- Can hierarchical key point structures improve opinion summarization?
- Why must world models be nested rather than flat and uniform?
- How does iconicity detection work within static embeddings before any attention?
- How do static embeddings and contextualized representations divide semantic labor?
- Why is a combinatorial framework better than family resemblance classification?
- What makes modernized N-gram embeddings composable with transformer architectures?
- Why do leading embedding eigenvectors align with WordNet taxonomy structure?
- What spectral signatures distinguish hierarchy-driven geometry from corpus-driven geometry?
- Can vector embeddings measure task relevance instead of semantic similarity?
- How do hierarchical knowledge layers capture different types of narrative information?
- Why do frequent words rank higher in taxonomic abstraction hierarchies?
- Does the same spectral signature appear across different embedding models?
- Do generic kernel-decay assumptions alone explain coarse-to-fine spectral ordering?
- Can spectral eigenvector ordering serve as a model-agnostic interpretability probe?
- How do vector embeddings fail to capture task-relevant document relationships?
- How do co-occurrence statistics alone produce hierarchical concept organization?
- What makes hierarchical reasoning effective for taxonomy induction?
- What physical structure does a Gaussian-regularized latent space actually encode?
- Why do embeddings measure association instead of actual task relevance?
- What makes regularization an implicit factor in embedding geometry?
- How do latents at the same hierarchy level become more correlated than tokens?
Related concepts in this collection (3)
- Where does hierarchical structure in language models come from?
  Do LLMs build hierarchical concept geometry through dedicated mechanisms, or does it emerge naturally from word co-occurrence patterns in training data? Understanding the source matters for interpreting what representations actually reveal about model computation.
  this coarse-to-fine ordering is the specific prediction of the distributional mechanism
- Does word frequency correlate with semantic abstraction?
  Explores whether LLMs' preference for high-frequency language also pulls them toward more abstract, general meanings, and whether this shapes how they handle expert knowledge.
  both ground the abstraction structure of representations in WordNet-level statistical regularities
- Do language models use the hierarchical geometry they inherit?
  Word2vec and Gemma share the same hierarchical spectral signature despite vastly different architectures and purposes. This suggests shared statistical origins, but leaves open whether the LLM actually recruits this structure for reasoning or simply inherits unused geometry.
  grounds: the cross-model evidence that the coarse-to-fine spectral order is a statistical fingerprint, not a transformer-specific computation; the same probe applies to word2vec
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Hierarchical Concept Geometry in Language Models Emerges from Word Co-occurrence
- Semantic Structure in Large Language Model Embeddings
- Topic Modeling in Embedding Spaces
- Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
- From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning
- A polar coordinate system represents syntax in large language models
- Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words
- Word Meanings in Transformer Language Models
Original note title
the leading embedding eigenvectors split taxonomy coarse to fine mirroring the wordnet tree