How do static embeddings and contextualized representations divide semantic labor?
This explores the division of labor between static embeddings (the fixed vector a word carries before the model reads anything around it, its baseline lexical entry) and contextualized representations, the activations attention builds on the fly as it processes a sentence. The corpus suggests the split is real and surprisingly principled: static embeddings already carry a heavy load of meaning, and attention specializes in everything that depends on neighbors.
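To make the two terms concrete, here is a minimal sketch of one attention step over made-up vectors. The 2-d embeddings, the `STATIC` table, and the `contextualize` helper are all hypothetical, and the projections are identity matrices, so this is a cartoon of self-attention rather than any real model's computation. It shows only that the same static vector for "bank" turns into different contextualized vectors depending on its neighbor.

```python
import math

# Hypothetical 2-d static embeddings; the numbers are invented for illustration.
STATIC = {
    "bank":  [1.0, 1.0],   # ambiguous on its own
    "river": [0.0, 2.0],
    "money": [2.0, 0.0],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def contextualize(tokens, position):
    """One attention step with identity projections: the token at `position`
    becomes a softmax-weighted average of every static vector in the sentence."""
    query = STATIC[tokens[position]]
    keys = [STATIC[t] for t in tokens]
    weights = softmax([dot(query, k) for k in keys])
    return [sum(w * k[i] for w, k in zip(weights, keys)) for i in range(len(query))]

static_bank = STATIC["bank"]                          # same in every sentence
bank_by_river = contextualize(["river", "bank"], 1)   # [0.5, 1.5]
bank_by_money = contextualize(["money", "bank"], 1)   # [1.5, 0.5]
```

The static entry never changes; only the attention-weighted mixture does, pulling "bank" toward whichever neighbor is present.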
The striking finding is how much semantic work happens before attention even fires. Clustering of RoBERTa's static embeddings shows sensitivity to five psycholinguistic measures, including valence, concreteness, iconicity, and taboo: properties we'd assume require understanding, yet present in the raw lexical entry (see "Do transformer static embeddings actually encode semantic meaning?"). The structure goes deeper than isolated word features. The leading eigenvectors of the embedding Gram matrix split taxonomy coarse-to-fine, separating broad categories first and finer distinctions later, tracking the WordNet hypernym tree level by level (see "Do embedding eigenvectors organize taxonomy from coarse to fine?"). So static space isn't a bag of arbitrary points. It's a pre-organized semantic map, and it earns that organization purely from co-occurrence statistics, with no grounding in the world (see "Can language models learn meaning without engaging the world?").
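The coarse-to-fine spectral ordering can be illustrated on a toy example. The sketch below assumes four invented word vectors with a built-in two-level hierarchy (animal vs. plant, then one split inside each branch) and uses plain power iteration with deflation in place of a library eigensolver. None of the numbers come from the cited work; the point is only that the largest eigenvector of the Gram matrix picks out the broad split and the smaller ones pick out the finer splits.

```python
import math

WORDS = ["dog", "cat", "oak", "pine"]
# Hypothetical embeddings: axis 0 = animal vs plant (coarse),
# axis 1 = dog vs cat, axis 2 = oak vs pine (fine).
X = [
    [ 1.0,  0.6,  0.0],
    [ 1.0, -0.6,  0.0],
    [-1.0,  0.0,  0.4],
    [-1.0,  0.0, -0.4],
]

def gram(rows):
    """Gram matrix of pairwise dot products between word vectors."""
    return [[sum(a * b for a, b in zip(r, s)) for s in rows] for r in rows]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def normalize(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def top_eigenpairs(m, k, iters=200):
    """Power iteration with deflation: the k leading (eigenvalue, eigenvector) pairs."""
    m = [row[:] for row in m]
    n = len(m)
    pairs = []
    for _ in range(k):
        v = normalize([1.0 / (i + 1) for i in range(n)])  # fixed, generic start
        for _step in range(iters):
            v = normalize(matvec(m, v))
        lam = sum(a * b for a, b in zip(v, matvec(m, v)))
        pairs.append((lam, v))
        # Deflate: subtract the component just found so the next one can surface.
        m = [[m[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return pairs

pairs = top_eigenpairs(gram(X), 3)
for lam, vec in pairs:
    print(round(lam, 3), [round(c, 3) for c in vec])
```

In this toy, the first eigenvector gives dog and cat one sign and oak and pine the other; the second separates dog from cat while leaving the plants near zero; the third does the same for oak versus pine. Real embedding spaces are far noisier, so this shows the pattern, not evidence for it.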
What attention adds is relational and directional. The Polar Probe finds that contextualized activations encode syntactic *type and direction* through both angle and distance between embeddings, nearly doubling accuracy over distance-only methods. This is information about how one word relates to another, which can't exist until both words are in play (see "How do language models encode syntactic relations geometrically?"). That gives the cleaner way to read the division: static space holds *what a word means on its own*; contextualization computes *what it means here, in relation to these other words*. One framing pushes this even further: knowledge in transformers isn't stored and retrieved so much as it flows through the residual stream as activations, generated fresh in each pass rather than looked up (see "Do transformer models store knowledge or generate it continuously?").
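Why angle adds information beyond distance can be seen in a few lines. The sketch below is not the Polar Probe itself; the 2-d points and the helper names are hypothetical stand-ins for word positions in some probe space. Two dependents sit at the same distance from a head word, so a distance-only readout cannot tell them apart, while the angle between the two offset vectors can.

```python
import math

def offset(head, dependent):
    """Vector pointing from the head word to the dependent word."""
    return [d - h for h, d in zip(head, dependent)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

# Hypothetical positions in a 2-d probe space.
verb = [0.0, 0.0]
subject = [-1.0, 0.0]
obj = [1.0, 0.0]

verb_to_subject = offset(verb, subject)
verb_to_object = offset(verb, obj)

# Distance alone: both dependents are equally far from the verb.
same_distance = math.isclose(norm(verb_to_subject), norm(verb_to_object))

# Angle: the two relations point in opposite directions (cosine of -1).
angle_cosine = cosine(verb_to_subject, verb_to_object)
```

Here `same_distance` is true while `angle_cosine` is -1, so only the angular readout distinguishes the two relations.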
But the handoff between the two is contested territory, and that's the part you might not expect. Strong priors from training can overpower the contextual layer: models fail to integrate what's in front of them when parametric associations dominate, and textual prompting alone does not override them. You have to intervene in the representations directly (see "Why do language models ignore information in their context?"). A related failure shows models leaning on raw statistical mass: across math, machine translation, commonsense reasoning, and tool calling, they consistently prefer high-frequency surface phrasings over semantically identical rare ones, suggesting the baseline layer tracks frequency more than meaning (see "Do language models really understand meaning or just surface frequency?"). So the labor isn't always cleanly divided. Sometimes the static priors refuse to yield the floor.
The most interesting move in the corpus is questioning whether the token is even the right unit for this split. Meta's Large Concept Model reasons over *sentence* embeddings in a language-agnostic space before decoding, suggesting the static-vs-contextual divide could be relocated to a higher level of abstraction entirely (see "Can reasoning happen at the sentence level instead of tokens?"). If you want to follow that thread, latent-thought models couple fast local variational learning with slow global decoder learning, a dual-rate scheme that looks a lot like a formalization of the same division: a stable substrate plus a fast, context-sensitive layer on top (see "Can latent thought vectors scale language models beyond parameters?").
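The dual-rate idea can be sketched on a toy regression. This is not the Latent-Thought Language Model training procedure: it uses point estimates instead of variational inference, and the data, learning rates, and function names are all assumptions made for illustration. It shows only the shape of the scheme, in which a per-sequence latent `z` is re-fit quickly from scratch each time while a shared weight `w` moves slowly across sequences.

```python
# Toy data: each "sequence" has observations y = W_TRUE * x + its own hidden offset.
W_TRUE = 3.0
XS = [0.0, 1.0, 2.0]
OFFSETS = [-1.0, 2.0]
DATA = [[W_TRUE * x + off for x in XS] for off in OFFSETS]

FAST_LR = 0.5    # local: per-sequence latent z, re-fit for every sequence
SLOW_LR = 0.05   # global: shared weight w, nudged a little per sequence

def mean(values):
    return sum(values) / len(values)

def fit_latent(w, ys, steps=3):
    """Fast local loop: gradient descent on z for one sequence, with w held fixed."""
    z = 0.0
    for _ in range(steps):
        grad_z = 2 * mean([w * x + z - y for x, y in zip(XS, ys)])
        z -= FAST_LR * grad_z
    return z

def train(epochs=200):
    """Slow global loop: one small step on w after each sequence's latent is fit."""
    w = 0.0
    for _ in range(epochs):
        for ys in DATA:
            z = fit_latent(w, ys)
            grad_w = 2 * mean([(w * x + z - y) * x for x, y in zip(XS, ys)])
            w -= SLOW_LR * grad_w
    return w

w_learned = train()
latents = [fit_latent(w_learned, ys) for ys in DATA]
print(round(w_learned, 4), [round(z, 4) for z in latents])
```

In this toy the slow variable settles on the structure shared by all sequences (the slope), while the fast variable absorbs whatever is specific to the sequence at hand (the offset), which is the same stable-substrate-plus-fast-layer split described above.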
Sources (9 notes)
Clustering analysis of RoBERTa embeddings reveals sensitivity to five psycholinguistic measures including valence, concreteness, iconicity, and taboo. This demonstrates that static embeddings function as genuine lexical entries containing semantic content before self-attention operates.
Leading eigenvectors of embedding Gram matrices separate broad taxonomic branches first, then progressively finer sub-branches—a coarse-to-fine spectral order that tracks the WordNet hypernym tree level by level, confirming predictions from co-occurrence statistics.
Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.
The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.
Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
LLMs show consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests models track statistical mass from pretraining rather than meaning-recognition as their primary mechanism.
Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.
Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.