How do hidden embeddings preserve more information than discrete tokens?
This explores why keeping reasoning in continuous hidden vectors carries richer signal than collapsing it into discrete text tokens — and where that advantage breaks down.
The cleanest demonstration in the corpus is multi-agent collaboration: when agents pass their internal representations directly to each other instead of writing out text and re-reading it, they exchange information losslessly, with accuracy gains reaching 14.6% and token use cut by roughly 71–84% Can agents share thoughts without converting them to text?. The intuition is that a hidden state is a high-dimensional point: it holds gradations, uncertainty, and partially-formed alternatives that a word, once chosen, throws away. Discretizing into tokens is a bottleneck; every step from a rich vector to a single token discards everything the model considered but didn't say.
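The bottleneck described above can be sketched in a few lines. This is a toy illustration, not any model's actual decoding code: the four-word vocabulary and both logit vectors are made-up values standing in for what a model computes just before it commits to a token. Two quite different internal states collapse to the same word, and the difference in uncertainty between them is invisible in the text.

```python
import math

VOCAB = ["yes", "no", "maybe", "unknown"]  # hypothetical toy vocabulary


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(probs):
    """Shannon entropy in bits: how undecided the distribution is."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def discretize(logits):
    """Greedy decoding: keep only the single most likely token."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return VOCAB[best]


# Two made-up pre-token states: one confident, one still weighing options.
confident = [6.0, 0.5, 0.2, 0.1]
hesitant = [1.2, 1.1, 1.0, 0.2]

token_a = discretize(confident)
token_b = discretize(hesitant)
entropy_a = entropy_bits(softmax(confident))
entropy_b = entropy_bits(softmax(hesitant))

print("tokens:", token_a, token_b)
print("entropy (bits):", round(entropy_a, 3), round(entropy_b, 3))
```

Both states emit the same token, yet the second carries far more entropy. A downstream reader of the text sees no difference between them, while a reader of the vectors would.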
The same logic shows up one level higher, at the sentence. Meta's Large Concept Model reasons over whole-sentence embeddings in a language-agnostic space before decoding to any specific language, which produces more coherent long-form output than generating token by token Can reasoning happen at the sentence level instead of tokens?. Here the 'preserved information' is structure: the model plans in concept-space where meaning is continuous, then commits to words only at the end.
Why is there so much in an embedding to preserve? Because embeddings carry genuine semantic content, not just statistical shadows: clustering of static transformer embeddings recovers valence, concreteness, iconicity, even taboo, all before attention runs Do transformer static embeddings actually encode semantic meaning?. And the geometry is surprisingly structured: LLMs encode syntactic type and direction in the angle and distance between activation vectors, information that flattens out the moment you reduce a representation to a token string How do language models encode syntactic relations geometrically?.
But the corpus refuses to let this become a clean 'continuous beats discrete' story, and that tension is the interesting part. In recommendation, the opposite holds: VQ-Rec shows that mapping item text into discrete codes and *then* to embeddings transfers across domains better than direct text embeddings, precisely because discretization strips away text-specific bias and lets the system adapt per domain Can discrete codes transfer better than text embeddings? Can discretizing text embeddings improve recommendation transfer?. So discretization isn't pure loss — it can be useful compression that discards the *right* information. The lesson is that 'more information' is only an advantage when the extra signal is the signal you need.
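The discretize-then-embed pipeline described above can be sketched as follows. This is a minimal illustration of product quantization feeding a lookup table, not VQ-Rec's implementation: the codebooks are random stand-ins for centroids that would normally be learned (for example with k-means), the tables are random stand-ins for trained per-domain embeddings, and summing the looked-up rows is an assumed pooling choice. All names and sizes are hypothetical.

```python
import random

rng = random.Random(0)
N_SUB, N_CODES, SUB_DIM, OUT_DIM = 2, 4, 2, 3  # toy sizes, chosen for readability


def split(vec, n_sub):
    """Cut a vector into n_sub equal-length sub-vectors."""
    size = len(vec) // n_sub
    return [vec[i * size:(i + 1) * size] for i in range(n_sub)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(vec, codebooks):
    """Product quantization: nearest-centroid index for each sub-vector."""
    return [
        min(range(len(book)), key=lambda j: sq_dist(sub, book[j]))
        for sub, book in zip(split(vec, len(codebooks)), codebooks)
    ]


def random_table(rows, width):
    return [[rng.uniform(-1, 1) for _ in range(width)] for _ in range(rows)]


def item_embedding(codes, table):
    """Look up one row per code and sum them (pooling is an assumption)."""
    rows = [table[m][c] for m, c in enumerate(codes)]
    return [sum(col) for col in zip(*rows)]


# Random stand-ins for learned centroids, one codebook per sub-vector.
codebooks = [random_table(N_CODES, SUB_DIM) for _ in range(N_SUB)]

# Stand-in for the output of a frozen text encoder on one item description.
text_vec = [0.3, -0.7, 0.9, 0.1]
codes = encode(text_vec, codebooks)

# Each domain owns its own lookup tables; the codes themselves never change.
table_books = [random_table(N_CODES, OUT_DIM) for _ in range(N_SUB)]
table_movies = [random_table(N_CODES, OUT_DIM) for _ in range(N_SUB)]
emb_books = item_embedding(codes, table_books)
emb_movies = item_embedding(codes, table_movies)

print("codes:", codes)
print("books embedding:", emb_books)
print("movies embedding:", emb_movies)
```

The point of the design is visible in the last few lines: the item's text is reduced to a short list of integers once, and adapting to a new domain means training a new table rather than touching the text encoder.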
Two more notes sharpen where the embedding advantage has hard limits. Embedding dimension imposes a mathematical ceiling: for any fixed dimension d there is a maximum number of top-k document combinations that queries can retrieve, a limit that holds even for embeddings optimized directly on the test data and that shows up on trivially simple retrieval tasks Do embedding dimensions fundamentally limit retrievable document combinations?. A hidden vector holds more than a token, but not infinitely more: capacity is bounded. And richer representations carry liabilities discrete text doesn't: behavioral traits propagate between models of the same architecture through filtered data with no semantic relationship to the trait, riding statistical signatures embedded below the level of meaning Can language models transmit hidden behavioral traits through unrelated data?. The same density that preserves reasoning fidelity also smuggles things you can't see by reading the text.
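The dimensional ceiling mentioned above can be felt in a toy case. The sketch below is not the cited proof, only a brute-force illustration under assumed conditions: three hand-picked documents, dot-product scoring, and random queries. With 1-dimensional embeddings a query can only rank documents in ascending or descending order, so one of the three possible top-2 pairs can never be returned. With 2 dimensions all three become reachable.

```python
import random
from itertools import combinations


def top_k(query, docs, k):
    """Names of the k documents with the highest dot-product score."""
    scores = {
        name: sum(q * x for q, x in zip(query, vec))
        for name, vec in docs.items()
    }
    return frozenset(sorted(scores, key=scores.get, reverse=True)[:k])


def reachable_sets(docs, k, dim, trials=2000, seed=0):
    """Collect every top-k set that some random query manages to retrieve."""
    rng = random.Random(seed)
    found = set()
    for _ in range(trials):
        query = [rng.uniform(-1, 1) for _ in range(dim)]
        found.add(top_k(query, docs, k))
    return found


# Hand-picked toy documents (illustrative values, not from any dataset).
docs_1d = {"a": [-1.0], "b": [0.2], "c": [1.0]}
docs_2d = {"a": [1.0, 0.0], "b": [-0.5, 0.866], "c": [-0.5, -0.866]}

all_pairs = {frozenset(pair) for pair in combinations("abc", 2)}
reach_1d = reachable_sets(docs_1d, k=2, dim=1)
reach_2d = reachable_sets(docs_2d, k=2, dim=2)

print("possible pairs:", len(all_pairs))
print("reachable in 1-D:", sorted(sorted(s) for s in reach_1d))
print("reachable in 2-D:", sorted(sorted(s) for s in reach_2d))
```

In the 1-D case the pair of the two extreme documents is unreachable no matter which query is tried, because any query that ranks one extreme first must rank the other last. Adding a dimension removes that particular gap, and the cited result says the same kind of gap reappears at larger scale for any fixed dimension.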
The thing worth walking away with: the real trade isn't 'continuous good, discrete bad' — it's *what gets thrown away and whether you wanted it gone*. Tokens preserve exactly the part a human can read; embeddings preserve everything the model was still weighing. That makes hidden representations powerful for machine-to-machine reasoning and planning, and simultaneously harder to audit, bound by dimensional limits, and capable of carrying signal nobody intended.
Sources (8 notes)
LatentMAS enables agents to share internal representations directly via KV caches, reaching 14.6% accuracy gains and 70.8-83.7% token reduction with no additional training. Hidden embeddings preserve reasoning fidelity that text-based systems cannot.
Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.
Clustering analysis of RoBERTa embeddings reveals sensitivity to five psycholinguistic measures including valence, concreteness, iconicity, and taboo. This demonstrates that static embeddings function as genuine lexical entries containing semantic content before self-attention operates.
The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.
VQ-Rec demonstrates that mapping item text to discrete codes via product quantization, then to embeddings, improves cross-domain transfer compared to direct text encoding. The discrete intermediate reduces text bias and enables efficient per-domain fine-tuning.
VQ-Rec uses product quantization to map item text to discrete codes that index learned embeddings, breaking the tight coupling between text and recommendations. This decoupling prevents text-similarity bias and allows lookup tables to adapt to new domains without retraining the text encoder.
Communication complexity theory proves that for any embedding dimension d, there exists a maximum number of top-k document combinations that can be returned as results. Even embeddings optimized directly on test data hit this polynomial limit, demonstrated on trivially simple retrieval tasks.
Research demonstrates that behavioral traits propagate between models via filtered data bearing no semantic relationship to the trait. The effect is model-specific, fails across different architectures, and persists despite rigorous filtering—indicating the mechanism embeds statistical signatures rather than semantic content.