INQUIRING LINE

How do hidden embeddings preserve more information than discrete tokens?

This explores why keeping reasoning in continuous hidden vectors carries richer signal than collapsing it into discrete text tokens — and where that advantage breaks down.


The cleanest demonstration in the corpus is multi-agent collaboration: when agents pass their internal representations directly to each other instead of writing out text and re-reading it, they exchange information losslessly, gaining up to 14.6% accuracy while cutting token use by roughly 71–84% (see "Can agents share thoughts without converting them to text?"). The intuition is that a hidden state is a high-dimensional point: it holds gradations, uncertainty, and partially formed alternatives that a word, once chosen, throws away. Discretizing into tokens is a bottleneck; every step from a rich vector to a single token discards everything the model considered but didn't say.
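A minimal sketch of that bottleneck, using a next-token distribution as a stand-in for the hidden state (the real hidden vector is richer still). The vocabulary, the scores, and the function names are hypothetical and chosen only for illustration: two very different internal states collapse to the same emitted token.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy of a distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def discretize(logits, vocab):
    """Collapse a distribution to its single most likely token."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return vocab[best]

# Hypothetical toy vocabulary and scores (assumptions, not from any model).
VOCAB = ["yes", "no", "maybe"]
confident = [5.0, 0.0, 0.0]  # nearly certain of "yes"
torn = [1.0, 0.9, 0.8]       # still weighing all three options

for name, logits in [("confident", confident), ("torn", torn)]:
    probs = softmax(logits)
    # Same token comes out either way; only the entropy reveals the difference.
    print(name, discretize(logits, VOCAB), round(entropy_bits(probs), 2))
```

Both states emit "yes", so a reader of the text cannot tell them apart, while the entropy of the underlying distribution differs sharply. That difference is the kind of signal a latent hand-off could keep and a token hand-off drops.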

The same logic shows up one level higher, at the sentence. Meta's Large Concept Model reasons over whole-sentence embeddings in a language-agnostic space before decoding to any specific language, which produces more coherent long-form output than generating token by token (see "Can reasoning happen at the sentence level instead of tokens?"). Here the 'preserved information' is structure: the model plans in concept-space where meaning is continuous, then commits to words only at the end. Why is there so much in an embedding to preserve? Because embeddings carry genuine semantic content, not just statistical shadows: clustering of static RoBERTa embeddings recovers valence, concreteness, iconicity, even taboo, all before attention runs (see "Do transformer static embeddings actually encode semantic meaning?"). And the geometry is surprisingly structured: LLMs encode syntactic type and direction in the angle and distance between activation vectors, information that flattens out the moment you reduce a representation to a token string (see "How do language models encode syntactic relations geometrically?").
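To make "angle and distance" concrete, here is a toy 2-D sketch. It assumes the relation between two words is read off the vector from one activation to the other; that reading, the coordinates, and the name `polar_relation` are illustrative assumptions, not the Polar Probe's actual formulation.

```python
import math

def polar_relation(head, dependent):
    """Return (distance, angle in degrees) of the vector from head to dependent."""
    dx = dependent[0] - head[0]
    dy = dependent[1] - head[1]
    return math.hypot(dx, dy), math.degrees(math.atan2(dy, dx))

# Hypothetical 2-D projections of three word activations.
verb = (0.0, 0.0)
subject = (1.0, 0.0)
obj = (0.0, 1.0)

dist_subj, angle_subj = polar_relation(verb, subject)
dist_obj, angle_obj = polar_relation(verb, obj)

# Both dependents sit at the same distance from the verb,
# so a distance-only probe cannot separate the two relations.
print(dist_subj, dist_obj)
# The angle differs, which is where relation type could be encoded.
print(angle_subj, angle_obj)
```

In this toy setup distance alone says only that some relation exists; the angle is what distinguishes one relation from another, and neither quantity survives once the vectors are replaced by a token string.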

But the corpus refuses to let this become a clean 'continuous beats discrete' story, and that tension is the interesting part. In recommendation, the opposite holds: VQ-Rec shows that mapping item text into discrete codes and *then* to embeddings transfers across domains better than direct text embeddings, precisely because discretization strips away text-specific bias and lets the system adapt per domain (see "Can discrete codes transfer better than text embeddings?" and "Can discretizing text embeddings improve recommendation transfer?"). So discretization isn't pure loss; it can be useful compression that discards the *right* information. The lesson is that 'more information' is only an advantage when the extra signal is the signal you need.
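The text-to-codes-to-embeddings path can be sketched with a tiny product quantizer. Everything below is a hypothetical illustration: the codebooks, the lookup tables, the vector sizes, and pooling by summation are assumptions, not VQ-Rec's actual configuration.

```python
def nearest(sub, centroids):
    """Index of the centroid closest to `sub` (squared Euclidean distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(sub, c))
    return min(range(len(centroids)), key=lambda i: dist(centroids[i]))

def pq_encode(vec, codebooks):
    """Split `vec` into len(codebooks) equal chunks and map each to a code."""
    width = len(vec) // len(codebooks)
    chunks = [vec[i * width:(i + 1) * width] for i in range(len(codebooks))]
    return [nearest(chunk, book) for chunk, book in zip(chunks, codebooks)]

def lookup(codes, tables):
    """Sum the embeddings that the codes index in per-domain lookup tables."""
    rows = [tables[i][code] for i, code in enumerate(codes)]
    return [sum(col) for col in zip(*rows)]

# Hypothetical codebooks: one per half of a 4-d text embedding.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
# Hypothetical learned tables for one domain; a new domain would swap
# these in while the text encoder and codebooks stay fixed.
TABLES_DOMAIN_A = [
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
]

text_embedding = [0.9, 0.8, 0.1, 0.7]
similar_text_embedding = [1.0, 0.9, 0.0, 0.8]

codes = pq_encode(text_embedding, CODEBOOKS)
item_embedding = lookup(codes, TABLES_DOMAIN_A)
print(codes, item_embedding)
```

Two slightly different text embeddings land on the same codes, so the fine-grained text detail is dropped on purpose. What the downstream model sees is whatever the domain's table stores for those codes.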

Two more notes sharpen where the embedding advantage has hard limits. Embedding dimension imposes a mathematical ceiling: for any fixed dimension there is a maximum number of top-k document combinations that embeddings of that size can return, a limit that shows up even on trivially simple retrieval tasks and even when the embeddings are optimized directly on the test data (see "Do embedding dimensions fundamentally limit retrievable document combinations?"). A hidden vector holds more than a token, but not infinitely more; capacity is bounded. And richer representations carry liabilities discrete text doesn't: behavioral traits propagate between models of the same architecture through filtered data with no semantic relationship to the trait, riding statistical signatures embedded below the level of meaning (see "Can language models transmit hidden behavioral traits through unrelated data?"). The same density that preserves reasoning fidelity also smuggles things you can't see by reading the text.

The thing worth walking away with: the real trade isn't 'continuous good, discrete bad' — it's *what gets thrown away and whether you wanted it gone*. Tokens preserve exactly the part a human can read; embeddings preserve everything the model was still weighing. That makes hidden representations powerful for machine-to-machine reasoning and planning, and simultaneously harder to audit, bound by dimensional limits, and capable of carrying signal nobody intended.


Sources (8 notes)

Can agents share thoughts without converting them to text?

LatentMAS enables agents to share internal representations directly via KV caches, reaching 14.6% accuracy gains and 70.8-83.7% token reduction with no additional training. Hidden embeddings preserve reasoning fidelity that text-based systems cannot.

Can reasoning happen at the sentence level instead of tokens?

Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.

Do transformer static embeddings actually encode semantic meaning?

Clustering analysis of RoBERTa embeddings reveals sensitivity to five psycholinguistic measures including valence, concreteness, iconicity, and taboo. This demonstrates that static embeddings function as genuine lexical entries containing semantic content before self-attention operates.

How do language models encode syntactic relations geometrically?

The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.

Can discrete codes transfer better than text embeddings?

VQ-Rec demonstrates that mapping item text to discrete codes via product quantization, then to embeddings, improves cross-domain transfer compared to direct text encoding. The discrete intermediate reduces text bias and enables efficient per-domain fine-tuning.

Can discretizing text embeddings improve recommendation transfer?

VQ-Rec uses product quantization to map item text to discrete codes that index learned embeddings, breaking the tight coupling between text and recommendations. This decoupling prevents text-similarity bias and allows lookup tables to adapt to new domains without retraining the text encoder.

Do embedding dimensions fundamentally limit retrievable document combinations?

Communication complexity theory proves that for any embedding dimension d, there exists a maximum number of top-k document combinations that can be returned as results. Even embeddings optimized directly on test data hit this polynomial limit, demonstrated on trivially simple retrieval tasks.

Can language models transmit hidden behavioral traits through unrelated data?

Research demonstrates that behavioral traits propagate between models via filtered data bearing no semantic relationship to the trait. The effect is model-specific, fails across different architectures, and persists despite rigorous filtering—indicating the mechanism embeds statistical signatures rather than semantic content.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about information preservation in neural representations. The precise question: do hidden embeddings genuinely preserve more task-relevant information than discrete tokens, or does that advantage depend on regime, architecture, and what 'information' means?

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2025. Key constraints:
• Multi-agent collaboration via latent exchange cuts token use by 70.8–83.7% vs. token-passing while improving accuracy by up to 14.6% (2025).
• Transformer embeddings encode semantic structure (valence, concreteness, iconicity, taboo) that token sequences flatten; polar geometry recovers syntactic type and direction from angles and distances between activations (2024–2025).
• BUT: discretization (text → codes → embeddings) transfers better across domains in recommendation than direct embeddings, because it discards text-specific bias (2022–2024).
• Embedding capacity has hard mathematical limits on retrieval tasks independent of model size (2025).
• Hidden representations silently transmit behavioral traits through semantically unrelated data — richer signal also smuggles unintended correlations (2025).

Anchor papers (verify; mind their dates):
• arXiv:2511.20639 — Latent Collaboration in Multi-Agent Systems (2025)
• arXiv:2412.05571 — Polar coordinate system in LLM activations (2024)
• arXiv:2508.21038 — Theoretical Limitations of Embedding-Based Retrieval (2025)
• arXiv:2507.14805 — Subliminal Learning: behavioral trait transmission (2025)

Your task:
(1) RE-TEST each constraint. For latent multi-agent gains, multi-model collaboration, and reasoning in continuous space: have newer orchestration methods (memory, caching, hierarchical delegation) or inference techniques since reduced or eliminated the token bottleneck in text-based systems? Has the embedding capacity ceiling been breached or circumvented? Where does the advantage still hold cleanly?
(2) Surface the strongest work from the last 6 months showing *either* that discrete tokens recover embedded information losslessly under certain conditions, *or* that the trade-off favors tokens for specific tasks (safety, control, efficiency).
(3) Propose two research questions assuming the regime may have shifted: (a) Under what architectural or task constraints is a hybrid—discrete scaffolding + continuous reasoning—optimal? (b) Can you train systems to discretize *selectively*, preserving richness only where it matters?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
