INQUIRING LINE


Discrete tokens don't hold onto information better than smooth embeddings — they intentionally throw it away, and that's exactly why they sometimes work better.

Do discrete tokenized modalities preserve information better than continuous embeddings?

This explores whether chopping a modality (text, image, audio) into discrete tokens holds onto more of the original signal than mapping it into a smooth continuous embedding space — and the corpus suggests that's the wrong axis to judge them on.

The honest answer from the corpus is that discretization usually *throws information away on purpose*, and that loss is often the point. When VQ-Rec maps an item's text through discrete codes before turning it into an embedding, it deliberately compresses away fine-grained textual detail, and that is exactly why it transfers across domains better than a direct text embedding: the discrete bottleneck strips out text-similarity bias that would otherwise overfit to one domain's vocabulary (see "Can discrete codes transfer better than text embeddings?" and "Can discretizing text embeddings improve recommendation transfer?"). So the win isn't preservation; it's useful forgetting.
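To make the "discrete bottleneck" concrete, here is a minimal sketch of product quantization followed by a code-indexed lookup, the general shape the VQ-Rec notes describe. It is not VQ-Rec's implementation: the helper names (`train_codebooks`, `encode`, `item_embedding`), the toy random "text embeddings", the plain k-means fitting, and the choice to sum the looked-up rows are all assumptions made for illustration.

```python
import random

def nearest(chunk, centroids):
    """Index of the centroid closest to `chunk` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(chunk, centroids[k])),
    )

def train_codebooks(vectors, n_subspaces, n_codes, iters=10, seed=0):
    """Fit one small k-means codebook per sub-vector (product quantization)."""
    rng = random.Random(seed)
    sub = len(vectors[0]) // n_subspaces
    codebooks = []
    for s in range(n_subspaces):
        chunks = [v[s * sub:(s + 1) * sub] for v in vectors]
        centroids = [list(c) for c in rng.sample(chunks, n_codes)]
        for _ in range(iters):
            buckets = [[] for _ in range(n_codes)]
            for c in chunks:
                buckets[nearest(c, centroids)].append(c)
            for k, members in enumerate(buckets):
                if members:  # keep the old centroid if a bucket is empty
                    centroids[k] = [sum(col) / len(members) for col in zip(*members)]
        codebooks.append(centroids)
    return codebooks

def encode(vector, codebooks):
    """Map a continuous vector to one discrete code per subspace."""
    sub = len(vector) // len(codebooks)
    return [nearest(vector[s * sub:(s + 1) * sub], cb)
            for s, cb in enumerate(codebooks)]

def item_embedding(codes, tables):
    """Look up one row per code and sum them (pooling by sum is an assumption)."""
    out = [0.0] * len(tables[0][0])
    for s, code in enumerate(codes):
        for i, x in enumerate(tables[s][code]):
            out[i] += x
    return out

# Toy data: 50 hypothetical 4-dimensional "text embeddings".
rng = random.Random(1)
text_vecs = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(50)]
codebooks = train_codebooks(text_vecs, n_subspaces=2, n_codes=4)

# Lookup tables stand in for per-domain embeddings; here they are random
# placeholders rather than learned parameters.
tables = [[[rng.gauss(0, 1) for _ in range(3)] for _ in range(4)] for _ in range(2)]

codes = encode(text_vecs[0], codebooks)
emb = item_embedding(codes, tables)
print(codes, emb)
```

The point of the sketch is the information loss: every item that falls into the same cells gets the same codes, so fine-grained textual differences vanish, and only the lookup tables would need to change to adapt to a new domain.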

Where discrete tokens genuinely shine is composition and unification, not fidelity. A model like MIO trains on mixed discrete tokens across four modalities and gains abilities (interleaved video-text output, chain-of-visual-thought reasoning) that dual-modal, encoder-based systems cannot produce, precisely because a shared discrete vocabulary lets one autoregressive model treat everything as the same kind of symbol (see "Can a single model generate all modalities without external encoders?"). Discreteness buys you a common substrate, not richer detail.

The continuous side, meanwhile, wins on the dimension the question doesn't ask about: learning efficiency. There's a formal sample-complexity result that predicting your own continuous latents recovers compositional structure with a number of samples that stays constant in hierarchy depth, whereas token-level prediction needs exponentially many, because same-level latents are far more correlated than raw tokens are (see "Why is predicting latents more sample-efficient than tokens?"). And reasoning can happen entirely in a language-agnostic sentence-embedding space before any tokens get decoded at all (see "Can reasoning happen at the sentence level instead of tokens?"). So if anything, continuous representations preserve *relational* structure that token boundaries fragment.

The deeper twist is that not all tokens carry equal information regardless of format. In reasoning chains, likelihood-preserving pruning shows that tokens differ by function: symbolic-computation tokens are preferentially preserved while grammar and meta-discourse get pruned first (see "Which tokens in reasoning chains actually matter most?"). And only about 20% of tokens, the high-entropy 'forking points,' actually drive learning (see "Do high-entropy tokens drive reasoning model improvements?"). Information density is wildly uneven inside the token stream itself, which means 'discrete vs. continuous' is less important than *which* parts of the signal a representation chooses to keep sharp.
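As a rough illustration of what "high-entropy forking points" means in practice, the sketch below scores each position by the Shannon entropy of its next-token distribution and keeps the top 20%. The function names, the toy distributions, and the top-fraction selection rule are assumptions for illustration; the cited work's exact selection procedure may differ.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def forking_mask(distributions, keep_fraction=0.2):
    """Mark the highest-entropy positions, keeping roughly `keep_fraction` of them."""
    ents = [entropy(p) for p in distributions]
    n_keep = max(1, round(keep_fraction * len(ents)))
    top = sorted(range(len(ents)), key=lambda i: ents[i], reverse=True)[:n_keep]
    keep = set(top)
    return [i in keep for i in range(len(ents))]

# Five toy positions over a 4-token vocabulary. Only position 1 is a
# genuine "fork": the model is undecided between all four options.
distributions = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.70, 0.20, 0.05, 0.05],
]
mask = forking_mask(distributions, keep_fraction=0.2)
print(mask)  # [False, True, False, False, False]
```

A training loop that wanted to update only on forking tokens could multiply its per-token loss by this mask before averaging; that wiring is left out here because it depends on the training framework.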

The thing you didn't know you wanted to know: both formats sit downstream of a bigger loss. Text itself is a lossy human abstraction that already strips physics, geometry, and causality before any tokenizer or embedder touches it (see "Are text-only language models fundamentally limited by abstraction?"), and a plain natural-language *description* of an image can bridge a recognition task better than raw embedding similarity does (see "Can describing images in text improve zero-shot recognition?"). So the real question isn't whether discrete tokens preserve more than continuous embeddings, but which abstraction keeps the information *your task actually needs*. Both are deliberately, usefully lossy.

Sources (9 notes)

Can discrete codes transfer better than text embeddings?

VQ-Rec demonstrates that mapping item text to discrete codes via product quantization, then to embeddings, improves cross-domain transfer compared to direct text encoding. The discrete intermediate reduces text bias and enables efficient per-domain fine-tuning.

Can discretizing text embeddings improve recommendation transfer?

VQ-Rec uses product quantization to map item text to discrete codes that index learned embeddings, breaking the tight coupling between text and recommendations. This decoupling prevents text-similarity bias and allows lookup tables to adapt to new domains without retraining the text encoder.

Can a single model generate all modalities without external encoders?

MIO trains a foundation model on mixed discrete tokens across four modalities with causal modeling, achieving end-to-end generation in both directions. The model emergently produces interleaved video-text output and chain-of-visual-thought reasoning that dual-modal encoder-based systems cannot.

Why is predicting latents more sample-efficient than tokens?

A formal sample-complexity analysis proves latent-level self-supervision (data2vec/JEPA style) recovers compositional structure with samples constant in hierarchy depth, while token-level learning requires exponential samples—because same-level latents are far more correlated than raw tokens.

Can reasoning happen at the sentence level instead of tokens?

Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Are text-only language models fundamentally limited by abstraction?

Text strips the physics, geometry, and causality present in reality, forcing language models to manipulate symbols without grounding in their source dynamics. This creates predictable failure modes in physical, geometric, and causal reasoning that multimodal training could address.

Can describing images in text improve zero-shot recognition?

SignRAG demonstrates that describing an unknown image via vision-language model, then retrieving known designs from a text-indexed database, eliminates the need for recognition model training. Natural-language description bridges the visual-reference gap better than direct embedding similarity.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher evaluating whether discrete tokens preserve information better than continuous embeddings — a question that may have shifted under recent capability and training advances. Here's what a curated library found (2022–2026, dated claims):

— Discretization *throws information away deliberately*: VQ-Rec compresses text through discrete codes to strip domain-specific vocabulary bias, enabling transfer; the win is useful forgetting, not preservation (2022–2024).
— Discrete tokens unify modalities under one autoregressive substrate (MIO, 2024), but continuous latents learn compositional structure exponentially faster than predicting discrete tokens (2025).
— Reasoning happens efficiently in continuous sentence-embedding space before decoding; only ~20% of tokens (high-entropy forking points) drive learning; token function matters more than format (2025–2026).
— Text itself is lossy; natural-language VLM descriptions sometimes outperform raw embeddings on zero-shot tasks (2024).
— Recent work questions whether "zero-shot" is coherent without exponential pretraining data (2024).

Anchor papers (verify dates): arXiv:2210.12316 (2022), arXiv:2409.17692 (2024), arXiv:2412.06769 (2024), arXiv:2506.01939 (2025).

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above, determine whether newer models (o1, o3, Claude 4, Grok-3), training methods (DPO, RL-from-latents, continuous reasoning), tooling (token-budget optimizers, mixed-modality caching), or evaluation protocols have since relaxed or overturned it. Separate the durable question ("which representation fits which task?") from perishable limits ("discrete tokens can't compose"; "continuous reps are slower"). Cite what changed it.
(2) Surface the strongest *contradicting* or *superseding* work from the last ~6 months — especially anything claiming discrete tokenization is *necessary* for generalization, or that continuous latent reasoning has hit a ceiling.
(3) Propose 2 research questions that assume the regime may have shifted: e.g., "Does end-to-end continuous reasoning + on-demand discretization (Gumbel-softmax annealing) now match or exceed token-based composition?" and "Does task-specific entropy pruning (keeping only the ~20% forking tokens) eliminate format sensitivity?"

Cite arXiv IDs; flag anything you cannot ground.
