INQUIRING LINE

Does latent manipulation outperform token-level prediction for efficiency?

This explores whether models that 'think' in their own internal representation space — rather than predicting one token at a time — actually learn faster and run leaner, and where that advantage holds or breaks.


The question is whether latent manipulation (reasoning over a model's own internal vectors or concept embeddings) beats next-token prediction on efficiency, and the corpus has a surprisingly strong, formal answer for part of it. There's a proof that predicting your own latents is *exponentially* more sample-efficient than predicting tokens (see 'Why is predicting latents more sample-efficient than tokens?'). The reason is intuitive once named: same-level latent representations are far more correlated with each other than raw tokens are, so latent-level self-supervision recovers compositional, hierarchical structure with a number of samples that stays constant in hierarchy depth, while token-level learning needs exponentially many to see the same structure. That's the cleanest 'yes' in the collection.

But efficiency comes in flavors (sample efficiency, parameter efficiency, and what you might call representational efficiency), and the corpus pulls them apart. Latent-Thought Language Models add scaling dimensions that have nothing to do with parameter count: a fast inner loop learns per-input latent vectors while a slow outer loop learns the decoder, yielding better sample *and* parameter efficiency than scaling weights alone (see 'Can latent thought vectors scale language models beyond parameters?'). Meta's Large Concept Models push the same idea up a level, reasoning over whole-sentence embeddings in a language-agnostic space before decoding (see 'Can reasoning happen at the sentence level instead of tokens?'). And looped architectures get reasoning gains by re-applying the same layers in recurrent depth rather than adding width, with recursion buying what scale can't (see 'Can models learn by looping instead of growing larger?'). The throughline: working in latent space lets you decouple capability from raw token-by-token, parameter-by-parameter growth.
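The fast/slow split described for Latent-Thought Language Models can be sketched in a few lines. This is a toy illustration only, not the method from the cited note: it assumes a scalar "decoder" weight `w`, a scalar latent `z` per input, a squared reconstruction loss, and a small quadratic prior on `z`. The names `infer_latent`, `train`, `fast_lr`, `slow_lr`, and `prior` are all hypothetical.

```python
def infer_latent(x, w, steps=20, fast_lr=0.1, prior=0.1):
    """Fast inner loop: fit a per-input latent z with the decoder w frozen."""
    z = 0.0
    for _ in range(steps):
        # Gradient of 0.5*(w*z - x)**2 + 0.5*prior*z**2 with respect to z.
        grad_z = w * (w * z - x) + prior * z
        z -= fast_lr * grad_z
    return z


def train(data, epochs=200, slow_lr=0.01):
    """Slow outer loop: one small decoder update per pass over the data."""
    w = 0.5  # shared decoder parameter (assumed starting value)
    history = []
    for _ in range(epochs):
        grad_w = 0.0
        loss = 0.0
        for x in data:
            z = infer_latent(x, w)       # latent is re-inferred for every input
            err = w * z - x
            loss += 0.5 * err * err
            grad_w += err * z            # z is treated as fixed (a simplification)
        w -= slow_lr * grad_w / len(data)
        history.append(loss / len(data))
    return w, history


data = [1.0, -2.0, 0.5, 3.0]
w, history = train(data)
print(f"decoder weight after training: {w:.3f}")
print(f"mean reconstruction loss: {history[0]:.4f} -> {history[-1]:.4f}")
```

The point of the sketch is the two separate knobs: the inner loop's step count and the latent's size can grow without adding a single decoder parameter, which is the sense in which the latent side is its own scaling dimension.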

Here's the twist that makes this more than a cheerleading exercise: token-level prediction isn't uniformly wasteful, it's *unevenly* wasteful. Only about 20% of tokens are high-entropy 'forking points' that actually carry the learning signal; train on those alone and you match or exceed full-gradient RLVR performance (see 'Do high-entropy tokens drive reasoning model improvements?'). Likelihood-preserving pruning shows that models effectively rank their own tokens by functional importance, preserving symbolic-computation tokens while grammar and meta-discourse filler get pruned first (see 'Which tokens in reasoning chains actually matter most?'). So the real story may not be 'latents beat tokens' but 'most tokens are dead weight, and latent methods are one way to skip them.' Strikingly, transformers can already do something latent-like internally: models trained with hidden chain-of-thought tokens compute correct answers in their earliest layers, then suppress them in the final layers to emit format-compliant filler tokens (see 'Do transformers hide reasoning before producing filler tokens?'). In that setting the efficient computation is happening in latent space; the token layer is partly theater.
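To make the 'forking points' idea concrete, here is a minimal sketch of selecting the highest-entropy token positions and averaging a loss over only those. It assumes you already have a next-token probability distribution per position; the function names, the `keep_fraction` parameter, and the toy distributions are illustrative assumptions, not the procedure from the cited note.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def forking_mask(token_dists, keep_fraction=0.2):
    """Mark the top `keep_fraction` of positions by entropy.

    Ties at the threshold are all kept, so slightly more than the
    requested fraction can be selected.
    """
    ents = [entropy(d) for d in token_dists]
    k = max(1, round(keep_fraction * len(ents)))
    threshold = sorted(ents, reverse=True)[k - 1]
    return [e >= threshold for e in ents]


def masked_mean(per_token_losses, mask):
    """Average the loss over selected positions only."""
    kept = [loss for loss, keep in zip(per_token_losses, mask) if keep]
    return sum(kept) / len(kept)


# Five positions: four near-deterministic, one genuinely uncertain.
dists = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],   # the only real "fork"
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.80, 0.10, 0.05, 0.05],
]
mask = forking_mask(dists)
print(mask)
```

In a training loop, the mask would gate which positions contribute gradient; everything else is skipped.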

Where latents *lose*: efficiency isn't the only axis. Transformers provably beat fixed-size-latent state-space models at copying and retrieving from context, precisely because a compressed latent state can't hold arbitrarily long sequences (see 'Can state-space models match transformers at copying and retrieval?'). That's the catch with manipulating a bounded internal representation: it's sample-efficient for learning structure but lossy for verbatim recall. So the honest synthesis is: latent prediction wins decisively on sample efficiency for learning compositional structure, adds parameter-efficiency dimensions token-scaling can't reach, but trades away exact retrieval, which is exactly why the frontier looks hybrid (latent reasoning, token decoding) rather than one replacing the other.
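The intuition behind the copying limit is a plain counting argument, sketched below. This is an illustration rather than the cited proof: it assumes the latent state can be summarized as a fixed budget of bits (`state_bits` is a hypothetical parameter) and that every string over the vocabulary must be reproducible exactly. A state with b bits takes at most 2^b distinct values, while there are V^n strings of length n, so exact copying needs n * log2(V) <= b.

```python
import math


def bits_to_copy(length, vocab_size):
    """Bits needed to distinguish every string of `length` tokens."""
    return length * math.log2(vocab_size)


def max_copy_length(state_bits, vocab_size):
    """Longest string a fixed-size state can copy exactly, by counting alone."""
    return math.floor(state_bits / math.log2(vocab_size))


state_bits = 1024  # assumed size of the compressed latent state
for vocab_size in (2, 256):
    limit = max_copy_length(state_bits, vocab_size)
    print(f"vocab={vocab_size}: exact copying caps out at {limit} tokens")
```

A model that can attend back over its whole context is not subject to this particular cap, since what it can look up grows with the input instead of being squeezed into a fixed state.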


Sources (8 notes)

Why is predicting latents more sample-efficient than tokens?

A formal sample-complexity analysis proves that latent-level self-supervision (data2vec/JEPA style) recovers compositional structure with a number of samples constant in hierarchy depth, while token-level learning requires exponentially many, because same-level latents are far more correlated than raw tokens.

Can latent thought vectors scale language models beyond parameters?

Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.

Can reasoning happen at the sentence level instead of tokens?

Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.

Can models learn by looping instead of growing larger?

Models that re-apply layers in recurrent depth outperform larger feedforward networks on reasoning tasks. This works because recursion enables state tracking and compositional generalization that parameter scaling alone cannot achieve, with convergence signals providing natural halting.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Can state-space models match transformers at copying and retrieval?

Two-layer transformers can copy exponentially long strings while state-space models are fundamentally limited by their fixed-size latent state. Empirically, transformers dramatically outperform SSMs at copying and context retrieval in both synthetic and pretrained settings.
