INQUIRING LINE

How much explicit verbal signal must latent chains retain to perform well?

This explores how much human-readable verbal content a model's reasoning steps actually need — whether latent or compressed reasoning can drop most of the words and still think, or whether the explicit verbal signal is doing the real work.


The corpus answers from two directions, and they pull apart in an interesting way. From the compression side, the verbal signal turns out to be mostly disposable: Chain of Draft matches full chain-of-thought accuracy on arithmetic, symbolic, and commonsense tasks while keeping only 7.6% of the tokens, meaning the other 92.4% of a normal reasoning trace is style and documentation, not computation (see 'Can minimal reasoning chains match full explanations?'). On those tasks a chain can throw away almost all of its explicit verbiage and lose no accuracy: the words were never the load-bearing part.
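To make the token arithmetic concrete, here is a small Python sketch. The 7.6% retention figure is the one quoted above; the 200-token trace length and the helper name `draft_budget` are hypothetical, chosen only for illustration.

```python
# Fraction of chain-of-thought tokens that Chain of Draft keeps (figure quoted above).
KEPT_FRACTION = 0.076


def draft_budget(full_trace_tokens: int, kept_fraction: float = KEPT_FRACTION) -> tuple[int, int]:
    """Return (tokens kept, tokens dropped) for a trace of the given length."""
    kept = round(full_trace_tokens * kept_fraction)
    return kept, full_trace_tokens - kept


# Hypothetical 200-token reasoning trace, used only as an example length.
kept, dropped = draft_budget(200)
print(kept, dropped)
```

At that ratio, a trace of a couple of hundred tokens shrinks to roughly fifteen, which is about the length of a terse equation-style draft.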

But 'how few words' is the wrong frame; what matters is what the remaining signal is doing. Research on what makes chain-of-thought work at all finds that training format shapes reasoning strategy roughly 7.5× more than domain does, that the position of a demonstration can swing accuracy by 20%, and that invalid reasoning steps can work as well as valid ones (see 'What makes chain-of-thought reasoning actually work?'). The verbal signal isn't valued for its meaning; it is a scaffold that keeps the model's generation on a productive track. That reframes the question: latent chains don't need verbal content for its semantics, they need whatever keeps the trajectory structured.

This is exactly where fully latent reasoning gets into trouble, and the corpus is sharp about why. When you remove the verbal anchor entirely and supervise only on the final answer, latent chain-of-thought fails in two ways at once: gradients attenuate along the latent steps, so the intermediate reasoning receives little training signal, and the latent space drifts because nothing grounds it semantically (see 'Why does latent chain-of-thought fail so easily in training?'). The lesson is that the explicit verbal signal was quietly providing two services, a dense step-by-step training signal and a semantic tether, and if you drop the words you have to replace both, not just one.
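The attenuation half of that failure can be illustrated with a toy calculation. The sketch below is not taken from the cited note: it assumes a scalar latent chain h[t+1] = scale * h[t] with 0 < scale < 1, so each step multiplies the backpropagated gradient by `scale`. The function names and the values 8 and 0.5 are hypothetical.

```python
def outcome_only_signal(step: int, total_steps: int, scale: float) -> float:
    """Gradient reaching h[step] when only the final state h[total_steps] is supervised."""
    return scale ** (total_steps - step)


def dense_signal(step: int, total_steps: int, scale: float) -> float:
    """Gradient reaching h[step] when every state h[k], k >= step, has its own unit-gradient loss."""
    return sum(scale ** (k - step) for k in range(step, total_steps + 1))


TOTAL_STEPS = 8  # hypothetical number of latent steps
SCALE = 0.5      # hypothetical per-step Jacobian factor, assumed to lie in (0, 1)

for step in (0, 4, 8):
    print(step,
          outcome_only_signal(step, TOTAL_STEPS, SCALE),
          dense_signal(step, TOTAL_STEPS, SCALE))
```

In this toy setting the earliest step receives a gradient of 0.5^8 under outcome-only supervision, while a per-step loss keeps the signal at every step at or above 1. Real latent reasoners are nonlinear and high-dimensional, so this shows only the direction of the effect, not its size.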

So the leading-edge work isn't about retaining verbal signal; it's about manufacturing substitutes for what the words provided. NF-CoT models continuous thoughts as an autoregressive normalizing flow, which recovers an exact likelihood for non-verbal reasoning steps and restores the ability to sample, score, and refine trajectories that pure latent vectors had lost, buying back the tractability that text gave you for free (see 'Can continuous thoughts have tractable likelihoods for sampling and scoring?'). Latent-thought language models go further, treating the latent trace as its own scaling dimension, with a fast inner loop that fits the per-problem thought and a slow outer loop that learns the decoder (see 'Can latent thought vectors scale language models beyond parameters?'). Both are ways of keeping reasoning structured without keeping it verbal.
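Why a normalizing flow yields an exact likelihood can be seen in one dimension. The sketch below is not the NF-CoT architecture: it assumes a single affine flow x = shift + exp(log_scale) * z over a standard normal base, and every name in it is hypothetical. It shows only the change-of-variables rule that makes both sampling and scoring tractable.

```python
import math
import random


def standard_normal_logpdf(z: float) -> float:
    """Log-density of the base distribution N(0, 1)."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def flow_logpdf(x: float, shift: float, log_scale: float) -> float:
    """Exact log-likelihood of x under the flow x = shift + exp(log_scale) * z."""
    z = (x - shift) * math.exp(-log_scale)  # invert the flow
    # Change of variables: log p(x) = log p(z) + log|dz/dx|, with dz/dx = exp(-log_scale).
    return standard_normal_logpdf(z) - log_scale


def flow_sample(shift: float, log_scale: float, rng: random.Random) -> float:
    """Draw z from the base distribution and push it through the flow."""
    return shift + math.exp(log_scale) * rng.gauss(0.0, 1.0)


rng = random.Random(0)
thought = flow_sample(shift=1.0, log_scale=0.5, rng=rng)  # sample a continuous "thought"
score = flow_logpdf(thought, shift=1.0, log_scale=0.5)    # score it exactly
print(thought, score)
```

Because the transform is invertible with a known Jacobian, the same object supports drawing a sample and assigning it an exact log-probability, which is the property a policy-gradient or trajectory-scoring method needs.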

The surprise worth leaving with: the model itself already does something like silent reasoning. On hard, unfamiliar tasks, hidden states sparsify in a localized, systematic way that correlates with reasoning load and stabilizes performance under distribution shift, a case of internal computation reorganizing without any verbal trace at all (see 'Do language models sparsify their activations under difficult tasks?'). Put it together and the answer to 'how much explicit verbal signal must latent chains retain?' is: almost none, in principle, but only if you separately supply the dense training signal and the semantic grounding that the words were silently carrying. Strip the words and forget what they were doing, and the chain quietly falls apart.
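For readers who want to check the sparsification claim on their own activations, one simple proxy is the fraction of near-zero entries in a hidden-state vector. The cited note does not say which sparsity metric it uses, so the function name, the threshold, and the example vectors below are all assumptions made for illustration.

```python
def near_zero_fraction(hidden_state: list[float], threshold: float = 1e-3) -> float:
    """Fraction of entries whose magnitude is below the threshold (higher means sparser)."""
    if not hidden_state:
        raise ValueError("hidden_state must not be empty")
    return sum(abs(value) < threshold for value in hidden_state) / len(hidden_state)


# Hypothetical activations: one dense vector and one mostly silent vector.
easy_task_state = [0.4, -0.2, 0.1, 0.3]
hard_task_state = [0.9, 0.0, 0.0, -0.0005]

print(near_zero_fraction(easy_task_state))
print(near_zero_fraction(hard_task_state))
```

Comparing this number across tasks of increasing difficulty, layer by layer, is one way to look for the localized pattern described above; other measures, such as an L1-to-L2 norm ratio, would serve the same purpose.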


Sources (6 notes)

Can minimal reasoning chains match full explanations?

Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.

What makes chain-of-thought reasoning actually work?

Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.

Why does latent chain-of-thought fail so easily in training?

Outcome supervision alone causes gradient attenuation along latent steps and lets the latent space wander without semantic grounding. Robust latent reasoning requires both dense trajectory supervision and space supervision that preserves geometric structure rather than compressing it.

Can continuous thoughts have tractable likelihoods for sampling and scoring?

NF-CoT models continuous thoughts as an autoregressive normalizing flow inside the LLM's causal stream, recovering exact likelihood, probabilistic sampling, and KV-cache compatibility. This enables policy-gradient refinement and trajectory scoring on non-verbal reasoning, matching the tractability of textual CoT.

Can latent thought vectors scale language models beyond parameters?

Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.
