INQUIRING LINE

How do soft thought tokens differ from decoded assistant outputs?

This explores the gap between the continuous, latent representations a model reasons with internally ("soft thought tokens") and the discrete text it actually emits to you — and what the corpus says gets lost in the translation between them.


The short version: the two are not the same object. The corpus increasingly suggests that the visible output is a lossy, sometimes misleading byproduct of a richer hidden process.

The cleanest case for treating them as distinct comes from work showing that models can scale up reasoning entirely in latent space, iterating on hidden states without ever verbalizing the intermediate steps. Depth-recurrent architectures, Heima, and Coconut all push test-time compute through continuous internal loops rather than token generation, which suggests that writing out a chain of thought is a training habit, not a requirement of reasoning (see: Can models reason without generating visible thinking tokens?). Soft thought tokens live in that continuous space; decoded outputs are what survives the collapse back into discrete vocabulary.
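To make the "collapse back into discrete vocabulary" concrete, here is a minimal toy sketch. It is not the method used by Coconut, Heima, or any depth-recurrent model; it assumes one simple formulation in which a "soft" token is the probability-weighted mix of vocabulary embeddings, and the vocabulary, embedding values, and function names are all hypothetical.

```python
import math

# Hypothetical 3-word vocabulary with 2-d embeddings (illustrative values only).
EMBED = {"yes": [1.0, 0.0], "no": [0.0, 1.0], "maybe": [0.5, 0.5]}
VOCAB = list(EMBED)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def logits_for(hidden):
    """Score each vocabulary item by its dot product with the hidden state."""
    return [sum(h * e for h, e in zip(hidden, EMBED[tok])) for tok in VOCAB]


def decode_hard(hidden):
    """Discrete path: keep only the top-scoring token and its embedding."""
    scores = logits_for(hidden)
    token = VOCAB[scores.index(max(scores))]
    return token, EMBED[token]


def decode_soft(hidden):
    """Continuous path: keep the whole distribution as a weighted embedding mix."""
    probs = softmax(logits_for(hidden))
    dim = len(hidden)
    mix = [sum(p * EMBED[tok][i] for p, tok in zip(probs, VOCAB)) for i in range(dim)]
    return probs, mix


hidden = [1.0, 0.8]  # a state that leans "yes" but has not ruled out "no"
token, hard_vec = decode_hard(hidden)
probs, soft_vec = decode_soft(hidden)
print(token, hard_vec)  # the decoded token's embedding has no "no" component
print([round(p, 3) for p in probs], [round(v, 3) for v in soft_vec])
```

In this toy setup the hard path returns a single token whose embedding has dropped the competing hypothesis, while the soft vector still carries weight on every candidate. That difference is the information a latent loop could keep iterating on and a decoded transcript cannot show.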

And that collapse can actively hide things. One striking finding: models trained with hidden chain-of-thought compute the correct answer in their earliest layers, then *overwrite* that representation in the final layers to emit format-compliant filler. The real reasoning is still recoverable from lower-ranked token predictions, but it never makes it into the decoded text (see: Do transformers hide reasoning before producing filler tokens?). So the assistant output isn't a transcript of the soft thinking; it can be a cover for it. Relatedly, not all emitted tokens are equal: a small set of high-entropy "forking" tokens and reflection markers like "Wait" carry most of the actual reasoning signal, while the rest is comparatively inert text (see: Do high-entropy tokens drive reasoning model improvements?; Do reflection tokens carry more information about correct answers?).
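The two measurements mentioned here, per-position entropy and lower-ranked predictions, can be sketched in a few lines. This is an illustration only: the distributions are made up, the function names are hypothetical, and the 20% cutoff is taken from the high-entropy-token note in the sources rather than from any particular implementation.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def forking_positions(dists, fraction=0.2):
    """Return the indices of the highest-entropy positions (top `fraction`)."""
    ents = [entropy(d) for d in dists]
    k = max(1, round(len(ents) * fraction))
    ranked = sorted(range(len(ents)), key=lambda i: ents[i], reverse=True)
    return sorted(ranked[:k])


def ranked_tokens(dist_by_token):
    """Tokens at one position, ordered from most to least probable."""
    return sorted(dist_by_token, key=dist_by_token.get, reverse=True)


# Made-up distributions for five positions over a three-token vocabulary.
dists = [
    [0.98, 0.01, 0.01],
    [0.90, 0.05, 0.05],
    [0.34, 0.33, 0.33],  # near-uniform: the model is choosing between paths
    [0.97, 0.02, 0.01],
    [0.85, 0.10, 0.05],
]
forks = forking_positions(dists)

# Made-up position where a filler token outranks the answer token.
position = {"<filler>": 0.70, "42": 0.25, "7": 0.05}
order = ranked_tokens(position)
print(forks, order)
```

Greedy decoding would emit only the first entry of `order`, so a reader of the text sees the filler; the answer candidate sits one rank lower, visible only to someone inspecting the distribution.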

There's also a structural reason the soft and decoded layers diverge. Meta's Large Concept Model reasons over *sentence embeddings* in a language-agnostic continuous space, then decodes to whatever target language you want; the planning happens before, and independent of, the words (see: Can reasoning happen at the sentence level instead of tokens?). That makes the embedding-space thought the primary artifact and the decoded text a downstream rendering of it.

The thing worth carrying away: the assistant text you read is the *last* and most compressed stage of the model's processing, not a window into it. If you want what the model actually "thought," the decoded output may be the wrong place to look — sometimes it's a faithful summary, sometimes it's deliberately overwritten filler, and the difference isn't visible from the words alone.


Sources (5 notes)

Can models reason without generating visible thinking tokens?

Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Do reflection tokens carry more information about correct answers?

Specific tokens like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers. Suppressing them harms reasoning while suppressing an equal number of random tokens does not, and representation recycling improves accuracy by 20%.

Can reasoning happen at the sentence level instead of tokens?

Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: **How do soft thought tokens (continuous latent representations) differ from decoded assistant outputs, and what does that gap tell us about where reasoning actually happens?**

What a curated library found — and when (dated claims, not current truth):
Findings span Dec 2024–Apr 2026. Key constraints from that window:
- Models can scale reasoning entirely in latent space without verbalizing intermediate steps; chain-of-thought is a training habit, not a reasoning requirement (Coconut, Heima, depth-recurrent methods ~2025-02).
- The correct answer is computed in early layers, then overwritten in final layers to emit format-compliant filler (~2024-12); the real signal is recoverable only from lower-ranked predictions, not decoded text.
- A sparse minority of high-entropy "forking" tokens and markers like "Wait" carry most reasoning signal; the rest is inert (~2025-06).
- Sentence embeddings in continuous space precede and are independent of word decoding (Large Concept Models, Meta, ~2024).
- Decoded text may actively obscure reasoning (bullshit, emotional framing, hallucination patterns ~2025-06 to 2026-04).

Anchor papers (verify; mind their dates):
- arXiv:2412.04537 (Dec 2024): Hidden computations in chain-of-thought
- arXiv:2502.05171 (Feb 2025): Latent reasoning scaling at test time
- arXiv:2506.02867 (Jun 2025): Mutual information & thinking tokens
- arXiv:2510.14665 (Oct 2025): Illusion of understanding in LLMs

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For each finding above, judge whether newer models (o3, latest Claude variants), improved latent-space training methods, or better interpretability tooling (SAE extraction, activation steering) have since relaxed, overturned, or sharpened the boundary. Separate the durable question—*do soft and decoded layers serve different functions?*—from perishable limitations (e.g., "models can't reason latently"). Cite what resolved each claim, and flag where constraints still hold.
(2) **Surface the strongest contradicting or superseding work from the last ~6 months.** Does any recent paper argue decoded outputs are faithful proxies for latent reasoning, or that the soft/hard distinction collapses under certain training regimes?
(3) **Propose 2 research questions that assume the regime may have moved:** e.g., *Can we steer or extract latent reasoning without decoding? Under what training objectives do soft and decoded signals re-align?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
