Does reasoning happen in hidden space or in generated tokens?
This note explores whether an LLM's actual 'thinking' lives in the hidden states it never shows you or in the visible chain-of-thought tokens it writes out. The corpus suggests the honest answer is 'mostly the hidden space, with the text as a partial interface.'
The question is where the work of reasoning gets done: in the model's internal hidden states, or in the words it generates on screen. The corpus leans hard toward the hidden space. The cleanest framing comes from a proposal to study reasoning as the formation of latent-state trajectories rather than as the surface text it produces. On this view, the written chain-of-thought is a partial interface onto a process that is already running underneath (Where does LLM reasoning actually happen during generation?). Architectures built to skip verbalization entirely back this up: depth-recurrent models, Heima, and Coconut all scale test-time compute by iterating hidden states instead of emitting tokens, which suggests the visible 'thinking out loud' is a training artifact rather than a requirement for reasoning (Can models reason without generating visible thinking tokens?).
The most striking evidence that the two can come apart is mechanistic. Using a 'logit lens' to peek inside, researchers found that models trained with hidden CoT tokens compute the correct answer in layers 1–3 and then actively overwrite those representations in the final layers to emit format-compliant filler. The real reasoning is fully recoverable from the lower-ranked predictions the model chose not to say (Do transformers hide reasoning before producing filler tokens?). In the same spirit, activation probes show models often commit to an answer internally long before they finish writing their reasoning, especially on easy problems where the chain-of-thought is essentially a performance. On genuinely hard tasks, though, the written steps do track real internal belief updates (Does chain-of-thought reasoning reflect genuine thinking or performance?).
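The logit-lens idea can be sketched in a few lines: decode each layer's hidden state through the output projection and see which tokens it would rank highest. This is a minimal illustration on toy data, not the cited study's setup. The three-token vocabulary, the identity unembedding matrix, and the hand-written hidden states are all assumptions chosen for readability, and the final layer norm that a real logit lens usually applies is omitted.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def logit_lens(hidden_states, unembed, top_k=2):
    """Decode every layer's hidden state as if it were the final layer.

    hidden_states: (n_layers, d_model), one position's state after each layer.
    unembed:       (d_model, vocab_size) output projection.
    Returns (top_ids, probs): ranked token ids per layer and full distributions.
    """
    probs = softmax(hidden_states @ unembed)
    top_ids = np.argsort(-probs, axis=-1)[:, :top_k]
    return top_ids, probs

# Toy data (hypothetical): token 0 = correct answer, 1 = filler, 2 = other.
ANSWER, FILLER = 0, 1
hidden = np.array([
    [2.0, 0.1, 0.0],   # early layer: answer dominates
    [3.0, 0.5, 0.0],   # middle layer: answer still dominates
    [1.0, 4.0, 0.0],   # final layer: filler overwrites the answer
])
unembed = np.eye(3)    # identity unembedding keeps the toy readable

top_ids, probs = logit_lens(hidden, unembed)
for layer, ids in enumerate(top_ids):
    print(f"layer {layer}: ranked tokens {ids.tolist()}")
```

In this toy, the final layer ranks the filler token first while the answer survives at rank two, which mirrors the pattern described above: the answer is still readable from the predictions the model did not emit.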
So if the answer is mostly determined in hidden space, what are the tokens for? Several notes suggest they function more as computational scaffolding than as meaning. Models trained on deliberately corrupted traces, semantically wrong or irrelevant to the problem, perform about as well as those trained on correct ones and sometimes generalize better, which is hard to square with the text being where the thinking happens (Do reasoning traces need to be semantically correct?). Relatedly, the format and spatial structure of a chain-of-thought shape reasoning far more than its logical content does, and invalid CoT prompts work as well as valid ones: CoT is pattern-guided generation, not formal logic (What makes chain-of-thought reasoning actually work?).
But 'mostly hidden' isn't 'tokens don't matter,' and this is the twist worth carrying away: not all tokens are equal. A small minority of generated tokens carry almost all the reasoning load. 'Thinking' tokens like 'Wait' and 'Therefore' spike in mutual information with the correct answer, and suppressing them damages reasoning while suppressing an equal number of random tokens does not (Do reflection tokens carry more information about correct answers?). Roughly 20% of tokens are high-entropy 'forking points' where the model genuinely decides, and training only on those matches or exceeds full-gradient training (Do high-entropy tokens drive reasoning model improvements?). Pruning studies similarly show models internally rank tokens by functional importance, preserving symbolic computation first (Which tokens in reasoning chains actually matter most?). So the picture is layered: the bulk of reasoning is latent, but specific generated tokens are the visible joints where hidden trajectories pivot.
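To make the 'forking point' idea concrete, the sketch below computes per-position entropy of next-token distributions and flags the top fraction as high-entropy tokens. It is an illustration only: the function names, the 20% default, and the toy distributions are assumptions, and the selection rule used in the cited work may differ.

```python
import numpy as np

def token_entropies(probs):
    """Shannon entropy (in nats) of each position's next-token distribution.

    probs: (seq_len, vocab_size), each row sums to 1.
    """
    p = np.clip(probs, 1e-12, 1.0)   # avoid log(0)
    return -(p * np.log(p)).sum(axis=-1)

def forking_mask(probs, fraction=0.2):
    """Boolean mask marking the highest-entropy `fraction` of positions.

    Ties at the threshold are all kept, so slightly more than
    `fraction` of positions can be selected.
    """
    ent = token_entropies(probs)
    k = max(1, int(round(fraction * len(ent))))
    threshold = np.sort(ent)[-k]
    return ent >= threshold

# Toy distributions (hypothetical): position 1 is a genuine fork,
# the other positions are near-deterministic continuations.
probs = np.array([
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.85, 0.05, 0.05, 0.05],
])
mask = forking_mask(probs, fraction=0.2)
print(mask.tolist())
```

A mask like this could, in principle, gate a per-token training loss so that only the flagged positions contribute gradient, which is one plausible reading of 'training only on those' tokens.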
The frontier of this question is dissolving the binary altogether. Large Concept Models reason over sentence embeddings in a language-agnostic space before decoding to any language (Can reasoning happen at the sentence level instead of tokens?), and diffusion LLMs decouple reasoning from answering entirely: they refine 'thinking' in masked positions alongside the answer, with answer confidence often converging before the reasoning finishes (Can reasoning and answers be generated separately in language models?). The direction of travel: reasoning is a hidden-state process, and the generated tokens are a steerable, partly optional readout of it.
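If answer confidence converges before the reasoning finishes, a stopping rule can exploit that. The sketch below is a hypothetical rule, not the mechanism from the cited work: it stops once confidence has stayed above a threshold for a few consecutive refinement steps. The function name, the threshold, the patience value, and the confidence trace are all assumptions.

```python
def early_exit_step(confidences, threshold=0.9, patience=2):
    """Return the first step index at which confidence has been at or above
    `threshold` for `patience` consecutive steps, or None if that never happens.
    """
    streak = 0
    for step, conf in enumerate(confidences):
        # Extend the streak on a confident step, reset it otherwise.
        streak = streak + 1 if conf >= threshold else 0
        if streak >= patience:
            return step
    return None

# Hypothetical per-step answer confidences from an iterative refinement loop.
confidences = [0.40, 0.70, 0.92, 0.95, 0.96, 0.97]
exit_step = early_exit_step(confidences)
print(exit_step)
```

Requiring several consecutive confident steps, rather than one, is a design choice that guards against a single noisy spike triggering a premature exit.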
Sources (11 notes)
Evidence from CoT faithfulness tests, feature steering, and layer analysis suggests latent-state dynamics drive reasoning, while surface chain-of-thought serves as a partial interface. Hidden reasoning processes should be the default focus of study.
Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
Activation probes show models commit to answers internally long before finishing their reasoning on easy tasks, but on hard tasks the reasoning process tracks real belief updates with detectable inflection points. Probe-guided early exit reduces tokens by up to 80% without accuracy loss.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.
Specific tokens like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers. Suppressing them harms reasoning while suppressing an equal number of random tokens does not, and representation recycling improves accuracy 20%.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.
ICE shows that bidirectional attention in diffusion LLMs enables in-place prompting—embedding reasoning directly in masked positions refined alongside answers. Answer confidence converges early while reasoning continues refining, allowing early-exit mechanisms to cut compute by 50% while maintaining accuracy.