How early in token generation does the reasoning mode activate?
This explores *when* during generation a model's reasoning kicks in — whether it builds up gradually over the visible chain-of-thought, or fires early and then just narrates afterward.
The surprising answer across the corpus is that reasoning switches on much earlier than the visible text suggests. The reasoning mode often activates before the model has written almost any 'thinking' at all.
The most direct evidence comes from steering experiments: a single internal reasoning feature, identified inside the model with a sparse autoencoder (SAE), can be flipped on to match or exceed full chain-of-thought performance, and it activates early in generation, early enough to override surface-level instructions the prompt gave it (see "Can we trigger reasoning without explicit chain-of-thought prompts?"). That points to reasoning as a switch thrown near the start, not a state assembled token by token. Layer-level analysis backs this up from a different angle: models trained with hidden chain-of-thought compute the correct answer in their *first few layers*, then actively suppress that representation in later layers to emit format-compliant filler text (see "Do transformers hide reasoning before producing filler tokens?"). The answer is essentially already there before the 'reasoning' is spoken.
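The layer-level finding rests on a logit-lens style readout: project each layer's hidden state through the output vocabulary matrix and see which token it favors. The sketch below is a toy illustration of that idea, not the cited study's code. The two-dimensional hidden states, the three-token vocabulary, and the names `UNEMBED`, `HIDDEN`, `ANSWER`, and `FILLER` are all hypothetical, and the final layer norm that a real logit lens usually applies before unembedding is omitted for brevity.

```python
def project(hidden, unembed):
    """Logits for one hidden state: one dot product per vocabulary row."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in unembed]


def top_token_per_layer(hidden_by_layer, unembed):
    """Top-1 token id at every layer, read out through the same unembedding."""
    tops = []
    for hidden in hidden_by_layer:
        logits = project(hidden, unembed)
        tops.append(max(range(len(logits)), key=logits.__getitem__))
    return tops


def rank_of(token_id, logits):
    """0-based rank of token_id when logits are sorted in descending order."""
    return sum(1 for x in logits if x > logits[token_id])


# Hypothetical toy data: token 0 is the answer, token 1 is filler, token 2 is other.
UNEMBED = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
# One hidden state per layer; early layers point at the answer, late ones at filler.
HIDDEN = [[0.9, 0.1], [0.8, 0.3], [0.4, 0.6], [0.1, 0.9]]
ANSWER, FILLER = 0, 1

tops = top_token_per_layer(HIDDEN, UNEMBED)
print(tops)  # [0, 0, 1, 1]: answer wins early, filler wins late
print(rank_of(ANSWER, project(HIDDEN[-1], UNEMBED)))  # 1: answer is runner-up at the end
```

In this toy setup the answer token is top-ranked in the early layers and drops to second place by the final layer, which mirrors the claim that the suppressed answer remains recoverable from lower-ranked predictions.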
Probing studies sharpen the picture by making it depend on difficulty. On easy problems, models commit to an answer internally long before they finish reasoning out loud; the written steps are performance after the fact. On hard problems the visible reasoning genuinely tracks belief updates, with detectable inflection points along the way (see "Does chain-of-thought reasoning reflect genuine thinking or performance?"). So 'how early' isn't one number: for easy questions the reasoning is effectively pre-loaded; for hard ones it unfolds as it goes. This is concrete enough to exploit: probe-guided early exit cuts up to 80% of tokens, without accuracy loss, once the internal commitment is detected.
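Probe-guided early exit can be sketched as a simple stopping rule over per-token probe confidences. This is a minimal illustration under stated assumptions, not the method from the cited work: the confidence values are made up, and the `threshold` and `patience` parameters (stop once the probe stays confident for several consecutive tokens) are hypothetical design choices.

```python
def early_exit_step(probe_confidences, threshold=0.9, patience=2):
    """Index of the token at which to stop generating, or None if never.

    Stops at the first position where the probe's confidence in a committed
    answer has been >= threshold for `patience` consecutive tokens.
    """
    streak = 0
    for step, conf in enumerate(probe_confidences):
        streak = streak + 1 if conf >= threshold else 0
        if streak >= patience:
            return step
    return None


def token_savings(total_tokens, exit_step):
    """Fraction of reasoning tokens skipped by exiting at exit_step."""
    if exit_step is None:
        return 0.0
    used = exit_step + 1  # tokens 0..exit_step were generated
    return 1 - used / total_tokens


# Hypothetical probe readings for a 10-token chain-of-thought on an easy problem.
confidences = [0.40, 0.95, 0.97, 0.60, 0.99, 0.99, 0.99, 0.99, 0.99, 0.99]
step = early_exit_step(confidences)
print(step)  # 2
print(round(token_savings(len(confidences), step), 2))  # 0.7
```

Requiring a short streak rather than a single confident reading is one way to avoid exiting on a transient spike; on hard problems, where belief keeps updating, the rule would simply fire later or not at all.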
The corpus also suggests *why* early activation is possible: reasoning may not live in the tokens at all. A broad line of work argues the real computation happens in hidden-state trajectories, with the visible chain-of-thought serving only as a partial interface (see "Where does LLM reasoning actually happen during generation?"). Other architectures scale test-time reasoning entirely in latent space without emitting any thinking tokens (see "Can models reason without generating visible thinking tokens?"), and corrupted or nonsensical reasoning traces train models about as well as correct ones, implying the trace is computational scaffolding, not the reasoning itself (see "Do reasoning traces need to be semantically correct?"). If reasoning is a latent process, there's no requirement that it 'warm up' across visible tokens.
That said, not every token is equal once generation is underway. A minority of high-entropy 'forking' tokens, only about 20%, are the pivotal decision points where the reasoning trajectory actually gets chosen (see "Do high-entropy tokens drive reasoning model improvements?"), and specific markers like 'Wait' and 'Therefore' show sharp spikes in mutual information with the correct answer (see "Do reflection tokens carry more information about correct answers?"). So the fuller story: the *mode* activates very early, often before the first reasoning token, but the *trajectory* it follows is still steered at sparse, high-stakes moments scattered through generation.
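Identifying 'forking' tokens comes down to computing the entropy of the next-token distribution at each position and keeping the highest-entropy fraction. The sketch below assumes per-position logits are already available. The toy logits and the `fraction=0.2` default (chosen to echo the roughly 20% figure) are illustrative assumptions, not the cited paper's exact selection procedure.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one position's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def forking_positions(logits_per_token, fraction=0.2):
    """Positions whose next-token entropy is in the top `fraction` of the sequence."""
    ents = [entropy(softmax(logits)) for logits in logits_per_token]
    k = max(1, round(fraction * len(ents)))
    ranked = sorted(range(len(ents)), key=lambda i: ents[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical logits for 5 positions over a 3-token vocabulary.
# Position 2 is nearly uniform (the model is undecided); the rest are peaked.
toy_logits = [
    [10.0, 0.0, 0.0],
    [0.0, 10.0, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.0, 10.0],
    [10.0, 0.0, 0.0],
]
print(forking_positions(toy_logits))  # [2]
```

Ranking by entropy and keeping a fixed fraction is only one possible selection rule; an absolute entropy threshold would be an equally reasonable alternative.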
Sources (8 notes)
SAE-identified reasoning features can be directly steered to match or exceed chain-of-thought performance across six model families. This reasoning mode activates early in generation and overrides surface-level instructions, suggesting latent reasoning is a fundamental capability independent of explicit prompting.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
Activation probes show models commit to answers internally long before finishing their reasoning on easy tasks, but on hard tasks the reasoning process tracks real belief updates with detectable inflection points. Probe-guided early exit reduces tokens by up to 80 percent without accuracy loss.
Evidence from CoT faithfulness tests, feature steering, and layer analysis suggests latent-state dynamics drive reasoning, while surface chain-of-thought serves as a partial interface. Hidden reasoning processes should be the default focus of study.
Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Specific tokens like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers. Suppressing them harms reasoning while suppressing an equal number of random tokens does not, and representation recycling improves accuracy by 20%.