How do thinking tokens function as mutual information peaks in reasoning?
This explores the finding that certain words in a model's chain-of-thought (like 'Wait' or 'Therefore') carry unusually high information about whether the final answer is correct — and what that tells us about how reasoning actually works.
The core result is that reasoning isn't spread evenly across a model's output; it concentrates. A small set of reflection or transition tokens shows sharp spikes in mutual information with the correct answer, and intervention experiments indicate that they matter: suppress those specific tokens and accuracy drops, but suppress an equal number of random tokens and accuracy is unaffected. Recycling the model's representation at those peaks also improves accuracy by ~20% (Do reflection tokens carry more information about correct answers?).
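To make "mutual information with the correct answer" concrete, here is a minimal sketch of a plug-in estimate of I(X; Y) for two discrete variables. It is a toy stand-in, not the estimator used in the cited work, which is not specified here. The function name `mutual_information` and the binary feature "the trace contains a given token" are illustrative assumptions, and the sample data is made up.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of each (x, y) pair
    count_x = Counter(xs)          # marginal counts of x
    count_y = Counter(ys)          # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p(x, y) / (p(x) * p(y)) simplifies to c * n / (count_x * count_y)
        ratio = c * n / (count_x[x] * count_y[y])
        mi += p_xy * math.log2(ratio)
    return mi

# Hypothetical data: 1 = feature present / answer correct, 0 = otherwise.
informative_feature = [1, 1, 1, 1, 0, 0, 0, 0]
uninformative_feature = [1, 1, 0, 0, 1, 1, 0, 0]
answer_correct = [1, 1, 1, 1, 0, 0, 0, 0]

mi_high = mutual_information(informative_feature, answer_correct)
mi_low = mutual_information(uninformative_feature, answer_correct)
```

In this toy case the first feature predicts correctness perfectly (one bit of information) and the second is independent of it (zero bits). A "peak" in the sense used above would be a token position whose estimate stands far above its neighbours.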
What makes this interesting is how it converges with a separate line of work approaching reasoning from the reinforcement-learning side. There, researchers found that only about 20% of tokens are high-entropy 'forking points', moments where the model is genuinely deciding which way to go, and that restricting RLVR updates to those tokens matches or beats updating on every token (Do high-entropy tokens drive reasoning model improvements?). Two different lenses, information peaks and entropy peaks, land on the same picture: a sparse minority of tokens carries the reasoning signal, and the rest is filler around them.
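The idea of selecting high-entropy forking tokens can be sketched as follows: compute the Shannon entropy of the next-token distribution at each position, then keep the top fraction. This is an illustration under assumptions, not the cited training recipe; the function names, the `fraction=0.2` default, and the tiny three-word vocabulary are all hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def high_entropy_mask(distributions, fraction=0.2):
    """Mark the top `fraction` of positions by entropy (at least one)."""
    entropies = [token_entropy(d) for d in distributions]
    k = max(1, round(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)),
                    key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(entropies))]

# Hypothetical next-token distributions at five positions.
distributions = [
    [1.00, 0.00, 0.00],  # fully determined continuation
    [0.50, 0.50, 0.00],  # a genuine fork between two options
    [0.90, 0.10, 0.00],
    [0.98, 0.02, 0.00],
    [0.70, 0.30, 0.00],
]
mask = high_entropy_mask(distributions)
```

In a training loop, a mask like this would typically multiply the per-token loss so that only the selected positions contribute gradient. How the cited work chooses its cutoff is not stated here.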
But here is the tension the corpus surfaces, and it is the thing worth knowing. A skeptical line of work argues that intermediate reasoning tokens are stylistic mimicry, not causally necessary computation: invalid traces frequently still produce correct answers, suggesting the tokens correlate with answers via learned formatting rather than functional reasoning (Do reasoning traces actually cause correct answers?). Relatedly, chain-of-thought often works even when the logical content is wrong, because format and spatial structure matter far more than the validity of the individual reasoning steps (What makes chain-of-thought reasoning actually work?). So which is it: are these peak tokens load-bearing, or theater? The likely reconciliation is that it depends on difficulty. Probes show that on easy problems models commit to an answer internally before the reasoning finishes, so the trace is performative, while on hard problems the trace tracks real belief updates with detectable inflection points (Does chain-of-thought reasoning reflect genuine thinking or performance?). The information peaks are probably real where reasoning is real.
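The contrast between easy and hard problems suggests a simple stopping rule: if a probe reports that the model has already settled on an answer, further reasoning tokens are likely performative. The sketch below assumes a probe that emits one confidence value per reasoning step. The function name, the `threshold` and `patience` parameters, and the example traces are hypothetical and are not taken from the cited work.

```python
def early_exit_step(confidences, threshold=0.9, patience=2):
    """Return the first step index at which probe confidence has stayed
    at or above `threshold` for `patience` consecutive steps, else None."""
    streak = 0
    for step, confidence in enumerate(confidences):
        streak = streak + 1 if confidence >= threshold else 0
        if streak >= patience:
            return step
    return None

# Hypothetical probe readings along three reasoning traces.
easy_trace = [0.95, 0.97, 0.98, 0.99]        # committed from the start
hard_trace = [0.40, 0.60, 0.50, 0.92, 0.95]  # belief shifts late
unsettled_trace = [0.30, 0.95, 0.40]         # never stays confident

easy_exit = early_exit_step(easy_trace)
hard_exit = early_exit_step(hard_trace)
unsettled_exit = early_exit_step(unsettled_trace)
```

Requiring several consecutive confident steps is a design choice that guards against a single noisy probe reading triggering an exit. On the easy trace the rule fires almost immediately, while on the hard trace it waits until the late inflection.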
Two further threads make this less abstract than it sounds. First, if specific tokens are the information-bearing parts, you'd expect that more tokens aren't simply better, and they aren't: accuracy peaks and then declines as thinking length grows, with models overthinking easy problems (Does more thinking time always improve reasoning accuracy?), and the optimal threshold stubbornly resists prediction across models and tasks (How can we predict the optimal thinking token threshold?). Second, whether thinking helps at all depends on training: vanilla models use a 'thinking mode' to spiral into counterproductive self-doubt, and RL training is what flips the same mechanism into productive analysis (Does extended thinking help or hurt model reasoning?).
The deepest question, though, is whether verbalization is required at all. If reasoning lives in a few informational peaks, the surrounding language may be scaffolding the model doesn't strictly need. Indeed, latent-reasoning architectures scale test-time compute through hidden-state iteration without generating any visible thinking tokens, suggesting verbalization is a training artifact rather than a reasoning requirement (Can models reason without generating visible thinking tokens?). The mutual-information peak story and the latent-reasoning story are two ends of the same thread: reasoning is concentrated, and the words may be where it surfaces, not where it happens.
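Hidden-state iteration can be pictured with a toy weight-tied update applied repeatedly to a latent vector: more iterations mean more compute, yet no tokens are emitted along the way. This is only a schematic under assumptions, not the architecture of any of the cited models. The scalar weights `w_h` and `w_x`, the function names, and the input values are invented for illustration.

```python
import math

def recurrent_block(hidden, inputs, w_h=0.5, w_x=1.0):
    """One application of a toy weight-tied update to the hidden state."""
    return [math.tanh(w_h * h + w_x * x) for h, x in zip(hidden, inputs)]

def latent_reason(inputs, num_iterations):
    """Refine a hidden state for `num_iterations` steps without emitting tokens."""
    hidden = [0.0] * len(inputs)
    for _ in range(num_iterations):
        hidden = recurrent_block(hidden, inputs)
    return hidden

# Hypothetical input embedding; more iterations means more test-time compute.
inputs = [0.3, -0.2]
shallow = latent_reason(inputs, 1)
deep = latent_reason(inputs, 20)
deeper = latent_reason(inputs, 21)

# Largest change from one extra iteration once the state has settled.
residual = max(abs(a - b) for a, b in zip(deep, deeper))
```

With these particular weights the update is a contraction, so the state settles toward a fixed point and `residual` becomes very small. A real model would decode an answer from the final hidden state rather than from intermediate text.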
Sources (9 notes)
Specific tokens like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers. Suppressing them harms reasoning while suppressing equal random tokens does not, and representation recycling improves accuracy 20%.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
R1's intermediate tokens carry no special execution semantics and are generated identically to other LLM output. Invalid traces frequently produce correct answers, indicating that traces are not causally necessary: they correlate with answers via learned formatting, not functional reasoning.
Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.
Activation probes show models commit to answers internally long before finishing their reasoning on easy tasks, but on hard tasks the reasoning process tracks real belief updates with detectable inflection points. Probe-guided early exit reduces tokens by up to 80% without accuracy loss.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
The overthinking threshold depends on task difficulty, model training, and domain, but remains invisible until crossed. Recent work suggests difficulty estimators and runtime confidence signals can detect thresholds dynamically.
Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.