Do reflection tokens carry more information about correct answers?
Explores whether tokens expressing reflection and transitions concentrate information about reasoning outcomes disproportionately compared to other tokens, and what role they play in reasoning performance.
By tracking the mutual information (MI) between intermediate representations and the correct answer at each step of a large reasoning model's (LRM's) reasoning, an interesting phenomenon emerges: MI spikes suddenly at specific steps, creating sparse, non-uniform "MI peaks" throughout the reasoning process.
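To make "sparse MI peaks" concrete, here is a minimal sketch that flags peak steps in a per-step MI series. It assumes the MI estimates already exist (this note does not specify the estimator). The function name `find_mi_peaks`, the mean-plus-k-standard-deviations threshold, and the numbers in `mi` are all illustrative assumptions, not the paper's procedure or data.

```python
from statistics import mean, pstdev

def find_mi_peaks(mi_per_step, k=1.0):
    """Return indices of steps whose MI estimate is a local maximum and
    exceeds mean + k * std over the trace (a hypothetical threshold rule)."""
    threshold = mean(mi_per_step) + k * pstdev(mi_per_step)
    peaks = []
    for i, value in enumerate(mi_per_step):
        left = mi_per_step[i - 1] if i > 0 else float("-inf")
        right = mi_per_step[i + 1] if i + 1 < len(mi_per_step) else float("-inf")
        if value > threshold and value >= left and value >= right:
            peaks.append(i)
    return peaks

# Synthetic trace: token strings and made-up MI values, for illustration only.
tokens = ["The", "sum", "is", "12", "Wait", ",", "recheck", "the", "carry", "So", "14"]
mi = [0.10, 0.12, 0.11, 0.15, 0.90, 0.14, 0.13, 0.12, 0.16, 0.85, 0.20]

peak_indices = find_mi_peaks(mi, k=1.0)
peak_tokens = [tokens[i] for i in peak_indices]
print(peak_indices, peak_tokens)
```

On this synthetic trace the two flagged steps are the "Wait" and "So" positions, which is the pattern the note describes: most steps sit near a low baseline and a few reflection or transition tokens stand out.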
These peaks overwhelmingly correspond to tokens expressing reflection, self-correction, or transitions ("Wait," "Hmm," "Therefore," "So"), which the authors term "thinking tokens."
Three key findings:
Thinking tokens are functionally necessary. Fully suppressing them significantly harms reasoning performance, whereas suppressing the same number of randomly chosen tokens has minimal impact. The answer-relevant information is therefore concentrated in the thinking tokens rather than distributed evenly across the trace.
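The suppression comparison can be sketched as logit masking at a single decoding step. This is a toy illustration under stated assumptions: the helper names `suppress` and `random_control_ids`, the 8-token vocabulary, and the choice to draw the random control from non-thinking ids are all hypothetical. The actual experiment compares downstream task accuracy, not one greedy pick.

```python
import random

NEG_INF = float("-inf")

def suppress(logits, banned_ids):
    """Return a copy of `logits` with banned ids set to -inf so they cannot be chosen."""
    banned = set(banned_ids)
    return [NEG_INF if i in banned else x for i, x in enumerate(logits)]

def random_control_ids(vocab_size, thinking_ids, seed=0):
    """Pick as many non-thinking ids as there are thinking ids (the control condition)."""
    excluded = set(thinking_ids)
    candidates = [i for i in range(vocab_size) if i not in excluded]
    return random.Random(seed).sample(candidates, len(excluded))

def argmax(values):
    """Index of the largest value (greedy decoding for one step)."""
    return max(range(len(values)), key=values.__getitem__)

# Toy vocabulary of 8 ids; ids 2 and 5 stand in for "Wait" and "Hmm".
logits = [0.1, 0.3, 2.0, 0.2, 0.4, 1.5, 0.0, 0.6]
thinking_ids = [2, 5]

ablated = suppress(logits, thinking_ids)
control = suppress(logits, random_control_ids(len(logits), thinking_ids))

print(argmax(logits), argmax(ablated), argmax(control))
```

In this toy step, masking the thinking ids changes the greedy choice, while masking an equal number of other ids leaves it untouched. That mirrors the asymmetry reported above, though only as an analogy.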
MI peaks are a training artifact. Base models (e.g., LLaMA-3.1-8B) do not exhibit the MI peaks phenomenon clearly. The distinct pattern emerges from reasoning-intensive training (RL post-training). This suggests reasoning training teaches models to concentrate information at specific reflection points.
Two practical improvements follow. Representation Recycling (allowing MI-peak representations to iterate through the model multiple times) improves accuracy by 20% on AIME24. Thinking Token Test-time Scaling (forcing continued reasoning from thinking tokens when budget remains) yields steady performance improvements.
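A minimal sketch of the control flow behind Thinking Token Test-time Scaling, assuming only what the note states: when the model stops and token budget remains, a thinking token is appended and generation continues. The function `scale_with_thinking_tokens`, the `generate` callable, the `max_rounds` cap, and the toy model are hypothetical stand-ins; the paper's exact mechanism may differ.

```python
def scale_with_thinking_tokens(generate, prompt_tokens, budget,
                               thinking_token="Wait", max_rounds=8):
    """Extend a reasoning trace up to `budget` tokens.

    `generate(context_tokens, max_new_tokens)` is a hypothetical callable that
    returns the tokens a model emits before it stops on its own.
    """
    trace = list(generate(prompt_tokens, budget))[:budget]
    rounds = 0
    # Continue only if there is room for the thinking token plus one more token.
    while budget - len(trace) >= 2 and rounds < max_rounds:
        trace.append(thinking_token)  # force the model to keep reasoning
        remaining = budget - len(trace)
        trace.extend(generate(prompt_tokens + trace, remaining)[:remaining])
        rounds += 1
    return trace

def toy_generate(context_tokens, max_new_tokens):
    """Stand-in for a real model: emits a three-token step, then stops."""
    return ["step", "done", "."][:max_new_tokens]

trace = scale_with_thinking_tokens(toy_generate, ["Q:"], budget=10)
print(trace)
```

With the toy model and a budget of 10, the trace contains three reasoning chunks joined by two forced "Wait" tokens. The `max_rounds` cap is a safety choice for this sketch so that a model returning nothing cannot loop indefinitely.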
This provides an information-theoretic complement to the sentence-level thought anchors finding. The note "Which sentences actually steer a reasoning trace?" identifies planning and backtracking sentences via counterfactual, attention, and causal suppression methods. MI peaks identify the same pivotal role via information theory, converging from a different analytical direction.
The convergence across methods (counterfactual importance, attention patterns, causal suppression, and now mutual information) and across granularity levels (token-level MI peaks, sentence-level thought anchors, RLVR's high-entropy forking tokens) strongly supports the claim that reasoning traces have a sparse-pivot structure. Most tokens are filler; a small subset carries the reasoning signal.
Inquiring lines that use this note as a source 81
This note is a source for these synthesized inquiries.
- Why do one-shot transparency studies miss the temporal reversal entirely?
- How can we measure whether assistance preserved the user's reasoning state?
- How does smooth probabilistic flow differ from turbulent rhetorical exploration?
- Why do tokens need validators while commodities need standardization?
- What distinguishes evaluative stance-taking from the mechanical conformity shape-holding describes?
- Why does the first generated token trigger collapse of task superposition?
- Why does scaling reasoning tokens fail to improve unfamiliar tasks?
- What distinguishes genuine reasoning activation from memorization-assisted answer recall?
- What is the critical thinking token threshold beyond which accuracy degrades?
- Can reflection in reasoning models be corrective rather than just confirmatory?
- How do soft thought tokens differ from decoded assistant outputs?
- How do thinking tokens exhibit diminishing returns beyond a critical threshold?
- What behavioral markers signal when reasoning chains are performative?
- Do tokens beyond a critical threshold actually improve reasoning quality?
- What makes some tokens carry disproportionate information about answers?
- Why do transformers weight early tokens more heavily than later ones?
- How do ordering effects compound across different prompt component scales?
- What makes intentional structure shifts different from segment boundaries?
- What does attentional state look like in a static context window?
- Does thinking-token overuse actually degrade reasoning accuracy in practice?
- How do attention heads separate text retrieval from internal thought representation?
- What determines the optimal thinking token threshold for a given task?
- What makes symbolic operations different from general knowledge questions?
- Why does reasoning accuracy degrade beyond a critical thinking token threshold?
- How do thinking tokens function as mutual information peaks in reasoning?
- What is event-residue and how does it differ from utterances?
- Can derivational traces be distinguished from stylistic mimicry of reasoning?
- How does the EAFR schema distinguish between reflection and action in conversation?
- Can token efficiency come from stopping before reflection?
- How do smaller models respond to longer reflection prompts?
- Does reflection destabilize reasoning in dynamic environments?
- Why does reflection in reasoning models stay confirmatory instead of corrective?
- Why does the same recalled information lead to different reasoning conclusions?
- Does thought consolidation address the confirmatory reflection problem in reasoning models?
- How does the [remention] token help models distinguish initial from later mentions?
- Do attention scores predict which tokens will be pruned first?
- Do reflection tokens and symbolic tokens serve different roles in reasoning?
- What happens to reasoning accuracy when models use more thinking tokens?
- Which sentences in reasoning traces actually influence the final answer?
- How do partial credit grading systems accidentally reward reasoning theater?
- What distinguishes reflection that satisfies constraints from reflection that merely sounds reflective?
- Why does reflection in reasoning models tend to be confirmatory rather than corrective?
- How do execution and planning tokens differ in their entropy dynamics?
- Do thought anchors correspond mechanistically to planning tokens in RL?
- How does chain-of-thought length affect attention to constraint tokens?
- Does verbal step-by-step reflection preserve learning signals that abstraction removes?
- How early in token generation does the reasoning mode activate?
- Why does reflection in reasoning models confirm rather than correct initial directions?
- How does self-referential processing transfer to other reasoning tasks?
- Can early stopping on reflection tokens save computation without accuracy loss?
- How does tokenization change what gets counted as valuable knowledge?
- Why do final answers contradict what the thinking draft explicitly concluded?
- How does confirmatory reflection differ from corrective self-evaluation in models?
- Why does representation recycling of MI-peak tokens improve reasoning accuracy?
- Can thinking token density explain reasoning performance beyond total length?
- What distinguishes memorized tokens from causally necessary reasoning steps?
- What semantic information is lost if analysis skips the token embedding layer?
- How much does switching overhead reduce reasoning token efficiency?
- When should persona attention weight activate versus stay dormant during scoring?
- How does reflection-based query refinement differ from single-pass retrieval strategies?
- Can knowledge density per token be measured as a quality metric?
- Which tokens actually change across different reasoning paths in rollouts?
- Can explicit reflection during AI-assisted work improve transfer of learning?
- What makes thinking tokens carry more information than other tokens?
- Can models internally identify which tokens matter most for reasoning?
- How do thought anchors differ from individual forking tokens mechanistically?
- Does reasoning happen in hidden space or in generated tokens?
- Does next-token prediction actually explain how human thought works?
- Do models cache intentions about response topics before generating the first token?
- How do meta-tokens help models learn when to generate reasoning versus commit predictions?
- How does predictive accuracy on future tokens differ from correctness on labeled answers?
- What makes token-level reasoning during pretraining different from test-time chain-of-thought?
- How do continuous concept tokens compare to latent trajectory sampling?
- Why does reflection in reasoning models mostly confirm the first answer?
- What makes uncertainty tokens like Wait carry more information than content tokens?
- How do token-level rewards and rubric gates serve different statistical functions?
- What makes reasoning tokens identifiable within rollout groups for better rewards?
- Does the token prediction framing actually capture what human reasoning does?
- How does token-level interaction like ColBERT overcome commutativity constraints?
- Why does reflection in reasoning models often become theater rather than genuine thought?
- How do early-prefix tokens control the generation of entire continuations?
Related concepts in this collection 7
- Which sentences actually steer a reasoning trace?
Can we identify which sentences in a reasoning trace have outsized influence on the final answer? Three independent methods converge on a surprising answer about planning and backtracking.
sentence-level complement; MI peaks add information-theoretic evidence for the same sparse-pivot structure
- Do high-entropy tokens drive reasoning model improvements?
Explores whether only a small fraction of tokens—those with high entropy at decision points—actually matter for improving reasoning performance in language models, and whether training on them alone could work as well as full training.
token-level RLVR analog: high-entropy tokens during training correspond to MI-peak tokens during inference
- Does more thinking time always improve reasoning accuracy?
Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
MI peaks explain what matters within the token budget: it's the density of thinking tokens, not total length
- Does RL teach reasoning or just when to use it?
Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
MI peaks as a mechanistic signature: RL training creates the MI-peak pattern that base models lack
- Do reasoning cycles in hidden states reveal aha moments?
What if the internal loops in model reasoning—visible in hidden-state topology—correspond to the reconsidering moments that happen during reasoning? This note explores whether graph cyclicity captures a mechanistic signature of insight.
hidden-state topology confirms the same sparse-pivot structure
- Can we measure how deeply a model actually reasons?
What if reasoning quality isn't about length or confidence, but about how much a model's predictions shift across its internal layers? Can tracking these shifts reveal genuine thinking versus pattern-matching?
complementary token-level measurement: MI peaks identify WHICH tokens matter via information theory; DTR identifies HOW DEEPLY the model computes at each token via layer-wise prediction stabilization. These are orthogonal methods converging on the same sparse-pivot structure. At the representation-graph level, cyclicity corresponds to backtracking tokens (MI peaks at self-correction) and diameter tracks exploration breadth; both analyses converge on reasoning having a concentrated structure rather than uniform information distribution
- Does self-distillation harm mathematical reasoning performance?
Self-distillation usually improves models while shortening outputs, but mathematical reasoning shows a puzzling exception: performance drops up to 40%. What mechanism explains this counter-intuitive degradation?
empirical consequence: when self-distillation suppresses the very Wait/Hmm tokens this note identifies as MI peaks, reasoning performance drops up to 40% on Qwen3 and DeepSeek-Distill. The Why-Does-Self-Distillation paper provides the strongest experimental confirmation that thinking tokens are functionally necessary — not just correlationally informative — across post-training procedures.
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning
- Direct Reasoning Optimization: Token-Level Reasoning Reflectivity Meets Rubric Gates for Unverifiable Tasks
- First Try Matters: Revisiting the Role of Reflection in Reasoning Models
- Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
- Thought Anchors: Which LLM Reasoning Steps Matter?
- Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time
- Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens
- When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models
Original note title
thinking tokens are mutual information peaks — sparse reflection and transition tokens carry disproportionate information about correct answers