SYNTHESIS NOTE

Do reflection tokens carry more information about correct answers?

Explores whether tokens expressing reflection and transitions concentrate information about reasoning outcomes disproportionately compared to other tokens, and what role they play in reasoning performance.

Synthesis note · 2026-02-23 · sourced from MechInterp

By tracking the mutual information (MI) between intermediate representations and the correct answer at each step of a large reasoning model's (LRM's) reasoning, a distinctive pattern emerges: MI spikes abruptly at a few specific steps, producing sparse, non-uniform "MI peaks" across the reasoning process.
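As a rough illustration of what "sparse, non-uniform peaks" means operationally, the sketch below flags steps whose MI estimate stands well above the rest of the trace. The threshold rule (mean plus `k` standard deviations), the function name `find_mi_peaks`, and the numbers in `mi_trace` are all assumptions made for illustration; the note does not say how the authors estimate MI or define a peak.

```python
from statistics import mean, pstdev


def find_mi_peaks(mi_per_step, k=1.5):
    """Return indices of steps whose MI exceeds mean + k * std.

    The thresholding rule is an assumption for illustration, not the
    criterion used in the source paper.
    """
    mu = mean(mi_per_step)
    sigma = pstdev(mi_per_step)
    threshold = mu + k * sigma
    return [i for i, value in enumerate(mi_per_step) if value > threshold]


# Hypothetical per-step MI estimates (illustrative numbers only).
mi_trace = [0.10, 0.12, 0.11, 0.95, 0.13, 0.10,
            0.12, 0.11, 0.90, 0.12, 0.10, 0.11]

peaks = find_mi_peaks(mi_trace)
print(peaks)  # indices of the steps flagged as peaks
```

In a real analysis the flagged indices would then be mapped back to the tokens generated at those steps, to check whether they are reflection or transition words.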

These peaks overwhelmingly correspond to tokens expressing reflection, self-correction, or transitions — "Wait," "Hmm," "Therefore," "So" — which the authors term "thinking tokens." Three key findings:

1. Thinking tokens are functionally necessary. Fully suppressing them significantly harms reasoning performance, whereas suppressing the same number of randomly chosen tokens has minimal impact. The information is concentrated in the thinking tokens, not distributed across the trace.
2. MI peaks emerge from training. Base models (e.g., LLaMA-3.1-8B) do not clearly exhibit MI peaks. The distinct pattern emerges from reasoning-intensive training (RL post-training). This suggests reasoning training teaches models to concentrate information at specific reflection points.
3. Two practical improvements follow. Representation Recycling (allowing MI-peak representations to iterate through the model multiple times) improves accuracy by 20% on AIME24. Thinking Token Test-time Scaling (forcing continued reasoning from thinking tokens when budget remains) yields steady performance improvements.
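The control flow behind Thinking Token Test-time Scaling can be sketched as a simple loop: whenever the model stops with token budget left over, append a thinking token and let it continue. This is a minimal sketch under stated assumptions, not the authors' implementation. The callable `generate(tokens, max_new)`, the function `scale_with_thinking_tokens`, the choice of "Wait" as the appended token, and the toy generator are all hypothetical.

```python
def scale_with_thinking_tokens(generate, prompt, budget, thinking_token="Wait"):
    """Extend a reasoning trace until the token budget is (nearly) used up.

    `generate(tokens, max_new)` is a hypothetical callable that returns at
    most `max_new` new tokens and may stop early.
    """
    tokens = list(prompt)
    used = 0
    while used < budget:
        new_tokens = generate(tokens, budget - used)
        tokens.extend(new_tokens)
        used += len(new_tokens)
        # Only force more reasoning if there is room for the thinking
        # token plus at least one token of continued reasoning.
        if budget - used < 2:
            break
        tokens.append(thinking_token)
        used += 1
    return tokens


def toy_generate(tokens, max_new):
    """Stand-in for a model: always emits the same short continuation."""
    return ["step", "step", "answer"][:max_new]


out = scale_with_thinking_tokens(toy_generate, ["Q"], budget=8)
print(out)
```

The guard before appending keeps the trace from ending on a dangling thinking token when the budget is almost exhausted.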

This provides an information-theoretic complement to the sentence-level thought-anchors finding. The note "Which sentences actually steer a reasoning trace?" identifies planning and backtracking sentences via counterfactual, attention, and causal-suppression methods. MI peaks identify the same pivotal role via information theory, converging from a different analytical direction.

The convergence across methods (counterfactual importance, attention patterns, causal suppression, and now mutual information) and across granularity levels (token-level MI peaks, sentence-level thought anchors, RLVR's high-entropy forking tokens) strongly supports the claim that reasoning traces have a sparse-pivot structure. Most tokens are filler; a small subset carries the reasoning signal.
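One simple way to make "convergence across methods" concrete would be to compare the sets of positions each method flags as pivotal. The sketch below computes a Jaccard overlap between two such sets. The position sets and the names `jaccard`, `mi_peak_positions`, and `high_entropy_positions` are hypothetical, and the note does not report any such overlap figure.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of token positions."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical positions flagged by two different analyses of one trace.
mi_peak_positions = {3, 8, 15}
high_entropy_positions = {3, 8, 21}

overlap = jaccard(mi_peak_positions, high_entropy_positions)
print(overlap)
```

A high overlap on real traces would support the sparse-pivot reading; a low one would suggest the methods are picking out different tokens.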

Inquiring lines that read this note 84

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

How do we evaluate AI systems when user perception misleads actual performance?

Why do one-shot transparency studies miss the temporal reversal entirely?

How does AI assistance affect human cognitive development and reasoning autonomy?

How does rhetorical adaptation affect LLM persuasion and detectability?

How does smooth probabilistic flow differ from turbulent rhetorical exploration?

Does tokenized intelligence retain genuine value through exchange-based systems?

Can ensemble evaluation methods reduce bias more than single judges?

What distinguishes evaluative stance-taking from the mechanical conformity shape-holding describes?

Why do reasoning models fail at systematic problem-solving and search?

Why does the first generated token trigger collapse of task superposition?

When do additional thinking tokens stop improving reasoning performance?

How do training data properties shape reasoning capability development?

What distinguishes genuine reasoning activation from memorization-assisted answer recall?

Does self-reflection enable models to reliably correct their errors?

How do soft continuous representations explore multiple reasoning paths simultaneously?

Do reasoning traces faithfully represent or merely mimic actual model reasoning?

Can next-token prediction alone produce genuine language understanding?

What structural biases does transformer attention create in language model outputs?

Why do transformers weight early tokens more heavily than later ones?

How do prompt structure and constraints affect model instruction reliability?

How do formal dialogue structures reveal conversation coherence mechanisms?

What makes intentional structure shifts different from segment boundaries?

How do transformer attention mechanisms implement memory and algorithmic functions?

Do language models perform faithful symbolic reasoning independent of semantic grounding?

What makes symbolic operations different from general knowledge questions?

How does latent reasoning compare to verbalized chain-of-thought?

Does conversational format create illusions of genuine AI communication?

What is event-residue and how does it differ from utterances?

How should dialogue recommender systems manage conversation history and state?

How does the EAFR schema distinguish between reflection and action in conversation?

Can prompting inject entirely new knowledge into language models?

How do smaller models respond to longer reflection prompts?

How can process reward models supervise complex reasoning traces?

How do partial credit grading systems accidentally reward reasoning theater?

Why do correct reasoning traces tend to be shorter than incorrect ones?

How does chain-of-thought length affect attention to constraint tokens?

Why does self-revision increase model confidence while degrading accuracy?

Why do final answers contradict what the thinking draft explicitly concluded?

How do LLMs distinguish causal reasoning from temporal and semantic associations?

What distinguishes memorized tokens from causally necessary reasoning steps?

How can recommendation systems balance personalization with stability and coverage?

When should persona attention weight activate versus stay dormant during scoring?

How should iterative research systems allocate reasoning per search step?

How does reflection-based query refinement differ from single-pass retrieval strategies?

What dimensions of recommendation quality do standard metrics miss?

Can knowledge density per token be measured as a quality metric?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

What properties determine whether reward signals teach genuine reasoning?

Does reinforcement learning teach reasoning or just when to reason?

What makes reasoning tokens identifiable within rollout groups for better rewards?

Why do semantic similarity and task relevance diverge in vector embeddings?

How does token-level interaction like ColBERT overcome commutativity constraints?

Related concepts in this collection 7


Which sentences actually steer a reasoning trace? Can we identify which sentences in a reasoning trace have outsized influence on the final answer? Three independent methods converge on a surprising answer about planning and backtracking.
sentence-level complement; MI peaks add information-theoretic evidence for the same sparse-pivot structure
Do high-entropy tokens drive reasoning model improvements? Explores whether only a small fraction of tokens—those with high entropy at decision points—actually matter for improving reasoning performance in language models, and whether training on them alone could work as well as full training.
token-level RLVR analog: high-entropy tokens during training correspond to MI-peak tokens during inference
Does more thinking time always improve reasoning accuracy? Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
MI peaks explain what matters within the token budget: it's the density of thinking tokens, not total length
Does RL teach reasoning or just when to use it? Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
MI peaks as a mechanistic signature: RL training creates the MI-peak pattern that base models lack
Do reasoning cycles in hidden states reveal aha moments? What if the internal loops in model reasoning—visible in hidden-state topology—correspond to the reconsidering moments that happen during reasoning? This note explores whether graph cyclicity captures a mechanistic signature of insight.
hidden-state topology confirms the same sparse-pivot structure at the representation-graph level: cyclicity corresponds to backtracking tokens (MI peaks at self-correction), diameter tracks exploration breadth; both analyses converge on reasoning having a concentrated structure rather than uniform information distribution
Can we measure how deeply a model actually reasons? What if reasoning quality isn't about length or confidence, but about how much a model's predictions shift across its internal layers? Can tracking these shifts reveal genuine thinking versus pattern-matching?
complementary token-level measurement: MI peaks identify WHICH tokens matter via information theory; DTR identifies HOW DEEPLY the model computes at each token via layer-wise prediction stabilization; orthogonal methods converging on the same sparse-pivot structure
Does self-distillation harm mathematical reasoning performance? Self-distillation usually improves models while shortening outputs, but mathematical reasoning shows a puzzling exception: performance drops up to 40%. What mechanism explains this counter-intuitive degradation?
empirical consequence: when self-distillation suppresses the very Wait/Hmm tokens this note identifies as MI peaks, reasoning performance drops up to 40% on Qwen3 and DeepSeek-Distill. The Why-Does-Self-Distillation paper provides the strongest experimental confirmation that thinking tokens are functionally necessary — not just correlationally informative — across post-training procedures.

Original note title

thinking tokens are mutual information peaks — sparse reflection and transition tokens carry disproportionate information about correct answers
