Can we identify which tokens actually matter for reasoning?
Most tokens in an answer are determined by language patterns rather than reasoning. Is there a way to distinguish the small fraction of tokens whose certainty genuinely depends on the chain of thought?
DRO introduces a clean operational definition of "the tokens that depend on the reasoning." For each token in a reference answer, measure the model's self-certainty under different sampled chain-of-thought prefixes. Most tokens — articles, connectives, lexically expected words — barely change in certainty across rollouts. A small minority show high variance: their certainty depends on which reasoning path was taken. These are the reasoning-reflective tokens. They are not lexically distinctive — they cannot be identified by surface features — but they carry the answer's actual sensitivity to the reasoning chain.
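The definition above can be sketched in a few lines. This is a minimal illustration, not DRO's actual implementation: it assumes "certainty" is the probability the model assigns to each reference token given one sampled chain of thought, and the function names, the `top_fraction` cutoff, and the toy numbers are all hypothetical.

```python
import numpy as np

def cross_rollout_variance(certainty):
    """Per-token variance of certainty across rollouts.

    certainty: array of shape (num_rollouts, num_tokens); entry [i, t] is the
    (assumed) probability of reference token t given chain-of-thought i.
    """
    certainty = np.asarray(certainty, dtype=float)
    return certainty.var(axis=0)

def reflective_token_mask(certainty, top_fraction=0.2):
    """Flag the highest-variance tokens (hypothetical top-fraction cutoff)."""
    var = cross_rollout_variance(certainty)
    k = max(1, int(round(top_fraction * var.size)))
    threshold = np.sort(var)[-k]  # k-th largest variance
    return var >= threshold

# Toy data: 4 rollouts x 5 reference tokens. Tokens 0, 1 and 3 are nearly
# constant across rollouts; tokens 2 and 4 swing with the reasoning path.
certainty = np.array([
    [0.99, 0.98, 0.90, 0.97, 0.85],
    [0.99, 0.97, 0.20, 0.98, 0.30],
    [0.98, 0.98, 0.75, 0.97, 0.60],
    [0.99, 0.98, 0.35, 0.97, 0.80],
])

variances = cross_rollout_variance(certainty)
mask = reflective_token_mask(certainty, top_fraction=0.4)
print(mask)
```

Nothing in the computation looks at the token strings themselves, which matches the note's point that these tokens cannot be found from surface features.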
The implication for reward design is that a uniform average across all reference tokens has a poor signal-to-noise ratio. The average is dominated by tokens whose certainty is determined by language modeling rather than by reasoning, so whatever differential the reasoning chain produces is diluted by tokens that would have appeared regardless. The variance filter is what isolates the reasoning-bearing fraction of the answer.
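A back-of-the-envelope version of the dilution argument, under simplifying assumptions that are not from the source: the reference answer has $T$ tokens, of which a subset $S$ of size $k$ is reasoning-reflective; two rollouts $a$ and $b$ differ by $\Delta$ in certainty $c_t$ on every token in $S$ and not at all on the rest.

```latex
% Uniform average over all T reference tokens
\bar{c}^{(a)} - \bar{c}^{(b)}
  = \frac{1}{T} \sum_{t=1}^{T} \bigl( c_t^{(a)} - c_t^{(b)} \bigr)
  = \frac{k}{T}\,\Delta

% Average restricted to the reflective subset S
\bar{c}_S^{(a)} - \bar{c}_S^{(b)}
  = \frac{1}{k} \sum_{t \in S} \bigl( c_t^{(a)} - c_t^{(b)} \bigr)
  = \Delta
```

With, say, $k = 5$ and $T = 100$, the uniform average shows only $0.05\,\Delta$ of the contrast that the filtered average retains in full.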
Up-weighting these high-variance tokens produces a sharper reward contrast across rollouts in a group. The mechanism is purely statistical — no human annotation, no per-step rubric, no extra model. Cross-rollout variance is computed from the policy's own samples, which makes the method cheap relative to process reward models (PRMs) that require labeled intermediate steps.
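As a rough sketch of the contrast effect, the snippet below compares a uniform average with a variance-weighted average on toy data. The normalized-variance weighting is an assumption chosen for illustration; the source does not specify DRO's exact weighting scheme, and the function names and numbers are hypothetical.

```python
import numpy as np

def uniform_reward(certainty):
    """Mean certainty over all reference tokens, one value per rollout."""
    return certainty.mean(axis=1)

def variance_weighted_reward(certainty, eps=1e-12):
    """Weight each token by its cross-rollout variance (assumed scheme)."""
    var = certainty.var(axis=0)        # per-token variance across rollouts
    weights = var / (var.sum() + eps)  # normalize so the weights sum to ~1
    return certainty @ weights         # weighted average per rollout

# Toy group: 4 rollouts x 5 reference tokens (illustrative numbers only).
certainty = np.array([
    [0.99, 0.98, 0.90, 0.97, 0.85],
    [0.99, 0.97, 0.20, 0.98, 0.30],
    [0.98, 0.98, 0.75, 0.97, 0.60],
    [0.99, 0.98, 0.35, 0.97, 0.80],
])

u = uniform_reward(certainty)
w = variance_weighted_reward(certainty)

# Spread of rewards within the group: larger means sharper contrast.
spread_u = u.max() - u.min()
spread_w = w.max() - w.min()
print(spread_u, spread_w)
```

On this toy group the weighted reward separates the rollouts more widely than the uniform mean while ranking them in the same order. Both quantities come from the policy's own samples, so no labels or extra model are involved.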
The deeper point is that token-level reward density is not the issue; token-level dense rewards have been proposed before. The issue is which tokens to weight, and the answer "weight tokens by their variance under different reasoning prefixes" turns out to be a self-supervised filter that recovers the reasoning-bearing dimension without supervision.
This connects to L2T's information-theoretic dense process rewards as an alternative dense-signal strategy: L2T scores reasoning steps by their contribution to answer correctness; DRO scores tokens by their sensitivity to reasoning. Both replace uniform averaging with a structure-aware signal; both achieve sample efficiency by concentrating the gradient where it matters.
Inquiring lines that use this note as a source 5
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- What makes some tokens carry disproportionate information about answers?
- How does tokenization change what gets counted as valuable knowledge?
- Which tokens actually change across different reasoning paths in rollouts?
- What makes thinking tokens carry more information than other tokens?
- What makes uncertainty tokens like Wait carry more information than content tokens?
Related concepts in this collection 4
- Can we reward reasoning steps without human annotation?
  Existing RL for reasoning uses only final-answer rewards, causing models to produce wastefully long chains. Can information theory provide dense, automatic feedback for individual reasoning steps?
  alternative dense-reward design at the step level rather than the token level
- Which tokens in reasoning chains actually matter most?
  Do language models internally rank tokens by functional importance? Greedy pruning experiments explore whether models preserve symbolic computation while discarding linguistic scaffolding, and what this reveals about reasoning architecture.
  independent evidence that reasoning chains have token-level structure that uniform averaging hides
- Can rubrics and dense rewards work together without hacking?
  Explores whether reward signals derived from rubrics suffer from exploitation, and whether separating rubric judgments from optimization signals could prevent this failure mode.
  DRO's other half: the rubric-gate that complements R3
- Can one statistical measure serve dual purposes in RL training?
  Explores whether cross-rollout variance can simultaneously weight important tokens and filter low-signal queries, potentially unlocking efficiency gains in reasoning tasks without human labels.
  DRO's third use of the same variance signal
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Direct Reasoning Optimization: Token-Level Reasoning Reflectivity Meets Rubric Gates for Unverifiable Tasks
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning
- Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think
- Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
- Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time
- Thought Anchors: Which LLM Reasoning Steps Matter?
- Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens
Original note title
reasoning-reflective tokens are identifiable by high cross-rollout variance under different CoT prefixes — most reference tokens are reasoning-invariant and dilute uniformly-averaged signals