Which tokens actually change across different reasoning paths in rollouts?
This explores which specific tokens vary from one reasoning rollout to another (the unstable, decision-bearing ones) versus the bulk that stay fixed regardless of the path taken. The corpus converges on a striking answer: only a small minority of tokens actually move, and that minority is where the reasoning lives. One line of work shows that a few tokens in a reference answer sharply change their certainty depending on which chain of thought precedes them, while most tokens remain stable across samples. Crucially, you can find these tokens just by measuring variance across the model's own rollouts, without any labels (see: Can we identify which tokens actually matter for reasoning?).
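A minimal sketch of the variance measurement described above. It assumes that, for each sampled chain of thought, you already have the log-probability the model assigns to every token of the reference answer. The function name `high_variance_tokens`, the input layout, and the cutoff `k` are illustrative assumptions, not the cited work's implementation.

```python
from statistics import pvariance

def high_variance_tokens(logprobs_per_rollout, k=1):
    """Return the k answer-token indices whose log-probability varies most
    across rollouts, plus the per-token variances.

    logprobs_per_rollout: list of lists; row r holds the log-probability of
    each reference-answer token given rollout r's chain of thought.
    """
    n_tokens = len(logprobs_per_rollout[0])
    # Variance of each token's log-probability across rollouts (column-wise).
    variances = [
        pvariance([row[t] for row in logprobs_per_rollout])
        for t in range(n_tokens)
    ]
    ranked = sorted(range(n_tokens), key=lambda t: variances[t], reverse=True)
    return ranked[:k], variances

# Toy data (hypothetical): 3 rollouts, 4 reference-answer tokens.
rollouts = [
    [-0.1, -0.2, -3.0, -0.1],
    [-0.1, -0.2, -0.2, -0.1],
    [-0.1, -0.3, -1.5, -0.1],
]
top, variances = high_variance_tokens(rollouts, k=1)
```

In this toy input only the third token's certainty swings with the preceding chain of thought, so it is the one the ranking surfaces; no labels are involved at any point.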
The same minority shows up when you measure entropy instead of variance. Roughly 20% of tokens are high-entropy 'forking points' where the model is genuinely deciding between continuations; reinforcement learning with verifiable rewards (RLVR) mostly adjusts exactly these tokens, and training on them alone matches or exceeds full-gradient updates (see: Do high-entropy tokens drive reasoning model improvements?). And when you ask which tokens carry information about the *correct* answer, the spikes land on reflection and transition words like 'Wait' and 'Therefore': suppress those and reasoning accuracy drops, while suppressing an equal number of random tokens does no comparable harm (see: Do reflection tokens carry more information about correct answers?). Three different lenses (variance, entropy, mutual information) keep pointing at the same sparse set.
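The entropy lens can be sketched in a few lines, assuming you have the model's next-token probability distribution at each position. The helper names `token_entropy` and `forking_mask` and the default `fraction=0.2` are illustrative choices that mirror the "roughly 20%" figure; they are not taken from the cited work's code.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def forking_mask(distributions, fraction=0.2):
    """Mark the top `fraction` of positions by entropy as forking tokens."""
    entropies = [token_entropy(d) for d in distributions]
    n_keep = max(1, round(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)),
                    key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [i in keep for i in range(len(entropies))], entropies

# Toy sequence (hypothetical): 5 positions over a 3-token vocabulary.
dists = [
    [0.98, 0.01, 0.01],
    [0.34, 0.33, 0.33],    # near-uniform: the model is undecided here
    [0.90, 0.05, 0.05],
    [0.99, 0.005, 0.005],
    [0.97, 0.02, 0.01],
]
mask, entropies = forking_mask(dists, fraction=0.2)
```

A mask like this could be multiplied into a per-token loss so that only forking positions contribute gradient, which is one plausible way to realize "training on them alone".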
What distinguishes these tokens from the rest? A complementary study prunes reasoning chains by functional category and finds models preferentially preserve symbolic-computation tokens while discarding grammar and meta-discourse first, so the load-bearing tokens are the ones doing actual work, not the connective filler (see: Which tokens in reasoning chains actually matter most?). This dovetails with the unsettling finding that models trained on systematically irrelevant traces keep their solution accuracy and sometimes generalize better out of distribution: if most tokens are scaffolding rather than meaning, it makes sense that only a few pivotal positions actually steer the outcome (see: Do reasoning traces need to be semantically correct?).
The forking tokens also explain a failure mode. Models often abandon a promising path right at a thought-transition token and switch to another approach prematurely; penalizing exactly those transition tokens during decoding improves accuracy on challenging math problems with no retraining (see: Do reasoning models switch between ideas too frequently?). So the high-variance tokens aren't just diagnostic; they're the control surface. There's even a twist on when this matters: on easy problems the model commits internally long before the reasoning finishes (the tokens are performative), while on hard problems the rollout genuinely tracks belief updates with detectable inflection points (see: Does chain-of-thought reasoning reflect genuine thinking or performance?).
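A decoding-time penalty of this kind can be sketched as a simple logit adjustment. The version below is an assumption-laden illustration: the token ids, the penalty strength `alpha`, and the `window` during which the penalty applies are all hypothetical, and the function is not the cited method's actual implementation.

```python
def penalize_transitions(logits, transition_ids, steps_in_thought,
                         alpha=3.0, window=600):
    """Return a copy of `logits` with thought-transition tokens penalized.

    logits: dict mapping token id -> raw next-token score.
    transition_ids: ids of tokens that open a new thought.
    steps_in_thought: tokens generated since the current thought began.
    alpha, window: hypothetical penalty strength and duration.
    """
    adjusted = dict(logits)  # leave the caller's logits untouched
    if steps_in_thought < window:
        for tok in transition_ids:
            if tok in adjusted:
                adjusted[tok] -= alpha
    return adjusted

def greedy_pick(logits):
    """Pick the highest-scoring token id."""
    return max(logits, key=logits.get)

# Toy vocabulary (hypothetical ids): 7 = "Alternatively", 3 = "so", 5 = "=".
logits = {7: 2.0, 3: 1.5, 5: 0.5}
early = greedy_pick(penalize_transitions(logits, {7}, steps_in_thought=10))
late = greedy_pick(penalize_transitions(logits, {7}, steps_in_thought=900))
```

Early in a thought the penalty pushes the switch token below the continuation token; once the window has passed, the model is free to change approach. No weights are touched, which is what makes this a decoding-only intervention.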
If you take this seriously, an obvious move is to stop forcing a single discrete choice at each fork. That's the bet behind keeping the probability distribution alive as a continuous 'concept token' so multiple reasoning paths stay in superposition rather than collapsing to one branch (see: Can we explore multiple reasoning paths without committing to one token?). It is also the bet behind shared-prefix tree rollouts that branch only at the points where trajectories diverge, getting more distinct paths per token budget by not re-sampling the stable prefix everyone agrees on (see: Can shared-prefix trees reduce redundancy in agent rollouts?). Both designs are, in effect, built around the insight that the tokens worth spending compute on are the few that change.
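The 'concept token' idea can be illustrated as a probability-weighted average of token embeddings fed forward in place of a single sampled token. The sketch below assumes a tiny hand-written embedding table; the names `softmax` and `concept_embedding` and the toy values are illustrative, not the cited method's code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def concept_embedding(logits, embedding_table):
    """Probability-weighted average of token embeddings.

    Instead of committing to one token, every candidate contributes its
    embedding in proportion to its probability.
    """
    probs = softmax(logits)
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]

# Toy setup (hypothetical): 3-token vocabulary, 2-dimensional embeddings.
table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = concept_embedding([0.0, 0.0, 0.0], table)    # uniform distribution
sharp = concept_embedding([50.0, 0.0, 0.0], table)   # nearly one-hot
```

At a stable position the distribution is nearly one-hot and the concept embedding is almost the ordinary token embedding; only at a fork does the mixture differ meaningfully, which matches the claim that the extra machinery matters at the few tokens that change.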
Sources (9 notes)
A small subset of tokens in reference answers change their certainty sharply depending on which chain of thought precedes them, while most tokens remain stable. This variance pattern, computable from the model's own samples, identifies reasoning-bearing tokens without supervision.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Specific tokens like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers. Suppressing them harms reasoning while suppressing an equal number of random tokens does not, and representation recycling improves accuracy by 20%.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.
Activation probes show models commit to answers internally long before finishing their reasoning on easy tasks, but on hard tasks the reasoning process tracks real belief updates with detectable inflection points. Probe-guided early exit reduces tokens by up to 80 percent without accuracy loss.
A training-free method replaces discrete token selection with probability-weighted concept embeddings, preserving a superposition of reasoning paths. It improves accuracy by up to 2.48 points while reducing tokens by 22.4% via entropy-based early stopping.
Tree-structured rollouts that branch from shared prefixes produce more distinct trajectories within a fixed token budget than independent chain sampling. This improves advantage estimation statistics and enables longer-horizon tasks within the same compute constraint.