What makes uncertainty tokens like Wait carry more information than content tokens?
This explores why a handful of "thinking" tokens — like *Wait*, *Therefore*, *Hmm* — seem to matter far more for reasoning than the ordinary words around them, and what "more information" actually means here.
The short answer the corpus suggests: these tokens sit at decision points, not description points. Most tokens in a chain of thought just spell out a path the model has already committed to. A small minority mark the moments where the path could fork, where the model pauses, reconsiders, or commits to a direction. "Information" here is measured by how much a token's presence shifts the odds of landing on a correct answer, and these uncertainty tokens spike on exactly that measure. One study finds that tokens like *Wait* and *Therefore* are literal peaks in mutual information with the correct answer. Crucially, suppressing them damages reasoning, while suppressing the same number of random tokens does almost nothing (see "Do reflection tokens carry more information about correct answers?").
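To make "information" concrete, here is a minimal sketch of mutual information between two binary variables: whether a token appears in a sampled chain, and whether that chain ends in a correct answer. This simplifies the measurement the study describes, and the function name and toy counts are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Mutual information in bits between two discrete variables,
    estimated from a list of (x, y) observations."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Hypothetical toy data: (token_present, answer_correct) for 100 sampled chains.
# Chains containing "Wait" are mostly correct; "the" is unrelated to correctness.
wait_runs = [(1, 1)] * 40 + [(1, 0)] * 10 + [(0, 1)] * 10 + [(0, 0)] * 40
the_runs = [(1, 1)] * 25 + [(1, 0)] * 25 + [(0, 1)] * 25 + [(0, 0)] * 25

mi_wait = mutual_information(wait_runs)  # clearly above zero
mi_the = mutual_information(the_runs)    # zero: presence says nothing about correctness
```

A token whose presence is statistically independent of the outcome scores zero on this measure, however often it appears.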
The same pattern shows up from a completely different angle in reinforcement learning. Only about 20% of tokens carry high entropy, meaning the model is genuinely uncertain which token comes next, and these "forking" tokens are where reasoning decisions actually happen. Training a model only on that high-entropy minority matches or beats training on every token, which says the learning signal lives in the uncertainty, not the content (see "Do high-entropy tokens drive reasoning model improvements?"). So uncertainty and informativeness turn out to be two views of the same thing: a token carries information precisely because the model wasn't sure, and resolving that uncertainty is what steers the outcome.
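As a rough illustration of the entropy idea, the sketch below scores each position by the Shannon entropy of its next-token distribution, keeps roughly the top 20%, and averages a per-token loss over only those positions. The helper names, the toy distributions, and the loss values are assumptions made for the example, not the training setup from the study.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def high_entropy_mask(distributions, fraction=0.2):
    """Mark the positions whose entropy falls in the top `fraction`.
    Ties at the threshold may select slightly more positions."""
    ents = [entropy(d) for d in distributions]
    k = max(1, round(fraction * len(ents)))
    threshold = sorted(ents, reverse=True)[k - 1]
    return [e >= threshold for e in ents]

def masked_mean(losses, mask):
    """Average the per-token losses over the selected positions only."""
    kept = [loss for loss, keep in zip(losses, mask) if keep]
    return sum(kept) / len(kept)

# Hypothetical next-token distributions for five positions in a chain.
dists = [
    [0.97, 0.01, 0.01, 0.01],    # confident continuation
    [0.40, 0.30, 0.20, 0.10],    # a fork: several plausible next tokens
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.85, 0.10, 0.03, 0.02],
]
losses = [0.03, 0.92, 0.11, 0.01, 0.16]  # hypothetical per-token losses

mask = high_entropy_mask(dists)
fork_loss = masked_mean(losses, mask)
```

Only the second position survives the mask here, so the averaged loss comes entirely from the fork.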
What's neat is that you can find these tokens without anyone labeling them. If you sample many chains of thought for the same problem and watch how certain the model is about each token of the reference answer, a small subset swings wildly depending on the reasoning that came before while most stay stable. That variance, computable from the model's own samples, fingerprints the reasoning-bearing tokens (see "Can we identify which tokens actually matter for reasoning?"). A related line of work prunes reasoning chains greedily while preserving likelihood and finds that symbolic-computation tokens are preferentially kept while grammar and meta-discourse go first, so the model itself behaves as if it knows which tokens are load-bearing (see "Which tokens in reasoning chains actually matter most?").
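The variance idea can be sketched in a few lines. Assume you already have, for each sampled chain of thought, the probability the model assigns to each token of the reference answer; the probabilities below are made up, and `rank_by_certainty_variance` is a hypothetical helper rather than the method from the note.

```python
from statistics import pvariance

def rank_by_certainty_variance(tokens, prob_samples):
    """prob_samples[i][j] is the probability of answer token j when it is
    preceded by sampled chain i. Returns (token, variance) pairs, with the
    tokens whose certainty depends most on the chain listed first."""
    per_token = zip(*prob_samples)  # transpose to one column per answer token
    scores = [pvariance(column) for column in per_token]
    return sorted(zip(tokens, scores), key=lambda pair: pair[1], reverse=True)

# Hypothetical data: four sampled chains, one four-token reference answer.
answer_tokens = ["The", "answer", "is", "42"]
prob_samples = [
    [0.98, 0.97, 0.99, 0.90],
    [0.97, 0.96, 0.99, 0.15],
    [0.99, 0.97, 0.98, 0.85],
    [0.98, 0.98, 0.99, 0.20],
]

ranked = rank_by_certainty_variance(answer_tokens, prob_samples)
most_sensitive = ranked[0][0]
```

In this toy data the scaffolding words barely move across chains, while the token that carries the result swings between near-certain and unlikely, so it ranks first.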
Here's the part you might not have expected: this uncertainty signal is useful well beyond reasoning chains. The same token-level uncertainty that makes *Wait* informative can be read off as calibrated confidence and put to work. Simple token-probability uncertainty estimates beat elaborate adaptive-retrieval schemes on single-hop tasks and match them on multi-hop tasks when deciding whether a model should go look something up, while using a fraction of the model and retriever calls. The model's own self-knowledge turns out to be more reliable than external heuristics (see "Can simple uncertainty estimates beat complex adaptive retrieval?"). Small models trained to be uncertainty-aware and to abstain when unsure can match models ten times larger on conversation forecasting (see "Can models learn to abstain when uncertain about predictions?"), and a model's confidence even predicts whether it will buckle under a reworded prompt (see "Does model confidence predict robustness to prompt changes?").
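A minimal sketch of an uncertainty-gated retrieval decision, assuming you have the log-probabilities of the tokens in a draft answer. The geometric-mean confidence score and the 0.7 threshold are illustrative assumptions, not values taken from the cited work.

```python
import math

def mean_token_confidence(token_logprobs):
    """Geometric-mean token probability of a draft answer."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def should_retrieve(token_logprobs, threshold=0.7):
    """Retrieve only when the model's own confidence falls below the
    threshold (a hypothetical value that would need tuning per task)."""
    return mean_token_confidence(token_logprobs) < threshold

# Hypothetical per-token probabilities for two draft answers.
confident_draft = [math.log(p) for p in (0.95, 0.90, 0.97)]
uncertain_draft = [math.log(p) for p in (0.50, 0.30, 0.60)]

retrieve_confident = should_retrieve(confident_draft)
retrieve_uncertain = should_retrieve(uncertain_draft)
```

The appeal of this kind of gate is cost: it reuses probabilities the model has already produced, so the decision needs no extra model or retriever calls.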
One provocative extension: if discrete uncertainty tokens are where reasoning forks, why force the model to pick one? "Soft Thinking" keeps the full probability distribution as a continuous "concept token," letting the model hold multiple reasoning paths in superposition instead of collapsing to a single word, which improves accuracy while cutting token count (see "Can we explore multiple reasoning paths without committing to one token?"). That reframes the whole question: uncertainty tokens carry more information because they encode a branching decision the model would otherwise be forced to throw away.
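The contrast between picking one token and keeping the whole distribution can be shown with a toy embedding table. This is only a sketch of the probability-weighted mixing step, with a made-up three-token vocabulary and two-dimensional embeddings; it does not reproduce the rest of the method, such as its entropy-based early stopping.

```python
def concept_token(probs, embeddings):
    """Probability-weighted average of the token embeddings."""
    dim = len(embeddings[0])
    return [
        sum(p * emb[d] for p, emb in zip(probs, embeddings))
        for d in range(dim)
    ]

def discrete_token(probs, embeddings):
    """Standard decoding: commit to the single most likely token."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return embeddings[best]

# Hypothetical vocabulary of three tokens with 2-d embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = [0.5, 0.25, 0.25]

soft = concept_token(probs, embeddings)   # blends all three candidates
hard = discrete_token(probs, embeddings)  # keeps only the first candidate
```

The discrete choice discards the half of the probability mass that sat on the other two tokens; the weighted mixture passes it forward as the input to the next step.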
Sources (8 notes)
Specific tokens like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers. Suppressing them harms reasoning while suppressing equal random tokens does not, and representation recycling improves accuracy 20%.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
A small subset of tokens in reference answers change their certainty sharply depending on which chain of thought precedes them, while most tokens remain stable. This variance pattern, computable from the model's own samples, identifies reasoning-bearing tokens without supervision.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.
Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.
ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.
Training-free method replaces discrete token selection with probability-weighted concept embeddings, preserving superposition of reasoning paths. Improves accuracy up to 2.48 points while reducing tokens 22.4% via entropy-based early stopping.