SYNTHESIS NOTE
Training, RL, and Test-Time Scaling Model Architecture and Internals Reasoning, Retrieval, and Evaluation

Do high-entropy tokens drive reasoning model improvements?

Explores whether only a small fraction of tokens—those with high entropy at decision points—actually matter for improving reasoning performance in language models, and whether training on them alone could work as well as full training.

Synthesis note · 2026-02-22 · sourced from RLVR
How should researchers navigate LLM reasoning research? What does reward learning actually do to model reasoning?

In Chain-of-Thought reasoning, token entropy distribution follows a distinct pattern: the vast majority of tokens are generated with low entropy (completing ongoing linguistic structures), while a critical minority emerge with high entropy (functioning as pivotal decision points that determine the trajectory among multiple potential pathways). These high-entropy "forking tokens" are where the model actually decides between reasoning directions.

Three converging findings establish their primacy:

Causal role confirmed by intervention. Moderately increasing entropy of forking tokens during decoding measurably improves reasoning performance. Artificially reducing their entropy degrades it. The tokens are not just correlated with reasoning quality — they causally determine it.

RLVR primarily operates on forking tokens. Analysis of entropy evolution during RLVR training shows the reasoning model largely retains the base model's entropy patterns, with only gradual changes. Critically, RLVR primarily adjusts the entropy of high-entropy tokens while low-entropy tokens vary only minimally. The training signal is concentrated where it matters.

Sparse training matches or exceeds full training. Restricting policy gradient updates to the 20% highest-entropy tokens matches performance of full-gradient updates on Qwen3-8B and significantly surpasses full-gradient on Qwen3-32B (+11.04 on AIME'25) and Qwen3-14B (+4.79 on AIME'25). Training on the 80% lowest-entropy tokens leads to marked decline. This "beyond 80/20 rule" shows the minority carries the learning signal.

Since Does reinforcement learning update only a small fraction of parameters?, there is a striking parallel: RL operates on sparse critical subsets at both the parameter level (5-30% of parameters) and the token level (20% of tokens). The sparsity is not a limitation but a feature — concentrating the learning signal where it has leverage.

Since Which sentences actually steer a reasoning trace?, forking tokens are the token-level mechanistic correlate of thought anchors. Both identify critical decision points in reasoning, but at different granularities — thought anchors at the sentence level, forking tokens at the individual token level.

The sparse-token-leverage meta-claim. The convergence across signals is the load-bearing meta-claim. Four independent statistical lenses — token entropy during RLVR training (this paper), mutual-information peaks during inference (Do reflection tokens carry more information about correct answers?), cross-rollout variance under different CoT prefixes (Can we identify which tokens actually matter for reasoning?), and greedy-pruning functional importance (Which tokens in reasoning chains actually matter most?) — all identify the same sparse pivot structure. The signals are computed differently and surface different operational uses (training filter, inference allocation, reward weighting, trace compression), but the underlying claim they share is the same: the reasoning-bearing fraction of a reasoning trace is sparse, and the cheapest path to sample-efficient reasoning training, faithful trace compression, or focused reward signals is to identify those tokens cheaply. Which statistical signal you use depends on what you have access to: entropy when you have only outputs, variance when you can sample rollouts under different prefixes, MI when you have ground-truth answers, functional importance when you can ablate.

Inquiring lines that use this note as a source 181

This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.

Related concepts in this collection 8

This note in its neighbourhood — explore the map, then jump to a related concept in the list below.

Concept map
18 direct connections · 153 in 2-hop network ·medium cluster Open in graph ↗

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

Original note title

high-entropy minority tokens are the critical forking points that drive rlvr effectiveness — restricting gradient updates to 20 percent of tokens matches or exceeds full updates