What role do high-entropy minority tokens play in RLVR?
This explores what high-entropy 'minority' tokens are in reinforcement learning with verifiable rewards (RLVR) — the small fraction of tokens where a reasoning model is genuinely uncertain — and why training seems to hinge on them.
High-entropy 'minority' tokens punch far above their weight in RLVR. The core finding is concrete: only about 20% of tokens in a reasoning trace are high-entropy. These are the 'forking points' where the model faces a real decision about which way the reasoning goes, while the other 80% are low-entropy filler the model was always going to emit. RLVR primarily adjusts these forking tokens, and training exclusively on that 20% matches or even beats updating on every token. The minority carries the learning signal (see: Do high-entropy tokens drive reasoning model improvements?).
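As a rough illustration of what "training only on the forking tokens" could mean, the sketch below computes Shannon entropy per position and keeps the top 20% of positions by entropy. It is a toy under stated assumptions, not the method from the cited note: the probabilities and per-token losses are made up, the names `token_entropy` and `forking_mask` are hypothetical, and the cutoff is taken per trace because the source does not say how the threshold is computed.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def forking_mask(entropies, keep_fraction=0.2):
    """Flag the top `keep_fraction` of positions by entropy (assumed per-trace cutoff)."""
    n_keep = max(1, round(keep_fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [i in keep for i in range(len(entropies))]

# Toy trace: five positions over a four-token vocabulary (made-up numbers).
trace = [
    [0.97, 0.01, 0.01, 0.01],      # confident continuation
    [0.25, 0.25, 0.25, 0.25],      # genuine fork: maximal uncertainty
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.85, 0.05, 0.05, 0.05],
]
entropies = [token_entropy(p) for p in trace]
mask = forking_mask(entropies, keep_fraction=0.2)

# Hypothetical per-token losses; positions outside the mask contribute nothing.
per_token_loss = [0.3, 1.2, 0.4, 0.1, 0.5]
masked_loss = sum(l for l, m in zip(per_token_loss, mask) if m) / sum(mask)
```

In this toy trace only the second position, the uniform distribution, survives the mask, so the update would be driven entirely by that one decision point.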
What makes this interesting is how it reframes what RLVR is actually doing. A cluster of notes argues RLVR doesn't teach new reasoning at all; it sharpens access to reasoning the base model already had. Pass@k analysis shows base models can match or beat RLVR models when allowed many attempts, suggesting RLVR narrows sampling toward solutions already in the distribution rather than expanding the boundary (see: Does RLVR actually expand what models can reason about?). Seen through the entropy lens, that 'narrowing' is the model becoming more decisive at the forking points. The same logic explains the startling result that even random or spurious rewards can improve reasoning in some models: the reward isn't injecting knowledge; it's triggering a phase transition that reorganizes behavior at exactly those high-entropy decision points (see: Why does RLVR work with completely random rewards?; Why do random rewards improve reasoning for some models but not others?).
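To make the pass@k argument concrete, here is a minimal sketch using the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The source does not say which estimator the cited analysis used, so that choice is an assumption, and the per-problem correct counts below are invented purely to show how a model can win at k = 1 yet lose at large k.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of them correct."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws, so some draw must succeed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical correct counts out of n = 100 samples on four problems.
n = 100
base_correct = [2, 1, 3, 1]    # base model: rarely right, but on every problem
rlvr_correct = [60, 50, 0, 0]  # tuned model: often right, but on only two problems

def mean_pass_at_k(correct_counts, k):
    """Average pass@k over problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

for k in (1, 100):
    print(k, round(mean_pass_at_k(base_correct, k), 4),
          round(mean_pass_at_k(rlvr_correct, k), 4))
```

With these invented counts the tuned model is far ahead at k = 1, while at k = 100 the base model solves every problem and the tuned model solves only half. That is the shape of the narrowing claim: probability mass concentrates on some solutions while others drop out of reach.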
But decisiveness has a dark side, and this is where the story gets sharp. If RLVR works by collapsing uncertainty at forking tokens, then collapsing it too aggressively is exactly the failure mode. One note describes 'capability boundary collapse': RLVR prioritizes exploitation over exploration until the model's problem-solving scope actually shrinks. The proposed fix is to explicitly reward exploration of underused reasoning paths, i.e. to keep some of that productive uncertainty alive (see: Why does RLVR training narrow a model's problem solving ability?). A related note shows RL converging on a single dominant output format within the first epoch while suppressing alternatives (see: Does RL training collapse format diversity in pretrained models?). High-entropy tokens are the substrate this pressure acts on; the question is whether you're sharpening them or flattening them.
The minority tokens also help explain when RLVR goes wrong. Training on near-impossible problems lets rare accidental successes get treated as high-advantage trajectories, reinforcing degenerate shortcuts like answer-repetition that then contaminate genuine capability (see: Do overly hard RLVR samples actually harm model capabilities?). And even when forking tokens are tuned well, the gains can be cosmetic: RLVR reliably improves local step-to-step coherence without guaranteeing the proof is globally valid (see: Does RLVR actually improve mathematical reasoning or just coherence?), and benchmark jumps can reflect memorization on contaminated data rather than the behavioral activation RLVR genuinely produces (see: Can genuine reasoning activation coexist with contaminated benchmarks?; Does RLVR success on math benchmarks reflect genuine reasoning improvement?).
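The arithmetic behind "rare accidental successes get treated as high-advantage trajectories" can be sketched with a group-relative advantage of the form (reward - group mean) / group std. Reading the sources' "group-relative normalization" as this mean/std formula is an assumption, and the group size of 16 and the binary rewards are illustrative only.

```python
import math

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and population std of its sampling group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Near-impossible problem: 1 lucky success among 16 rollouts (illustrative).
hard = [1.0] + [0.0] * 15
# Moderately hard problem: 8 successes among 16 rollouts.
medium = [1.0] * 8 + [0.0] * 8

adv_hard = group_relative_advantages(hard)
adv_medium = group_relative_advantages(medium)

print(round(adv_hard[0], 2), round(adv_medium[0], 2))
```

For binary rewards with success rate p, a success's advantage works out to roughly sqrt((1 - p) / p), so the lone lucky rollout here is weighted almost four times as heavily as a success on the balanced problem. If that lucky rollout got there by a shortcut, the shortcut is what gets reinforced.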
The thing you didn't know you wanted to know: 'reasoning training' may be far more surgical than it sounds. A handful of uncertain moments per trace appear to be where almost all the learning lives, which is why you can train on a fifth of the tokens at no cost in performance, why a wrong reward can still help some models, and why pushing too hard turns a strength into capability collapse. The art of RLVR is managing entropy at the forking points, not piling on reward signal everywhere.
Sources (10 notes)
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
RLVR works nearly as well with spurious rewards as correct ones because it catalyzes a phase transition in model output distribution. The effectiveness depends on pretraining quality, not reward signal quality or training volume.
Qwen2.5-Math improves 16-25% on MATH-500 from random or incorrect rewards by activating latent code-reasoning behavior from pretraining, while Llama and OLMo show no gains. Pretraining format determines what optimization pressure can surface.
RLVR narrows models' problem-solving scope by prioritizing exploitation over exploration, a phenomenon called capability boundary collapse. Multiple importance sampling with exploration-based advantage functions can counteract this by integrating external data and explicitly rewarding discovery of underexplored but valuable reasoning paths.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.
RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.
Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.