Why does policy entropy collapse primarily at the token level rather than in hidden states?
This explores a surprising finding: that the famous 'entropy collapse' in reasoning RL shows up when you measure at the level of token choices, but largely vanishes when you look at the model's internal hidden states instead.
Entropy collapse, the way RL-trained models stop exploring and converge on narrow strategies, appears to be primarily a *token-level* phenomenon rather than something happening deep in the model's hidden representations. The corpus suggests two reasons: token-level entropy is partly the wrong measuring stick, and exploration and exploitation live in different places than we assumed.
The sharpest piece here argues the exploration-exploitation trade-off isn't fundamental at all: it's a measurement artifact of looking at token probabilities (see "Is the exploration-exploitation trade-off actually fundamental?"). When you instead probe hidden states using 'Effective Rank' (roughly, how many independent directions the model's internal activity spans), the hidden-state measures of exploration and exploitation show near-zero correlation. In other words, a policy can keep its internal representational richness intact while its *output distribution* over tokens sharpens toward a few high-reward choices. The collapse you see in token entropy isn't necessarily a collapse in what the model can represent; it's a collapse in what it commits to saying.
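To make the distinction concrete, here is a minimal sketch of the two measurements side by side. It assumes the common definition of effective rank (the exponential of the Shannon entropy of the normalized singular values); the source note may use a different variant. The data, the `readout` matrix, and the function names are hypothetical toy choices, not anything taken from the cited work.

```python
import numpy as np

def effective_rank(hidden, eps=1e-12):
    """Effective rank of a (num_tokens, hidden_dim) activation matrix.

    Assumed definition: exp of the Shannon entropy of the singular values
    after normalizing them to sum to 1.
    """
    s = np.linalg.svd(np.asarray(hidden, dtype=float), compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]  # drop numerically-zero directions so log() is safe
    return float(np.exp(-(p * np.log(p)).sum()))

def mean_token_entropy(logits):
    """Mean Shannon entropy (nats) of softmax(logits) across positions."""
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return float(-(np.exp(logp) * logp).sum(axis=-1).mean())

# Toy data: random "hidden states" and a random, hypothetical readout matrix.
rng = np.random.default_rng(0)
hidden = rng.standard_normal((64, 16))
readout = rng.standard_normal((16, 50))
logits = hidden @ readout

diffuse = mean_token_entropy(logits)       # original output distribution
sharp = mean_token_entropy(10.0 * logits)  # same hidden states, sharper readout
rank = effective_rank(hidden)              # computed from hidden states only
```

In this toy setup the token entropy drops (`sharp < diffuse`) while `rank` is untouched, because the sharpening happens entirely in the readout. That is an illustration of how the two metrics can decouple, not evidence that real RL training behaves this way.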
That reframing matters because the standard story treats token entropy as the master variable. There's a well-documented empirical law where reasoning performance saturates as policy entropy approaches zero, with interventions like Clip-Cov and KL-Cov designed specifically to slow that entropy drain (see "Does policy entropy collapse limit reasoning performance in RL?"). The token-level concentration also has a mechanistic explanation: only a small minority of tokens, the high-entropy 'forking points' where the model actually decides between reasoning paths, carry the learning signal, and RLVR primarily adjusts exactly those tokens (see "Do high-entropy tokens drive reasoning model improvements?"). RL is sharpening the few decision points that determine the trajectory, so the measurable entropy loss shows up loudest in the token stream while the bulk of the hidden-state machinery is left comparatively untouched.
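A rough sketch of the two quantities in this paragraph: per-token entropy with a top-fraction 'forking token' mask, and the empirical law quoted in the source notes, R = -a·exp(H) + b. The 20% fraction comes from the source notes. Selecting the top fraction within a single sequence, the function names, and the coefficient values used in the example are all assumptions made for illustration.

```python
import numpy as np

def token_entropies(logits):
    """Shannon entropy (nats) at each position; logits is (seq_len, vocab)."""
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(np.exp(logp) * logp).sum(axis=-1)

def forking_mask(entropies, top_fraction=0.2):
    """Boolean mask marking the highest-entropy positions of one sequence.

    Exactly k = round(top_fraction * len) positions are selected (at least 1).
    """
    entropies = np.asarray(entropies, dtype=float)
    k = max(1, int(round(top_fraction * entropies.size)))
    mask = np.zeros(entropies.shape, dtype=bool)
    mask[np.argsort(entropies)[-k:]] = True
    return mask

def predicted_performance(entropy, a, b):
    """Empirical law from the source notes: R = -a * exp(H) + b.

    a and b are fitted coefficients; the values passed below are made up.
    """
    return -a * np.exp(entropy) + b

# Ten positions, two of which are clearly high-entropy decision points.
example_entropies = np.array(
    [0.01, 0.02, 1.90, 0.05, 0.03, 2.30, 0.04, 0.02, 0.01, 0.06]
)
mask = forking_mask(example_entropies)  # True at indices 2 and 5 only

# As H approaches 0, exp(H) approaches 1, so R approaches b - a.
ceiling = predicted_performance(0.0, a=0.3, b=0.9)
```

A mask like this could be multiplied into a per-token policy-gradient loss so that only the forking positions receive gradient. The law also shows why entropy drain is costly: once H is near zero, performance sits at the fixed ceiling b - a and further sharpening buys nothing.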
There's also evidence the hidden states are doing something genuinely different from token outputs. Distilled reasoning models develop cyclic structure in their hidden-state 'reasoning graphs' (loops where the model revisits and reconsiders intermediate answers), and this cyclicity correlates with accuracy and maps onto documented 'aha moments' (see "Do reasoning cycles in hidden states reveal aha moments?"). That kind of internal reconsideration is precisely the exploratory behavior you'd expect to survive even as the surface-level token distribution narrows. The exploration moved inward; the metric stayed at the surface.
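As a rough illustration of what counting such loops could look like, the sketch below assumes each reasoning step's hidden state has already been assigned to a discrete node ID (for example by clustering, which is omitted here). It counts a cycle whenever the trajectory returns to a node it has previously visited. Both the node-assignment step and this particular definition of a cycle are assumptions, not the cited note's exact procedure.

```python
from itertools import groupby

def count_cycles(node_path):
    """Count returns to previously visited nodes along a reasoning path.

    node_path: sequence of hashable node IDs, one per reasoning step.
    Consecutive repeats are collapsed first, since staying in the same
    node is not treated as a loop.
    """
    path = [node for node, _ in groupby(node_path)]
    seen = set()
    cycles = 0
    for node in path:
        if node in seen:
            cycles += 1
        seen.add(node)
    return cycles

# Hypothetical paths: the first revisits node 1 once, the second never loops.
looping = count_cycles([0, 1, 2, 1, 3])   # 1
straight = count_cycles([0, 0, 1, 1, 2])  # 0
```

Comparing this count between a base model and a distilled reasoning model on the same prompts is the kind of measurement the paragraph describes, though the real analysis depends heavily on how the nodes are defined.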
The practical sting is that token-level collapse is still real and still costs you. The same entropy-collapse mechanism that narrows reasoning also squeezes behavioral diversity in search agents, with SFT on diverse demonstrations needed to preserve breadth that RL strips away (see "Does reinforcement learning squeeze exploration diversity in search agents?"). So the lesson isn't 'entropy collapse is fake'. It's that diagnosing it by token entropy alone can mislead you into thinking the model lost capacity it actually still holds in its hidden states, and that interventions aimed at the hidden-state level may preserve performance better than blunt token-entropy bonuses.
Sources (5 notes)
Is the exploration-exploitation trade-off actually fundamental?
Hidden-state analysis using Effective Rank metrics shows near-zero correlation between exploration and exploitation, revealing that the trade-off emerges only at the token level. VERL demonstrates that both can be enhanced simultaneously, achieving 21.4% accuracy gains on Gaokao 2024.
Does policy entropy collapse limit reasoning performance in RL?
The empirical law R = -a·exp(H) + b (R: performance, H: policy entropy, a and b: fitted coefficients) shows performance saturating as policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.
Do high-entropy tokens drive reasoning model improvements?
Only ~20% of tokens exhibit high entropy and act as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Do reasoning cycles in hidden states reveal aha moments?
Distilled reasoning models show ~5 cycles per sample versus near-zero in base models, and cyclicity correlates with accuracy. These cycles in hidden-state reasoning graphs map directly to RL-trained models' documented aha moments, the points at which models reconsider intermediate answers.
Does reinforcement learning squeeze exploration diversity in search agents?
RL training compresses behavioral diversity in search agents through the same entropy-collapse mechanism documented in reasoning: policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for scaling RL-trained search agents.