INQUIRING LINE


Researchers thought RL training was making AI models 'narrow' — but the narrowing might only exist in how they measured it.

Why does policy entropy collapse primarily at token level rather than hidden states?

This explores a surprising finding: that the famous 'entropy collapse' in reasoning RL shows up when you measure at the level of token choices, but largely vanishes when you look at the model's internal hidden states instead.

The question is why entropy collapse, the way RL-trained models stop exploring and converge on narrow strategies, appears to be primarily a *token-level* phenomenon rather than something happening deep in the model's hidden representations. The corpus suggests the answer is partly that token-level entropy is the wrong measuring stick, and partly that exploration and exploitation live in different places than we assumed.

The sharpest piece here argues that the exploration-exploitation trade-off isn't fundamental at all: it's a measurement artifact of looking at token probabilities (see 'Is the exploration-exploitation trade-off actually fundamental?'). When you instead probe hidden states using 'Effective Rank' (roughly, how many independent directions the model's internal activity spans), exploration and exploitation show near-zero correlation. In other words, a policy can keep its internal representational richness intact while its *output distribution* over tokens sharpens toward a few high-reward choices. The collapse you see in token entropy isn't necessarily a collapse in what the model can represent; it's a collapse in what it commits to saying.
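To make 'Effective Rank' concrete, here is a minimal sketch using one common definition: the exponential of the Shannon entropy of the normalized singular values. This is an assumption for illustration; the cited work may define or normalize the metric differently, and the function name `effective_rank` and the example matrices are hypothetical.

```python
import numpy as np


def effective_rank(hidden_states: np.ndarray, eps: float = 1e-12) -> float:
    """Effective rank of a (num_tokens x hidden_dim) matrix.

    Assumed definition: exp(H(p)), where p is the singular-value
    spectrum normalized to sum to 1 and H is Shannon entropy in nats.
    """
    singular_values = np.linalg.svd(hidden_states, compute_uv=False)
    total = singular_values.sum()
    if total <= eps:
        return 0.0  # an all-zero matrix spans no directions
    p = singular_values / total
    p = p[p > eps]  # drop numerically zero directions before taking logs
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))


# Four equally used, independent directions: effective rank is about 4.
spread_out = np.eye(4)

# Every row is a multiple of the same vector: effective rank is about 1.
one_direction = np.outer(np.ones(5), np.array([1.0, 2.0, 3.0]))

print(effective_rank(spread_out), effective_rank(one_direction))
```

Under this definition, nothing in the computation depends on the output token distribution, which is why a hidden-state metric can stay high while token entropy falls.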

That reframing matters because the standard story treats token entropy as the master variable. There's a well-documented empirical law where reasoning performance saturates as policy entropy approaches zero, with interventions like Clip-Cov and KL-Cov designed specifically to slow that entropy drain (see 'Does policy entropy collapse limit reasoning performance in RL?'). But if entropy collapse is concentrated at the token level, it makes sense *why* it concentrates there: only a small minority of tokens, the high-entropy 'forking points' where the model actually decides between reasoning paths, carry the learning signal, and RLVR primarily adjusts exactly those tokens (see 'Do high-entropy tokens drive reasoning model improvements?'). RL is sharpening the few decision points that determine the trajectory, so the measurable entropy loss shows up loudest in the token stream while the bulk of the hidden-state machinery is left comparatively untouched.
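The token-level quantities in this paragraph can be sketched in a few lines. This is an illustration only: the toy distributions are made up, the helper names `token_entropy` and `forking_mask` are hypothetical, and the 20% cutoff is taken from the note summary for this line rather than from any particular training recipe.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def forking_mask(entropies: Sequence[float], top_fraction: float = 0.2) -> List[bool]:
    """Flag the top_fraction highest-entropy positions as 'forking points'."""
    n = len(entropies)
    if n == 0:
        return []
    k = max(1, math.ceil(top_fraction * n))  # always keep at least one position
    ranked = sorted(range(n), key=lambda i: entropies[i], reverse=True)
    chosen = set(ranked[:k])
    return [i in chosen for i in range(n)]


# Toy next-token distributions for five positions (hypothetical values).
step_distributions = [
    [0.97, 0.01, 0.01, 0.01],  # near-deterministic continuation
    [0.25, 0.25, 0.25, 0.25],  # genuine fork between four options
    [0.90, 0.10, 0.00, 0.00],
    [1.00, 0.00, 0.00, 0.00],  # fully committed
    [0.70, 0.20, 0.10, 0.00],
]
entropies = [token_entropy(d) for d in step_distributions]
mask = forking_mask(entropies)
print(mask)
```

A training loop that updates only the flagged positions would apply the policy gradient where `mask` is true and skip the rest; that step is omitted here.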

There's also evidence the hidden states are doing something genuinely different from token outputs. Distilled reasoning models develop cyclic structure in their hidden-state 'reasoning graphs' (loops where the model revisits and reconsiders intermediate answers), and this cyclicity correlates with accuracy and maps onto documented 'aha moments' (see 'Do reasoning cycles in hidden states reveal aha moments?'). That kind of internal reconsideration is precisely the exploratory behavior you'd expect to survive even as the surface-level token distribution narrows. On this reading, the exploration moved inward; the metric stayed at the surface.
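As a rough illustration of what a 'cycle' in a reasoning path means, the sketch below counts how often a path returns to a state it has already left. It is a simplified proxy, not the cited work's metric: it assumes each reasoning step has already been mapped to a discrete node label (for example by clustering hidden states, a step omitted here), and the labels and the name `count_revisits` are hypothetical.

```python
from typing import Hashable, Sequence


def count_revisits(path: Sequence[Hashable]) -> int:
    """Count returns to a node that was visited earlier in the path.

    Consecutive repeats of the same node are ignored, since staying in
    one state is not a loop back to an earlier one.
    """
    seen = set()
    revisits = 0
    previous = None
    for node in path:
        if node == previous:
            continue  # still in the same state
        if node in seen:
            revisits += 1  # the path has looped back
        seen.add(node)
        previous = node
    return revisits


# A path that reconsiders an intermediate answer twice (hypothetical labels).
looping_path = ["setup", "try_a", "check", "try_a", "check", "answer"]

# A path that never looks back.
linear_path = ["setup", "try_a", "check", "answer"]

print(count_revisits(looping_path), count_revisits(linear_path))
```

The count depends only on the sequence of internal states, so in principle it can stay high even when the token distribution at each step is sharply peaked.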

The practical sting is that token-level collapse is still real and still costs you. The same entropy-collapse mechanism that narrows reasoning also squeezes behavioral diversity in search agents, with SFT on diverse demonstrations needed to preserve breadth that RL strips away (see 'Does reinforcement learning squeeze exploration diversity in search agents?'). So the lesson isn't 'entropy collapse is fake'. It's that diagnosing it by token entropy alone can mislead you into thinking the model lost capacity it actually still holds in its hidden states, and that interventions aimed at the hidden-state level may preserve performance better than blunt token-entropy bonuses.

Sources 5 notes

Is the exploration-exploitation trade-off actually fundamental?

Hidden-state analysis using Effective Rank metrics shows near-zero correlation between exploration and exploitation, revealing the trade-off emerges only at token level. VERL demonstrates simultaneous enhancement achieving 21.4% accuracy gains on Gaokao 2024.

Does policy entropy collapse limit reasoning performance in RL?

Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Do reasoning cycles in hidden states reveal aha moments?

Distilled reasoning models show ~5 cycles per sample versus near-zero in base models, and cyclicity correlates with accuracy. These cycles in hidden-state reasoning graphs directly map to RL-trained models' documented aha moments—moments when models reconsider intermediate answers.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
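The empirical law quoted in the second note above can be unpacked with a short worked derivation. The sign assumption a > 0 is not stated in the note; it is assumed here because it is the case in which performance rises as entropy falls.

```latex
\begin{align*}
R(H) &= -a\,e^{H} + b, \qquad a > 0 \ \text{(assumed)} \\
\frac{dR}{dH} &= -a\,e^{H} < 0
  && \text{performance rises only as entropy is spent} \\
\lim_{H \to 0} R(H) &= b - a
  && \text{the predicted ceiling once entropy is exhausted}
\end{align*}
```

Read this way, the fitted constants a and b fix the ceiling b - a in advance, which is why slowing the entropy drain matters under this law: once H reaches zero, the fit leaves no further performance to gain.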

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question: *Does policy entropy collapse truly concentrate at the token level, or does that reflect measurement choice rather than mechanistic reality?* Treat this as still-open.

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat all as perishable:
• Exploration-exploitation trade-off at token level is a measurement artifact; hidden-state Effective Rank metrics show near-zero correlation between exploration and exploitation (~2025, arXiv:2509.23808).
• Only a small high-entropy minority of tokens (~20%) carry the learning signal in RLVR; RL sharpens these forking points while bulk hidden-state machinery remains untouched (~2026, arXiv:2605.22817).
• Reasoning performance saturates as policy entropy collapses, following the empirical law R = -a·exp(H) + b; interventions like Clip-Cov and KL-Cov are designed to slow the entropy drain (~2025, arXiv:2505.22617).
• Hidden-state reasoning graphs develop cyclic structure (loops revisiting intermediate answers) that correlates with accuracy; whether it survives token-level collapse is this line's inference, not a reported result (~2025, arXiv:2506.05744).
• RL training squeezes behavioral diversity in search agents; SFT on diverse demonstrations needed to preserve breadth (~2026, arXiv:2605.22817).

Anchor papers (verify; mind their dates):
• arXiv:2509.23808 (2025-09) — Hidden state approach displacing token-entropy framing.
• arXiv:2505.22617 (2025-05) — Entropy mechanism as bottleneck.
• arXiv:2506.05744 (2025-06) — Reasoning graph topology and cyclicity.
• arXiv:2605.22817 (2026-05) — Vector Policy Optimization reconciling diversity with test-time performance.

Your task:
(1) RE-TEST THE TOKEN VS. HIDDEN-STATE SPLIT. Has post-2026 work on mechanistic interpretability, probe-based attribution, or causal intervention (e.g., ablating high-entropy tokens) confirmed that token collapse is *orthogonal* to representational collapse? Or have newer evals (e.g., on chain-of-thought faithfulness or multi-step reasoning) found cases where token entropy *does* degrade hidden-state capacity? Separate: *Are token and hidden-state entropy genuinely decoupled (durable)* from *Can token-level interventions alone preserve reasoning (possibly outdated)*.
(2) Surface the strongest recent work (last 6 mo.) that either **contradicts** the hidden-state resilience claim or **supersedes** token-entropy as the right diagnostic. Have new test-time scaling laws, distillation techniques, or multi-agent orchestration shifted the bottleneck away from entropy collapse altogether?
(3) Propose 2 research questions assuming the regime has moved: (a) *If token collapse is decoupled from representational collapse, what is the actual bottleneck on reasoning length or diversity?* (b) *Can you design an RL objective that operates on hidden-state structure (e.g., Effective Rank targets) rather than token entropy, and does it outperform entropy regularization?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
