INQUIRING LINE

What causes policy entropy collapse in reasoning-focused reinforcement learning?

This explores why reasoning-focused RL training tends to lose exploratory diversity — the policy's entropy collapsing toward zero — and what drives that, drawing on the corpus's empirical laws, mechanistic accounts, and counter-interventions.


The clearest answer in the corpus is that entropy collapse isn't a side effect; it's the main bottleneck. There's a strikingly tidy empirical law, R = -a·exp(H) + b, where R is reasoning performance and H is policy entropy. Performance saturates as policy entropy approaches zero: the model trades away its exploratory capacity for reward, and once that capacity is gone, performance hits a ceiling it can't climb past (Does policy entropy collapse limit reasoning performance in RL?). The mechanism is convergence: RL pushes the policy to repeatedly exploit whatever narrow strategy maximizes reward, and the distribution sharpens around those few moves until it can no longer try anything new.
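To make the law concrete, here is a minimal sketch of how one might fit R = -a·exp(H) + b to training checkpoints and read off the implied ceiling. It assumes a and b are constants fitted per run; the function names and the checkpoint values are hypothetical and synthetic, not results from the corpus. Because the law is linear in exp(H), an ordinary least-squares line is enough.

```python
import math

def predicted_reward(entropy, a, b):
    """Evaluate R = -a * exp(H) + b for a given policy entropy H."""
    return -a * math.exp(entropy) + b

def fit_entropy_law(entropies, rewards):
    """Least-squares fit of R = -a * exp(H) + b.

    The law is linear in x = exp(H), so regressing R on x gives
    slope = -a and intercept = b.
    """
    xs = [math.exp(h) for h in entropies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_r = sum(rewards) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxr = sum((x - mean_x) * (r - mean_r) for x, r in zip(xs, rewards))
    slope = sxr / sxx
    intercept = mean_r - slope * mean_x
    return -slope, intercept  # (a, b)

# Synthetic checkpoints generated from a = 0.2, b = 0.9 (illustrative only).
entropies = [1.0, 0.6, 0.3, 0.1, 0.02]
rewards = [predicted_reward(h, 0.2, 0.9) for h in entropies]

a, b = fit_entropy_law(entropies, rewards)
# At H = 0, exp(0) = 1, so the law predicts a ceiling of R = b - a.
ceiling = predicted_reward(0.0, a, b)
```

Under these assumptions the ceiling is simply b - a: once entropy is near zero there is nothing left to trade, which is the saturation the law describes.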

The cause becomes more concrete when you look at *which* tokens carry the entropy. Only about 20% of tokens are high-entropy 'forking points', the genuine decision moments where reasoning could branch one way or another, and RL primarily adjusts these (Do high-entropy tokens drive reasoning model improvements?). Entropy collapse, read through that lens, is the flattening of these forks: the model stops hesitating at the junctions where exploration used to happen. That's also why it isn't unique to math reasoning. Search agents trained with RL show the same compression of behavioral diversity, converging on narrow reward-maximizing strategies through the same mechanism, while SFT on diverse demonstrations keeps the exploration breadth alive (Does reinforcement learning squeeze exploration diversity in search agents?).
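As a rough illustration of what "high-entropy forking points" means operationally, the sketch below computes per-token Shannon entropy from next-token distributions and flags the top 20% of positions. The helper names, the toy distributions, and the choice to rank within a single sequence are all assumptions made for illustration; the corpus does not specify how the threshold is applied.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def forking_token_mask(entropies, top_fraction=0.2):
    """Mark the top `top_fraction` of positions by entropy as forking tokens."""
    n = len(entropies)
    k = max(1, round(n * top_fraction))
    ranked = sorted(range(n), key=lambda i: entropies[i], reverse=True)
    chosen = set(ranked[:k])
    return [i in chosen for i in range(n)]

# Toy distributions over a 4-token vocabulary, one per generated position.
distributions = [
    [0.97, 0.01, 0.01, 0.01],      # confident continuation
    [0.25, 0.25, 0.25, 0.25],      # maximally uncertain: a fork
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.85, 0.05, 0.05, 0.05],
]
entropies = [token_entropy(p) for p in distributions]
mask = forking_token_mask(entropies)
```

Here only the uniform distribution at position 1 is flagged. Tracking how the entropy at flagged positions shrinks over training would be one way to watch the forks flatten.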

A deeper structural cause is hinted at by the finding that numerical rewards are simply *information-poor*. A scalar reward tells the model whether it succeeded but nothing about *why* it failed or how to improve, so the optimization has no signal pointing toward alternative strategies, only toward sharpening the one that currently scores. When models stuck on a plateau are instead given chain-of-thought critiques in natural language, they escape it, which suggests the collapse is partly starvation: the reward channel is too thin to sustain exploration (Can natural language feedback overcome numerical reward plateaus?).

Usefully, the corpus reframes collapse as one half of a paired failure. Training-time entropy collapse and test-time variance inflation are 'dual' problems: both spring from a broken exploration–exploitation balance, but at different timescales, and a fix for one (entropy bonuses, critique diversity during training) won't touch the other (Why do reasoning models fail differently at training versus inference?). That's the non-obvious part: managing entropy isn't a single dial. And the interventions that work (Clip-Cov, KL-Cov, GPPO) all operate by deliberately slowing the rate at which entropy is allowed to drain, rather than by changing the reward (Does policy entropy collapse limit reasoning performance in RL?).
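One way to picture how an intervention can slow entropy drain without touching the reward is sketched below. It rests on an assumption not spelled out in the notes above: that the Cov-style methods single out the tokens whose log-probability covaries most strongly with advantage, and then either exclude them from the update (clip-style) or add a KL penalty on them (KL-style). The function names, the selection fraction, and the toy numbers are hypothetical; the published selection rules may well differ, and GPPO is not covered here.

```python
def token_covariances(logprobs, advantages):
    """Per-token term (logp_i - mean logp) * (A_i - mean A).

    The mean of these terms is the sample covariance between
    log-probability and advantage over the batch.
    """
    n = len(logprobs)
    mean_lp = sum(logprobs) / n
    mean_adv = sum(advantages) / n
    return [(lp - mean_lp) * (adv - mean_adv)
            for lp, adv in zip(logprobs, advantages)]

def high_covariance_indices(covs, fraction):
    """Indices of the top `fraction` of tokens by covariance term (at least one)."""
    k = max(1, int(len(covs) * fraction))
    ranked = sorted(range(len(covs)), key=lambda i: covs[i], reverse=True)
    return sorted(ranked[:k])

# Toy batch of five tokens (made-up values, illustrative only).
logprobs = [-0.1, -2.0, -0.5, -1.5, -0.05]
advantages = [1.0, -1.0, 0.2, -0.5, 1.5]

covs = token_covariances(logprobs, advantages)
selected = high_covariance_indices(covs, fraction=0.2)
# A clip-style variant would drop `selected` from the gradient update;
# a KL-style variant would add a KL penalty on those tokens only.
```

The design point is that only a small slice of tokens is treated specially, so the bulk of the update proceeds as usual while the steepest contributors to entropy loss are held back.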

If you want to go further, the corpus also complicates the simple 'collapse is bad' story: RL training moves through two phases, and entropy doesn't fall uniformly. Execution-token entropy stabilizes while *planning*-token entropy actually rises as strategic exploration becomes the new bottleneck (Does RL training follow a predictable two-phase learning sequence?). So the real picture is less 'entropy always collapses' and more 'entropy collapses where the model has stopped needing to explore, and the open question is how to keep it alive where exploration still matters.'


Sources (6 notes)

Does policy entropy collapse limit reasoning performance in RL?

Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Why do reasoning models fail differently at training versus inference?

Both failures stem from failed exploration-exploitation balance but occur at different timescales requiring structurally distinct interventions. Training-time fixes (entropy bonuses, critique diversity) cannot prevent inference-time variance inflation, and vice versa; both loops must be managed independently.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-RL researcher re-examining whether policy entropy collapse remains the primary bottleneck in 2024–2026 models. The question: *what actually causes entropy to vanish during reasoning-focused RL, and can we keep it alive where it matters?*

What a curated library found — and when (dated claims, not current truth):
Findings span Feb 2025–May 2026. A curated library identified:
• Entropy collapse follows R = −a·exp(H) + b: reasoning performance saturates exactly as policy entropy approaches zero, capping scaling (2025-05).
• Only ~20% of tokens are high-entropy 'forking points'; RL flattens these decision junctions while leaving execution tokens stable (2025-06).
• Numerical rewards are information-poor; natural-language critiques escape plateaus that scalar rewards cannot (2025-06).
• Training-time entropy collapse and test-time variance inflation are dual failures springing from the same broken exploration–exploitation balance; fixing one doesn't touch the other (2025-06).
• RL training exhibits two phases: execution-token entropy stabilizes early, while planning-token entropy rises later as strategic exploration becomes the new bottleneck (2025-06).

Anchor papers (verify; mind their dates):
• arXiv:2505.22617 (2025-05) — The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
• arXiv:2506.01939 (2025-06) — Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning
• arXiv:2506.03106 (2025-06) — Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
• arXiv:2605.22817 (2026-05) — Vector Policy Optimization: Training for Diversity Improves Test-Time Search

Your task:
(1) RE-TEST THE SATURATION LAW AND DUAL-FAILURE CLAIM. Does the R = −a·exp(H) + b relationship still hold under newer model scales, multi-agent orchestration (e.g., tree search with persistent memory), or modern post-training methods (e.g., RLP as pretraining, rubric anchors)? Have critique diversity or energy-based transformers actually decoupled training entropy from test variance, or do they merely shift the phase boundary? Plainly state where the bottleneck persists and where it has loosened.
(2) Surface the strongest DISAGREEMENT or SUPERSEDING claim from the last 6 months. Does arXiv:2605.22817 or arXiv:2510.01265 (RLP) argue entropy collapse is *not* the primary limit, or that the phase transition changes the story? Flag any paper arguing the bottleneck is *elsewhere* (e.g., reward misspecification, planning-horizon bound, data efficiency).
(3) Propose 2 research questions that *assume the regime may have shifted*: (a) If planning-token entropy is the new frontier, how do we design RL objectives that preserve or amplify it without collapsing execution diversity? (b) Can learned feedback (natural-language or rubric) replace the information deficit at scale, or do we need a fundamentally different feedback channel?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
