INQUIRING LINE

Reward an AI for right answers and it stops exploring, and researchers can now measure that narrowing as a ceiling on its reasoning.

Why does policy entropy collapse when scaling RL for reasoning?

This explores why a model's range of exploratory behavior shrinks, what researchers call policy entropy collapse, as you scale reinforcement learning for reasoning, and what that shrinkage costs. The short version from the corpus: RL rewards a policy for finding strategies that maximize reward, so it keeps doubling down on whatever already works. Diversity is the price. The collapse isn't a bug in one method; it's a structural tendency of reward maximization to converge on a narrow set of high-scoring moves and abandon the rest of the solution space.

The sharpest result here is that this isn't a vague worry; it's a measurable ceiling. One line of work fits an empirical law, R = -a·exp(H) + b, in which reasoning performance R saturates as policy entropy H approaches zero (see: Does policy entropy collapse limit reasoning performance in RL?). At H = 0 the law gives R = b - a, so once the policy stops exploring, it stops improving: the entropy you burn early is performance you can't buy back later. That's why interventions like Clip-Cov, KL-Cov, and GPPO all target the same thing, slowing the entropy drain so the policy keeps some exploratory capacity alive.
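A minimal sketch of how the law R = -a·exp(H) + b can be evaluated and fitted, under stated assumptions. The function names (`policy_entropy`, `predicted_reward`, `fit_law`) are hypothetical, the data points are synthetic and generated from the law itself, and the ordinary least-squares fit is an illustrative choice, not necessarily the procedure used in the cited work. Because the law is linear in x = exp(H), a straight-line fit recovers a and b.

```python
import math

def policy_entropy(probs):
    """Shannon entropy (in nats) of one categorical action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def predicted_reward(entropy, a, b):
    """Evaluate the empirical law R = -a * exp(H) + b."""
    return -a * math.exp(entropy) + b

def fit_law(entropies, rewards):
    """Least-squares fit of (a, b), using the fact that R is linear in exp(H)."""
    xs = [math.exp(h) for h in entropies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_r = sum(rewards) / n
    cov = sum((x - mean_x) * (r - mean_r) for x, r in zip(xs, rewards))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var                      # slope of R against exp(H) is -a
    intercept = mean_r - slope * mean_x    # intercept is b
    return -slope, intercept

# Synthetic checkpoints (assumed values, not measurements from any paper).
true_a, true_b = 0.2, 0.9
entropies = [1.2, 0.9, 0.6, 0.3, 0.1, 0.0]
rewards = [predicted_reward(h, true_a, true_b) for h in entropies]

a, b = fit_law(entropies, rewards)
ceiling = predicted_reward(0.0, a, b)      # value at H = 0, equal to b - a
print(f"a={a:.3f} b={b:.3f} ceiling={ceiling:.3f}")
```

With coefficients fitted on early checkpoints, `predicted_reward(0.0, a, b)` gives the performance the law predicts once entropy is fully spent, which is the sense in which the ceiling is measurable before training reaches it.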

What makes this interesting is how general the mechanism is. The same convergence on narrow strategies shows up in search agents, where RL squeezes behavioral diversity while supervised fine-tuning on varied demonstrations keeps exploration broad (see: Does reinforcement learning squeeze exploration diversity in search agents?). It shows up in dialogue policies, which collapse to a single dominant action regardless of who they're talking to unless meta-learning forces them to stay variable (see: Can meta-learning prevent dialogue policies from collapsing?). And it shows up as scale-dependent collapse in social reasoning, where models below a capacity threshold reach decent accuracy through brittle shortcuts rather than real belief-tracking (see: Does reinforcement learning on theory of mind collapse with model scale?). Different domains, same gravitational pull toward the cheapest reward-maximizing behavior.

There's a deeper clue about the cause in two findings about what RL actually changes. RL updates only 5–30% of parameters, in sparse but nearly identical subnetworks across random seeds: it's making a small, structured, repeatable edit, not broadly reshaping the model (see: Does reinforcement learning update only a small fraction of parameters?). And several results argue RL doesn't create reasoning so much as decide when to deploy capability the base model already has; hybrid models recover 91% of the gains just by routing tokens (see: Does RL post-training create reasoning or just deploy it?). Read together, these suggest entropy collapse is what it looks like when a narrow optimizer sharpens a fixed underlying capability: there's little new to explore, so the policy concentrates rather than expands.
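The sparsity claim can be made concrete with a small measurement sketch. Everything here is illustrative: the helper names (`updated_mask`, `update_fraction`, `mask_overlap`) are hypothetical, the ten-element "parameter vectors" are toy values rather than real checkpoints, and the overlap ratio is one assumed way to quantify "nearly identical across seeds", not the metric from the cited work.

```python
def updated_mask(base, tuned, tol=0.0):
    """Mark each parameter whose value moved by more than tol during RL."""
    return [abs(t - b) > tol for b, t in zip(base, tuned)]

def update_fraction(mask):
    """Fraction of parameters that changed at all."""
    return sum(mask) / len(mask)

def mask_overlap(mask_a, mask_b):
    """Share of parameters updated in run A that run B also updated."""
    both = sum(1 for x, y in zip(mask_a, mask_b) if x and y)
    total_a = sum(mask_a)
    return both / total_a if total_a else float("nan")

# Toy flattened parameters: one base model and two RL runs with different seeds.
base   = [0.10, -0.30, 0.25, 0.00, 0.70, -0.45, 0.05, 0.90, -0.15, 0.60]
seed_1 = [0.10, -0.30, 0.31, 0.00, 0.70, -0.45, 0.05, 0.84, -0.15, 0.60]
seed_2 = [0.10, -0.30, 0.29, 0.00, 0.70, -0.40, 0.05, 0.86, -0.15, 0.60]

mask_1 = updated_mask(base, seed_1)
mask_2 = updated_mask(base, seed_2)
print(update_fraction(mask_1), update_fraction(mask_2))
print(mask_overlap(mask_1, mask_2))
```

In this toy case the two runs touch 20% and 30% of the parameters, and every parameter changed by the first seed is also changed by the second, which is the pattern the note describes as a small, repeatable edit.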

The most useful surprise is what breaks the collapse. Numerical rewards carry almost no information about *why* an answer failed, so the policy has nothing to explore toward, but chain-of-thought critiques let models climb off plateaus they were stuck on, because language feedback restores direction (see: Can natural language feedback overcome numerical reward plateaus?). There's even a two-phase pattern where entropy on planning tokens *rises* in a later strategic-exploration phase, suggesting collapse and exploration aren't a single dial but vary by what part of reasoning you're optimizing (see: Does RL training follow a predictable two-phase learning sequence?). So the answer to 'why does entropy collapse' isn't only 'reward-maximization is greedy'; it's also 'scalar rewards are information-poor,' which points at a different fix than just clamping the entropy term.
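To see why entropy is not a single dial, it helps to track it separately for different token types. The sketch below assumes each generated token comes with its next-token distribution and a boolean flag marking it as a planning token. How tokens get labeled as planning or execution is left open here, and the names `token_entropy` and `mean_entropy_by_group` are hypothetical, as are the toy distributions.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy_by_group(token_dists, is_planning):
    """Average entropy separately over planning and execution tokens."""
    planning = [token_entropy(d) for d, flag in zip(token_dists, is_planning) if flag]
    execution = [token_entropy(d) for d, flag in zip(token_dists, is_planning) if not flag]

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return {"planning": mean(planning), "execution": mean(execution)}

# Toy rollout: two planning tokens with spread-out distributions,
# two execution tokens with peaked ones (all values are assumptions).
token_dists = [
    [0.5, 0.5],
    [1.0, 0.0],
    [0.25, 0.25, 0.25, 0.25],
    [0.9, 0.1],
]
is_planning = [True, False, True, False]

result = mean_entropy_by_group(token_dists, is_planning)
print(result)
```

Logging these two averages per training step, rather than one pooled entropy value, would show a phase in which execution entropy flattens while planning entropy climbs.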

Sources 8 notes

Does policy entropy collapse limit reasoning performance in RL?

Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Can meta-learning prevent dialogue policies from collapsing?

Without MAML, hierarchical RL for Motivational Interviewing phases collapses to a dominant action regardless of user type. Meta-learning enables the master policy to maintain variability and adapt across diverse user profiles.

Does reinforcement learning on theory of mind collapse with model scale?

7B models develop explicit, transferable belief-tracking under RL, while smaller models achieve comparable accuracy through shortcut learning that lacks interpretable reasoning traces. The mismatch between accuracy and reasoning quality is invisible without inspecting step-by-step outputs.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, an intrinsic sparsity that arises without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Do Theory of Mind Benchmarks Need Explicit Human-like Reasoning in Language Models? (2.57 match)
From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR (2.53 match)
RAGEN-2: Reasoning Collapse in Agentic RL (2.53 match)
The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models (1.73 match)
Emergent Hierarchical Reasoning In LLMs Through Reinforcement Learning (1.71 match)
The Invisible Leash: Why RLVR May Not Escape Its Origin (1.68 match)
Teaching Large Language Models to Reason with Reinforcement Learning (1.67 match)
RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs (1.67 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher re-testing constraints on policy entropy collapse in RL-scaled reasoning. The question remains open: *Why does a model's exploratory behavior narrow as RL reasoning scales, and can this collapse be structurally reversed?*

What a curated library found — and when (dated claims, not current truth):
Findings span Feb 2025–May 2026. Key constraints reported:
• Policy entropy collapse follows an empirical law R = -a·exp(H) + b; once entropy → 0, performance saturates (arXiv:2505.22617, ~2025).
• RL updates only 5–30% of parameters in sparse, seed-consistent subnetworks; models don't reshape broadly, just sharpen existing capability (arXiv:2505.11711, ~2025).
• Scalar numerical rewards lack directional signal; chain-of-thought critiques restore exploratory capacity and break performance plateaus (arXiv:2506.03106, ~2025).
• Entropy collapse is not monolithic: planning-token entropy *rises* in later strategic phases, suggesting phase-dependent dynamics (Emergent Hierarchical Reasoning In LLMs Through Reinforcement Learning; arXiv ID not listed here, verify).
• Search agents show RL squeezes diversity while SFT on varied demos expands it; dialogue policies collapse to dominant actions unless meta-learning enforces variability.

Anchor papers (verify; mind their dates):
• arXiv:2505.22617 — The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models (2025-05)
• arXiv:2505.11711 — Reinforcement Learning Finetunes Small Subnetworks in Large Language Models (2025-05)
• arXiv:2506.03106 — Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback (2025-06)
• arXiv:2605.22817 — Vector Policy Optimization: Training for Diversity Improves Test-Time Search (2026-05)

Your task:
(1) RE-TEST EACH CONSTRAINT. For the R = -a·exp(H) + b law, the subnetwork sparsity claim, and the numerical-vs.-language-feedback divide: has emergence of richer reward models, multi-objective RL, or dynamic entropy scheduling since mid-2026 relaxed or overturned any? Separate the durable question (entropy-reward trade-off likely persistent) from perishable claims (e.g., 'only language feedback works'). Cite what resolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—particularly any claiming entropy collapse is reversible, or that diversity and performance are no longer antagonistic.
(3) Propose 2 research questions that assume the regime has shifted: e.g., 'Can multi-head policy ensembles decouple entropy collapse from reward maximization?' or 'Do curriculum RL or meta-learned entropy schedules systematically outperform fixed clamping?'

Cite arXiv IDs; flag anything you cannot ground in a real paper.
