INQUIRING LINE

What happens to model reasoning when policy entropy collapses during RL?

This explores what actually happens to a model's reasoning when its exploratory diversity (policy entropy) shrinks toward zero during reinforcement learning — and whether that collapse is the thing capping reasoning gains.


Policy entropy collapse is the moment when a model stops exploring varied solution paths and converges on a narrow set of reward-maximizing moves. The short version the corpus suggests: entropy collapse is less a side effect than the main ceiling on how far RL can push reasoning, and it tends to make models *sharper at what they already do* rather than *capable of more*.

The most direct account comes from work showing that entropy collapse is the primary bottleneck in scaling RL for reasoning (see 'Does policy entropy collapse limit reasoning performance in RL?'). That work fits a tidy empirical law, under which performance saturates as policy entropy approaches zero, and proposes interventions (Clip-Cov, KL-Cov, GPPO) that deliberately slow entropy's decline to keep the model exploring. So the headline answer is: when entropy collapses, reasoning gains flatline at a predictable ceiling. Strikingly, the same mechanism shows up beyond pure reasoning: RL training on search agents squeezes their behavioral diversity through the identical entropy-collapse dynamic, converging policies onto a few narrow strategies, while SFT on diverse demonstrations preserves exploration breadth (see 'Does reinforcement learning squeeze exploration diversity in search agents?'). That cross-domain echo is the tell that this is a structural property of reward optimization, not a quirk of math problems.

Here's the part a curious reader might not expect: several notes argue the collapse doesn't *destroy* reasoning so much as expose what RL was doing all along. RLVR doesn't expand the boundary of what a model can solve: pass@k analysis shows base models actually beat RLVR-tuned models at high k, meaning RL concentrates sampling onto solutions the base model already had (see 'Does RLVR actually expand what models can reason about?'). A companion view frames verifiable rewards as catalysts that surface pretrained strategies rather than teachers that build new ones, with updates that are structurally sparse and bounded by the prior (see 'How does RL training reshape reasoning and what gets lost?'). And the 'RL teaches when, not how' result drives it home: base models already contain the reasoning in latent form, and RL optimizes *deployment timing* (see 'Does RL post-training create reasoning or just deploy it?'). Read together, entropy collapse is the visible signature of distribution-sharpening: the policy narrows toward known-good paths, which raises average accuracy while quietly shrinking the breadth of solutions it can still reach.

The damage isn't only about diversity, though; it can corrupt the model's self-knowledge too. Binary correctness rewards (a common RL setup) provably degrade calibration, pushing models toward confident guessing because nothing penalizes a confident wrong answer; adding a Brier-score term restores calibration without a trade-off (see 'Does binary reward training hurt model calibration?'). So 'collapse' has two faces: the policy gets narrow *and* overconfident. There's also nuance on *which* entropy collapses: a two-phase view finds execution entropy stabilizes early while planning-token entropy keeps rising, suggesting that productive exploration migrates to strategic planning even as low-level execution locks in (see 'Does RL training follow a predictable two-phase learning sequence?'). Collapse isn't uniform across the reasoning stack.

If the diagnosis is 'RL kills exploration,' the corpus also hints at sidesteps that avoid gradient-driven collapse entirely. Training-Free GRPO gets RL-like distribution shifts by distilling semantic advantages into the prompt as a token prior, with no parameter updates, so there is no entropy to collapse (see 'Can semantic knowledge shift model behavior like reinforcement learning does?'). Memory-based online RL pushes the same idea further, achieving continual adaptation purely through memory operations while leaving weights untouched (see 'Can agents learn continuously from experience without updating weights?'). The throughline across all of this: entropy collapse is what makes RL good at exploitation and bad at expansion, and the frontier of the field is figuring out how to get the gains without paying the diversity tax.


Sources (9 notes)

Does policy entropy collapse limit reasoning performance in RL?

Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.
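
To make the saturation concrete, here is a minimal sketch of the fitted form R = -a·exp(H) + b quoted above. The coefficients `a` and `b` are hypothetical values chosen only for illustration, not fits from the cited work, and the sketch assumes a > 0. The point it shows: as H approaches zero, exp(H) approaches 1, so predicted performance approaches the fixed ceiling b - a.

```python
import math


def predicted_performance(entropy: float, a: float, b: float) -> float:
    """Evaluate the empirical form R = -a * exp(H) + b."""
    return -a * math.exp(entropy) + b


def performance_ceiling(a: float, b: float) -> float:
    """Limit of R as H -> 0: exp(0) = 1, so R -> b - a."""
    return b - a


if __name__ == "__main__":
    # Hypothetical coefficients, for illustration only (assumes a > 0).
    a, b = 0.2, 0.9
    for h in (1.0, 0.5, 0.1, 0.0):
        r = predicted_performance(h, a, b)
        print(f"H={h:.1f}  R={r:.3f}  ceiling={performance_ceiling(a, b):.3f}")
```

Under this form, every unit of entropy spent buys less headroom than the last, which is why interventions that slow entropy's decline are framed as preserving room to improve.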

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
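
The pass@k comparison can be sketched with the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn per problem and c the number that were correct. The per-problem counts below are made up to illustrate the crossover pattern described above; they are not data from the cited work, and `mean_pass_at_k` is a hypothetical helper name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


if __name__ == "__main__":
    n = 100
    # Made-up counts: the "base" model is rarely right but covers every problem;
    # the "tuned" model is very reliable on half the problems and never solves the rest.
    base_counts = [5, 5, 5, 5]
    tuned_counts = [90, 90, 0, 0]
    for k in (1, 8, 64):
        print(k, mean_pass_at_k(base_counts, n, k), mean_pass_at_k(tuned_counts, n, k))
```

In this toy setup the tuned model wins at k = 1 and the base model wins at k = 64, which is the shape of evidence the note describes: sharper sampling, narrower coverage.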

How does RL training reshape reasoning and what gets lost?

Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior, not algorithmic sophistication.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
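
A small sketch of why the extra term matters, assuming the combined reward takes the form correctness minus squared error of the stated confidence, y - (q - y)^2. That exact form is an assumption made for illustration; the note above only says a Brier score is added as a second term. Under a purely binary reward, expected reward does not depend on the stated confidence q at all, so nothing discourages overclaiming. With the Brier term, expected reward is maximized when q equals the true probability p of being correct.

```python
def binary_reward(correct: bool) -> float:
    """Correctness-only reward: ignores stated confidence entirely."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Assumed combined form: correctness minus the Brier penalty (q - y)^2."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


def expected_reward(p_correct: float, confidence: float) -> float:
    """Expected combined reward when the answer is right with probability p_correct."""
    return (p_correct * brier_augmented_reward(True, confidence)
            + (1.0 - p_correct) * brier_augmented_reward(False, confidence))


def best_confidence(p_correct: float, steps: int = 100) -> float:
    """Grid-search the stated confidence that maximizes expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_reward(p_correct, q))


if __name__ == "__main__":
    for p in (0.1, 0.3, 0.8):
        print(f"true p={p:.1f}  reward-maximizing confidence={best_confidence(p):.2f}")
```

The grid search is deliberately crude; the same result follows from setting the derivative of the expected reward with respect to q to zero, which gives q = p.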

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
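
For readers who want to see what is being measured, here is a minimal sketch of per-token policy entropy, split by token role. Entropy at a position is the Shannon entropy of the softmax over next-token logits. The planning/execution labels (`is_planning`) are a hypothetical input: how the cited work identifies planning tokens is not specified here, so the split below only illustrates the bookkeeping.

```python
import math


def token_entropy(logits: list[float]) -> float:
    """Shannon entropy (nats) of the softmax distribution over one position's logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy_by_role(step_logits: list[list[float]],
                         is_planning: list[bool]) -> dict[str, float]:
    """Average entropy over planning vs execution positions (labels are hypothetical)."""
    buckets: dict[str, list[float]] = {"planning": [], "execution": []}
    for logits, planning in zip(step_logits, is_planning):
        buckets["planning" if planning else "execution"].append(token_entropy(logits))
    return {role: (sum(v) / len(v) if v else float("nan")) for role, v in buckets.items()}


if __name__ == "__main__":
    # Toy logits: a flat distribution at a "planning" position,
    # a sharply peaked one at an "execution" position.
    steps = [[0.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0]]
    print(mean_entropy_by_role(steps, [True, False]))
```

Tracking these two averages separately over training is one way to observe the pattern described above, where execution entropy settles while planning entropy keeps moving.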

Can semantic knowledge shift model behavior like reinforcement learning does?

Training-Free GRPO distills semantic advantages from rollout groups into prompts, shifting output distributions toward better answers through in-context learning rather than gradient updates. With a few dozen training samples, it outperforms fine-tuned small LLMs and works with black-box APIs.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher evaluating whether entropy collapse remains the primary bottleneck in RL-scaled reasoning, or whether newer methods/models have relaxed it. The question: **Does policy entropy collapse still limit reasoning capability gains in RL, or have recent advances (new architectures, training regimes, or evaluation methods) circumvented the constraint?**

What a curated library found — and when (findings span 2024–2026; dated claims, not current truth):
• Entropy collapse is the primary bottleneck in RL scaling for reasoning; performance saturates as policy entropy → 0 (2025-05, arXiv:2505.22617).
• RLVR does not expand base-model reasoning boundaries — base models beat tuned ones at high k, suggesting RL concentrates rather than expands (2025-10, arXiv:2510.18176).
• Binary correctness rewards provably degrade calibration; Brier-score terms restore it without trade-off (cited in synthesis, ~2025).
• Two-phase entropy dynamics: execution entropy locks early, planning-token entropy rises — productive exploration migrates to higher reasoning tiers (2025, synthesis reference).
• Training-Free GRPO and memory-based online RL sidestep gradient collapse via prompt priors and weight-frozen memory, achieving RL-like gains without parameter updates (2025-10 & synthesis, ~2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2505.22617 (2025-05) — *The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models*
• arXiv:2510.18176 (2025-10) — *Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains*
• arXiv:2510.08191 (2025-10) — *Training-Free Group Relative Policy Optimization*
• arXiv:2605.22817 (2026-05) — *Vector Policy Optimization: Training for Diversity Improves Test-Time Search*

Your task:
(1) **RE-TEST THE ENTROPY CONSTRAINT.** For each finding above, judge whether models post-2026-Q3, new RL algorithms (e.g., VPO, memory-augmented methods), or evaluation harnesses (longer horizons, multi-step verification) have since **relaxed or overturned** entropy's role as a ceiling. Separate the durable question (does RL optimize distribution rather than expand capability?) from the perishable claim (entropy collapse *is the* bottleneck). Where has the constraint held; where has it broken?
(2) **SURFACE STRONGEST CONTRADICTIONS.** Identify work from the last ~6 months that directly disputes entropy-as-bottleneck or shows reasoning gains *despite* entropy collapse. Flag disagreements between cited papers.
(3) **PROPOSE 2 FORWARD QUESTIONS** that assume the regime may have shifted — e.g., "If memory-based RL truly decouples exploration from parameter updates, what is the new bottleneck?" or "Does test-time scaling (e.g., o1-style reasoning) obsolete the entropy-collapse story?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
