INQUIRING LINE


Is the way RL training makes AI stop exploring new strategies actually the main limit on how capable reasoning models can become?

Does policy entropy collapse represent the main bottleneck in reasoning-focused RL scaling?


This explores whether policy entropy collapse (the way RL training narrows a model's exploration until it keeps sampling the same few reward-maximizing strategies) is the dominant ceiling on reasoning RL, or one constraint competing with others. The corpus gives entropy collapse a strong, even quantified, case for being the bottleneck: an empirical law, R = -a·exp(H) + b, under which reasoning performance saturates as policy entropy approaches zero, and interventions that deliberately preserve exploratory capacity (Clip-Cov, KL-Cov, GPPO) push the ceiling back up (see "Does policy entropy collapse limit reasoning performance in RL?"). What makes this more than a single-paper claim is that the same mechanism shows up in a different domain: RL training on search agents compresses behavioral diversity through the same collapse seen in reasoning, and supervised fine-tuning on diverse demonstrations preserves exploration breadth (see "Does reinforcement learning squeeze exploration diversity in search agents?"). So entropy collapse is at least a *general* failure mode, not a quirk of one benchmark.
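To make "policy entropy" concrete, here is a minimal sketch that computes the Shannon entropy of a next-token distribution. The two distributions and the four-token vocabulary are hypothetical values chosen for illustration, not data from the corpus.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    # Zero-probability entries contribute nothing, so they are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical next-token distributions over a 4-token vocabulary.
exploratory = [0.25, 0.25, 0.25, 0.25]  # probability mass spread evenly
collapsed = [0.97, 0.01, 0.01, 0.01]    # nearly deterministic policy

h_exploratory = shannon_entropy(exploratory)
h_collapsed = shannon_entropy(collapsed)

print(f"exploratory: {h_exploratory:.3f} nats")
print(f"collapsed:   {h_collapsed:.3f} nats")
```

The uniform distribution attains the maximum entropy for four outcomes, log(4), while the peaked one sits close to zero, which is the regime the notes describe as collapse.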

But the corpus also suggests the bottleneck moves as training progresses, which complicates the 'main bottleneck' framing. RL training appears to run in two phases: first, execution correctness drives the gains; then strategic planning becomes the constraint, with planning-token entropy actually *rising* while execution entropy stabilizes (see "Does RL training follow a predictable two-phase learning sequence?"). That's a subtle but important wrinkle: late in training, you don't necessarily want entropy to keep falling everywhere; you want it concentrated in the right place. Entropy collapse as a blanket diagnosis misses that the *location* of exploration matters.

There's also a more deflationary view in the corpus: maybe RL isn't expanding reasoning capability at all, so 'scaling' is the wrong frame. The evidence that base models already contain reasoning in latent form, and that RL mostly learns *when* to deploy it rather than *how* to do it (hybrid models recover 91% of the performance gains by routing tokens alone), implies the ceiling may be a capability the base model already has, not something exploration unlocks (see "Does RL post-training create reasoning or just deploy it?"). If that's right, preserving entropy buys you better access to existing skills, not new ones.

And the corpus points to bottlenecks that have nothing to do with entropy. Numerical rewards plateau because they carry no information about *why* an answer failed; swapping in natural-language critiques breaks plateaus that more reward signal couldn't (see "Can natural language feedback overcome numerical reward plateaus?"). Binary correctness rewards quietly degrade calibration because they do not penalize confident wrong guesses, which the corpus fixes by adding a proper scoring rule, the Brier score, as a second reward term (see "Does binary reward training hurt model calibration?"). And the reward signal's statistical structure itself, with a single cross-rollout statistic used both to weight tokens and to filter degenerate queries, drives stability and speed independent of exploration (see "Can one statistical measure serve dual purposes in RL training?").

The honest read: entropy collapse is the best-characterized and most clearly *predictive* bottleneck in this corpus, with a quantified ceiling and cross-domain replication. But calling it *the* main one overstates a moving target — the binding constraint shifts from execution to planning over training, may be capped by latent base-model capability, and competes with reward-information and calibration failures that entropy-preservation alone won't touch. The more defensible claim is that entropy collapse is the bottleneck that scaling *naively* (more steps, more reward) runs into first — which is exactly why it looks primary.

Sources (7 notes)

Does policy entropy collapse limit reasoning performance in RL?

Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.
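A small sketch of how the law R = -a·exp(H) + b implies a ceiling: at H = 0 the exponential equals 1, so predicted performance tops out at b - a. The coefficients and entropy checkpoints below are hypothetical, and the least-squares fit is only one plausible way to estimate a and b; the note does not say how the original work fitted them.

```python
import math


def predicted_performance(entropy, a, b):
    """Evaluate R = -a * exp(H) + b for a given policy entropy H."""
    return -a * math.exp(entropy) + b


def performance_ceiling(a, b):
    """Predicted R at H = 0, where exp(0) = 1."""
    return b - a


def fit_a_b(entropies, rewards):
    """Ordinary least squares for R = -a * x + b, with x = exp(H)."""
    xs = [math.exp(h) for h in entropies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_r = sum(rewards) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxr = sum((x - mean_x) * (r - mean_r) for x, r in zip(xs, rewards))
    slope = sxr / sxx
    intercept = mean_r - slope * mean_x
    return -slope, intercept  # a is the negated slope, b the intercept


# Hypothetical coefficients and entropy values at successive checkpoints.
A_TRUE, B_TRUE = 0.2, 0.9
entropies = [1.2, 0.8, 0.4, 0.1]
rewards = [predicted_performance(h, A_TRUE, B_TRUE) for h in entropies]

a_hat, b_hat = fit_a_b(entropies, rewards)
ceiling = performance_ceiling(a_hat, b_hat)
print(f"a={a_hat:.3f}, b={b_hat:.3f}, ceiling at H=0: {ceiling:.3f}")
```

Because the synthetic rewards are generated from the law itself, the fit recovers the chosen coefficients; on real training curves the residuals would show how well the law holds.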

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.


Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
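The difference between the two reward shapes can be sketched as follows. The note only says the Brier score is added as a second reward term; the exact combination used here (correctness minus the squared error of the stated confidence, with unit weight) is an assumption for illustration, and the function names are hypothetical.

```python
def binary_reward(correct):
    """Plain correctness reward: ignores the model's stated confidence."""
    return 1.0 if correct else 0.0


def calibrated_reward(correct, confidence):
    """Correctness minus a Brier penalty, (confidence - correctness)^2.

    Assumed form: the unit weighting of the two terms is illustrative.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


# Hypothetical outcomes: two wrong answers and one right answer.
confident_wrong = calibrated_reward(False, 0.95)
hedged_wrong = calibrated_reward(False, 0.20)
confident_right = calibrated_reward(True, 0.95)

print(f"confident wrong: {confident_wrong:.4f}")
print(f"hedged wrong:    {hedged_wrong:.4f}")
print(f"confident right: {confident_right:.4f}")
```

Under the binary reward both wrong answers score 0.0, so nothing discourages overconfidence; with the penalty term the confident wrong answer scores strictly lower than the hedged one.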

Can one statistical measure serve dual purposes in RL training?

DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.
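The query-level half of this idea can be illustrated with a generic filter. The note does not specify DRO's actual statistic, so this sketch substitutes plain variance of per-rollout scores as a stand-in; the function name, threshold, and batch data are all hypothetical, and the token-level weighting is not shown.

```python
from statistics import pvariance


def filter_degenerate_queries(rollout_scores, min_variance=1e-6):
    """Keep queries whose rollout scores vary enough to give a comparison signal.

    rollout_scores maps a query id to the list of scores of its rollouts.
    Plain variance is an assumed stand-in for the method's real statistic.
    """
    kept = {}
    for query_id, scores in rollout_scores.items():
        if len(scores) < 2:
            continue  # a single rollout cannot be compared with anything
        if pvariance(scores) > min_variance:
            kept[query_id] = scores
    return kept


# Hypothetical batch: q2 and q3 have identical scores across rollouts.
batch = {
    "q1": [0.2, 0.9, 0.5, 0.7],
    "q2": [1.0, 1.0, 1.0, 1.0],
    "q3": [0.0, 0.0, 0.0, 0.0],
}
kept = filter_degenerate_queries(batch)
print(sorted(kept))
```

Queries whose rollouts all score the same offer no relative preference between samples, which is one reading of what "degenerate comparisons" means in the note.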

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR (match 2.53)
RAGEN-2: Reasoning Collapse in Agentic RL (match 2.53)
The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models (match 1.73)
Emergent Hierarchical Reasoning In LLMs Through Reinforcement Learning (match 1.71)
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models (match 1.68)
The Invisible Leash: Why RLVR May Not Escape Its Origin (match 1.68)
Do Theory of Mind Benchmarks Need Explicit Human-like Reasoning in Language Models? (match 1.67)
Teaching Large Language Models to Reason with Reinforcement Learning (match 1.67)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an RL researcher re-evaluating whether policy entropy collapse remains the primary bottleneck in reasoning-focused LLM scaling. The question stays open; treat the findings below as dated claims.

What a curated library found — and when (dated claims, not current truth):
Findings span Sept 2024–May 2026. A corpus identified entropy collapse as a quantified, cross-domain failure mode: reasoning RL saturates as policy entropy→0, and entropy-preserving methods (Clip-Cov, KL-Cov, GPPO) push ceilings back up (~2025). The same mechanism appears in search-agent RL. BUT the corpus also surfaces tensions: RL training is two-phase (execution first, then planning); planning-token entropy *rises* while execution entropy stabilizes, suggesting entropy collapse as a blanket diagnosis misses *where* exploration matters. A deflationary view emerges: base models may already contain reasoning in latent form; RL learns *when* to deploy it, not *how*, so entropy preservation buys better routing, not new capability (~2025). Separate bottlenecks compete: numerical rewards plateau (fixed by natural-language feedback), binary rewards degrade calibration (fixed by proper scoring rules), and cross-rollout variance drives stability independent of exploration (~2024–2025).

Anchor papers (verify; mind their dates):
  • arXiv:2505.22617 (The Entropy Mechanism, May 2025)
  • arXiv:2504.09858 (Reasoning Models Without Thinking, Apr 2025)
  • arXiv:2506.03106 (Critique-GRPO, Jun 2025)
  • arXiv:2510.13786 (Art of Scaling RL Compute, Oct 2025)

Your task:
(1) RE-TEST THE MOVING TARGET. For each constraint above—entropy collapse, two-phase dynamics, latent-capability ceiling, reward-structure bottlenecks—probe whether newer models (o1, o3, etc.), training methods (vector policy, rubric anchors), or evals have *relaxed* or *overturned* it. Separate the durable question (does exploration breadth still matter?) from the perishable claim (is it the *main* bottleneck?). Cite what resolved it.
(2) Surface the strongest *disagreement* or *contradiction* in the last 6 months. Does any recent work argue that entropy collapse is NOT primary, or that the latent-capability thesis is wrong?
(3) Propose 2 research questions that assume the regime has shifted: e.g., "If planning entropy rises during RL, what training dynamic creates phase separation?" or "Can you measure how much RL unlocks latent vs. novel reasoning?"

Cite arXiv IDs; flag anything you cannot ground.
