INQUIRING LINE

Why does policy entropy collapse predict sigmoid saturation points?

This explores why a model's performance curve flattens out exactly when its exploration runs dry — the link between shrinking policy entropy in RL and the predictable ceiling where extra training stops paying off.


The cleanest answer in the corpus is an empirical law: performance follows R = -a·exp(H) + b, where H is policy entropy and a and b are fitted constants. As entropy falls toward zero, exp(H) falls toward 1 and R presses up against its ceiling of b - a. That is the saturation: not a coincidence but a direct consequence of the law, in which entropy is the fuel the curve burns. When the policy stops exploring, the reachable upside is already priced in (see "Does policy entropy collapse limit reasoning performance in RL?").
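To make the shape of the law concrete, here is a minimal sketch that evaluates R = -a·exp(H) + b at a few entropy values. The coefficients `A` and `B` are hypothetical values chosen only for illustration, not fitted numbers from any paper. Because exp(0) = 1, the value the curve approaches as H reaches zero is b - a.

```python
import math


def reward_from_entropy(entropy, a, b):
    """Evaluate the empirical law R = -a * exp(H) + b."""
    return -a * math.exp(entropy) + b


# Hypothetical coefficients, for illustration only (not fitted values).
A, B = 0.2, 0.9

for h in (1.0, 0.5, 0.1, 0.0):
    print(f"H = {h:.1f}  ->  R = {reward_from_entropy(h, A, B):.3f}")

# At H = 0 the exponential equals 1, so the ceiling is B - A.
ceiling = reward_from_entropy(0.0, A, B)

# Gain bought by spending the first half-unit of entropy vs. the last one.
early_gain = reward_from_entropy(0.5, A, B) - reward_from_entropy(1.0, A, B)
late_gain = reward_from_entropy(0.0, A, B) - reward_from_entropy(0.5, A, B)
```

Under these assumed coefficients, each further unit of entropy spent buys less performance than the one before, which is the flattening the law describes.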

The mechanism is worth making concrete: RL rewards push a policy to concentrate probability on whatever strategy is currently winning. Each update narrows the distribution, entropy drops, and the model samples fewer distinct attempts. Early on this is pure gain, because you're cutting bad behaviors. But the same force that sharpens you also strands you: once entropy is spent, you can no longer stumble onto a better strategy than the one you've converged to. Interventions like Clip-Cov, KL-Cov, and GPPO are explicit attempts to ration that fuel, slowing entropy reduction so the saturation point arrives later and higher (see "Does policy entropy collapse limit reasoning performance in RL?").
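The narrowing described above can be sketched on a toy problem. This is a minimal illustration, assuming a three-armed bandit with a softmax policy trained by exact policy-gradient ascent; the reward values, step count, and learning rate are all hypothetical, and this is not the training setup of any cited paper.

```python
import math


def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def train(rewards, steps=200, lr=1.0):
    """Gradient ascent on expected reward J = sum_i pi_i * r_i."""
    logits = [0.0] * len(rewards)  # uniform start: maximum entropy
    history = []
    for _ in range(steps):
        probs = softmax(logits)
        expected = sum(p * r for p, r in zip(probs, rewards))
        history.append((entropy(probs), expected))
        # Exact gradient for a softmax policy: dJ/dlogit_i = pi_i * (r_i - J)
        logits = [z + lr * p * (r - expected)
                  for z, p, r in zip(logits, probs, rewards)]
    return softmax(logits), history


# Hypothetical per-arm rewards, for illustration only.
REWARDS = [0.2, 0.5, 1.0]
final_probs, history = train(REWARDS)

print("start: entropy=%.3f reward=%.3f" % history[0])
print("end:   entropy=%.3f reward=%.3f" % history[-1])
```

In this toy setting the policy ends up concentrated on the best arm, with lower entropy and higher expected reward than at the start. The toy has no better hidden strategy to miss, so it shows only the narrowing, not the stranding.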

What makes this more than a one-paper curiosity is that the same collapse shows up in domains that share none of the original vocabulary. Search agents trained with RL squeeze their behavioral diversity in the same way reasoning models do, with policies piling onto narrow reward-maximizing routines, and the remedy follows the same logic: supervised fine-tuning on diverse demonstrations preserves the exploration breadth that RL drains (see "Does reinforcement learning squeeze exploration diversity in search agents?"). So the sigmoid ceiling isn't about reasoning per se; it's a property of reward-maximizing training under finite exploration.

A few notes hint at why some plateaus aren't truly terminal, which is the same as saying the sigmoid's ceiling can sometimes be lifted. When numerical rewards stall, chain-of-thought critiques can unstick a model, because the plateau was partly an information problem: scalar rewards never told the model *why* it failed (see "Can natural language feedback overcome numerical reward plateaus?"). Relatedly, RL training moves through phases: execution correctness saturates first, then strategic planning becomes the binding constraint, and planning-token entropy actually *rises* even as overall behavior narrows (see "Does RL training follow a predictable two-phase learning sequence?"). A single global entropy number can therefore hide where the remaining exploration still lives.
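A small sketch shows how a global average can mask a rising sub-population. The token groups ("execution", "planning"), the 8-to-2 mix, and every probability below are made-up numbers for illustration; how tokens would be labeled as planning or execution in practice is an assumption left outside this sketch.

```python
import math
from collections import defaultdict


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def global_mean_entropy(tokens):
    """Average entropy over all tokens, ignoring their group labels."""
    return sum(entropy(probs) for _, probs in tokens) / len(tokens)


def mean_entropy_by_group(tokens):
    """Average entropy separately for each group label."""
    sums, counts = defaultdict(float), defaultdict(int)
    for group, probs in tokens:
        sums[group] += entropy(probs)
        counts[group] += 1
    return {group: sums[group] / counts[group] for group in sums}


# Hypothetical next-token distributions at two checkpoints (made-up data).
EARLY = [("execution", [0.5, 0.3, 0.2])] * 8 + [("planning", [0.7, 0.2, 0.1])] * 2
LATE = [("execution", [0.98, 0.01, 0.01])] * 8 + [("planning", [0.4, 0.3, 0.3])] * 2

for name, tokens in (("early", EARLY), ("late", LATE)):
    print(name, round(global_mean_entropy(tokens), 3), mean_entropy_by_group(tokens))
```

With these invented numbers the global mean falls between checkpoints while the planning-group mean rises, so tracking entropy per group is one way to see where exploration remains.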

The thing you might not have known you wanted to know: entropy collapse may also be why RL feels structurally 'cheap.' It rewrites only 5–30% of parameters, and those sparse updates are nearly identical across random seeds: the policy isn't exploring a wide space and landing somewhere idiosyncratic, it's funneling toward the same narrow basin every time (see "Does reinforcement learning update only a small fraction of parameters?"). The convergence that produces saturation and the convergence that produces near-deterministic, sparse parameter changes are two views of the same shrinking distribution.


Sources (5 notes)

Does policy entropy collapse limit reasoning performance in RL?

Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an RL-in-LLMs researcher tracking a specific tension: does policy entropy collapse *inevitably* predict sigmoid saturation, or have newer methods, model scales, or training regimes since mid-2025 decoupled these two? Treat the findings below as dated claims—possibly already superseded.

What a curated library found — and when (findings span 2024–09 through 2026–05, perishable claims):
• Performance follows R = −a·exp(H) + b; entropy H → 0 forces R → ceiling b − a (since exp(0) = 1), making saturation mathematically *necessary* (2025–05, arXiv:2505.22617).
• RL updates only 5–30% of parameters in sparse, nearly-identical subnetworks across seeds—convergence is deterministic and narrow (2025–05, arXiv:2505.11711).
• Scalar numerical rewards plateau first; natural language critiques can *unstick* the ceiling by providing richer signal (2025–06, arXiv:2506.03106).
• Two-phase RL dynamics: execution saturates first, then planning; overall behavior narrows but planning-token entropy can *rise* (2025–05, arXiv:2505.22617).
• Diversity-focused training (e.g., Vector Policy Optimization) during post-training improves test-time search, suggesting entropy rationing delays saturation (2026–05, arXiv:2605.22817).

Anchor papers (verify; mind their dates):
• arXiv:2505.22617 (2025–05) — The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
• arXiv:2505.11711 (2025–05) — Reinforcement Learning Finetunes Small Subnetworks
• arXiv:2506.03106 (2025–06) — Critique-GRPO: Natural Language and Numerical Feedback
• arXiv:2605.22817 (2026–05) — Vector Policy Optimization: Training for Diversity

Your task:
(1) RE-TEST THE SIGMOID INEVITABILITY. For each finding above—especially the R = −a·exp(H) + b law—determine whether test-time search, multi-agent orchestration, mixture-of-experts routing, or verifiable-reasoning rewards (arXiv:2507.22844, arXiv:2508.12790) have *relaxed* the entropy-saturation link. Is the ceiling truly immovable, or do richer feedback signals, meta-reasoning overlays, or post-hoc search engines decouple test-time performance from train-time entropy collapse?
(2) Surface the strongest DISAGREEMENT or SUPERSEDING WORK from the last 6 months. If papers contradict the sigmoid model—e.g., showing unbounded scaling under certain feedback architectures—name them plainly and explain the crack.
(3) Propose 2 new research questions that assume the regime *has* shifted: (a) Under what feedback topology (scalar vs. natural-language vs. verifiable) does entropy collapse no longer predict saturation? (b) Can post-training diversity injection after convergence re-open exploration without full retraining?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
