INQUIRING LINE


The moment an AI locks onto a single strategy and stops exploring, its performance hits a ceiling it can't break through.

How does entropy collapse in reinforcement learning differ from entropy maintenance in graph reasoning?

This explores a contrast between two ways 'entropy' shows up in the corpus: as a failure mode in reinforcement learning (policies collapsing toward a single narrow strategy) versus as something deliberately preserved when a model navigates a structure like a knowledge graph under uncertainty.

Entropy plays opposite roles in these two settings: it is the thing that *collapses* and ruins RL training, and the thing you'd want to *keep alive* when a model reasons its way through a graph. Worth saying up front: the corpus has a deep, well-developed account of the first and only an oblique account of the second, so the cleaner framing is that entropy collapse is a documented disease, and 'maintenance' is the family of cures and the conditions under which exploration must be protected.

On the collapse side, the corpus is unusually crisp. Policy entropy collapse is described as *the* primary bottleneck in scaling RL for reasoning: there's an empirical law where performance saturates as policy entropy drops toward zero, because the model stops exploring and converges on whatever narrow strategy maximizes reward (see "Does policy entropy collapse limit reasoning performance in RL?"). The same mechanism is shown to recur in search agents: RL squeezes behavioral diversity the same way it does in reasoning, while supervised fine-tuning on diverse demonstrations keeps exploration broad (see "Does reinforcement learning squeeze exploration diversity in search agents?"). And it's not one problem but two: training-time entropy collapse and test-time variance inflation are *dual* failures of exploration-exploitation balance that need structurally separate fixes (see "Why do reasoning models fail differently at training versus inference?").
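The source note for this law gives its form as R = -a·exp(H) + b. A minimal sketch of what that implies is below, assuming illustrative coefficients: the values of `A` and `B` are made up for the example and are not fitted numbers from any paper. It shows that once entropy H reaches zero the predicted performance is pinned at b - a, so the remaining headroom shrinks to nothing.

```python
import math

def predicted_performance(entropy: float, a: float, b: float) -> float:
    """Law quoted in the source note: R = -a * exp(H) + b."""
    return -a * math.exp(entropy) + b

# Hypothetical coefficients, chosen only for illustration (not fitted values).
A, B = 0.2, 0.9

# The ceiling is the value of the law at H = 0, i.e. R_max = b - a.
ceiling = predicted_performance(0.0, A, B)

# As entropy falls, performance rises toward the ceiling and headroom vanishes.
for h in (1.0, 0.5, 0.1, 0.0):
    r = predicted_performance(h, A, B)
    print(f"H={h:.1f}  R={r:.3f}  headroom={ceiling - r:.3f}")
```

Read this way, entropy behaves like a budget that training spends to buy reward: under this law, a policy that has already driven H to zero has nothing left to spend.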

What makes the 'maintenance' angle interesting is that entropy is not uniformly bad to lose: the corpus shows it should be preserved *selectively*. Only about 20% of tokens are high-entropy 'forking points' where the real reasoning decisions happen, and RLVR essentially works by adjusting those; train on just the forking tokens and you match or exceed full updates (see "Do high-entropy tokens drive reasoning model improvements?"). Entropy also isn't monolithic across a trajectory: RL training moves through two phases where *execution* entropy stabilizes (you want that part to converge) while *planning* entropy actually rises as strategic exploration becomes the bottleneck (see "Does RL training follow a predictable two-phase learning sequence?"). So healthy training looks like collapse in the routine parts and maintained entropy at the decision points.
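The selective idea above can be sketched in a few lines: compute the entropy of each next-token distribution, keep the top fraction as forking points, and let only those positions contribute to the loss. This is an illustrative sketch, not the procedure from the cited work; the helper names (`token_entropy`, `forking_mask`), the toy distributions, and the per-token losses are all assumptions made for the example.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def forking_mask(entropies, keep_fraction=0.2):
    """Mark the highest-entropy positions; ties go to the earlier position."""
    n_keep = max(1, round(len(entropies) * keep_fraction))
    ranked = sorted(range(len(entropies)), key=lambda i: -entropies[i])
    keep = set(ranked[:n_keep])
    return [i in keep for i in range(len(entropies))]

# Toy per-position distributions over a 3-token vocabulary (made-up numbers).
distributions = [
    [0.98, 0.01, 0.01],
    [0.34, 0.33, 0.33],    # near-uniform: a forking point
    [0.90, 0.05, 0.05],
    [0.97, 0.02, 0.01],
    [0.99, 0.005, 0.005],
]
entropies = [token_entropy(d) for d in distributions]
mask = forking_mask(entropies)

# Hypothetical per-token losses; only masked positions contribute to the update.
per_token_loss = [0.4, 1.2, 0.3, 0.2, 0.1]
masked_loss = sum(l for l, m in zip(per_token_loss, mask) if m) / sum(mask)
print(mask, masked_loss)
```

With five positions and a 20% budget, only the near-uniform position survives the mask, so the update is driven entirely by the one place where the model was genuinely undecided.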

This is exactly where graph reasoning enters. When a model navigates a knowledge graph, the corpus replaces exhaustive whole-graph reading with a *learned traversal policy*: Graph-O1 uses MCTS plus RL to step through the graph selectively, explicitly trading certainty about the full structure for decision-making under uncertainty (see "Can learned traversal policies beat exhaustive graph reading?"). Each branching node in a traversal is a forking point in the same sense as the high-entropy tokens above: collapse your exploration too early and you commit to a bad path; the tree search exists precisely to *maintain* a spread of live options. So the difference you're asking about is less a contradiction than a question of *where*: collapse is what you fight at the decision frontier (planning tokens, graph branch points), and convergence is what you want everywhere else (execution, settled sub-paths).
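To make "maintaining a spread of live options" concrete, here is a minimal sketch of branch selection at a single graph node using UCB1, the selection rule commonly used in UCT-style tree search. Whether Graph-O1 uses this exact rule is not stated in the source, so treat it as a generic stand-in; the edge names, statistics, and exploration constant `c` are hypothetical.

```python
import math

def ucb_score(mean_value, visits, parent_visits, c=1.4):
    """UCB1: exploit the running mean, plus a bonus for rarely tried branches."""
    if visits == 0:
        return math.inf  # untried branches are always tried first
    return mean_value + c * math.sqrt(math.log(parent_visits) / visits)

def select_branch(stats, c=1.4):
    """stats maps neighbor -> (mean_value, visits); returns the branch to expand."""
    parent_visits = sum(v for _, v in stats.values())
    return max(stats, key=lambda n: ucb_score(*stats[n], parent_visits, c))

# Hypothetical statistics at one branching node of a toy knowledge graph.
stats = {
    "born_in":    (0.60, 20),
    "works_for":  (0.55, 3),
    "located_in": (0.30, 4),
}

greedy = max(stats, key=lambda n: stats[n][0])   # collapsed policy: best mean only
explored = select_branch(stats)                  # search: mean plus exploration bonus
print(greedy, explored)
```

The greedy choice keeps returning to the heavily visited edge, while the UCB1 choice picks the promising but under-sampled one. Setting `c=0` removes the bonus and reproduces the greedy, collapsed behavior.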

The takeaway: the cures for collapse and the demands of graph reasoning are the same idea seen twice. Whether it's natural-language critiques injecting fresh signal to break a reward plateau (see "Can natural language feedback overcome numerical reward plateaus?") or a tree search keeping multiple graph paths open, both are mechanisms for keeping the model's exploratory capacity alive at exactly the points where premature certainty would be fatal, and for letting it collapse safely where the answer is already routine.

Sources (7 notes)

Does policy entropy collapse limit reasoning performance in RL?

Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Why do reasoning models fail differently at training versus inference?

Both failures stem from failed exploration-exploitation balance but occur at different timescales requiring structurally distinct interventions. Training-time fixes (entropy bonuses, critique diversity) cannot prevent inference-time variance inflation, and vice versa; both loops must be managed independently.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Can learned traversal policies beat exhaustive graph reading?

Graph-O1 replaces whole-graph ingestion with step-by-step agentic navigation using Monte Carlo Tree Search and reinforcement learning. This approach fits within LLM context windows while learning domain-specific traversal policies, though it trades certainty about the full graph for decision-making under uncertainty.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

- From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR (3.37 match)
- RAGEN-2: Reasoning Collapse in Agentic RL (3.37 match)
- Teaching Large Language Models to Reason with Reinforcement Learning (2.48 match)
- The Invisible Leash: Why RLVR May Not Escape Its Origin (2.47 match)
- Do Theory of Mind Benchmarks Need Explicit Human-like Reasoning in Language Models? (2.46 match)
- The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models (1.73 match)
- Emergent Hierarchical Reasoning In LLMs Through Reinforcement Learning (1.69 match)
- A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well? (1.65 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher re-testing claims about entropy dynamics in RL-trained reasoning systems. The question remains: *how* and *where* should entropy be allowed to collapse, and where should it be maintained, in LLM reasoning, especially across graph traversal?

What a curated library found — and when (dated claims, not current truth):
Findings span Feb 2025–May 2026. A library of 12 papers on RL reasoning documents:
- Policy entropy collapse is the *primary bottleneck* in RL scaling for reasoning; performance saturates as entropy drops to zero (~2505.22617).
- Only ~20% of tokens are high-entropy 'forking points' driving reasoning; selective entropy preservation at decision nodes (not whole-trajectory) is the key (~2506.01939).
- RL training exhibits a two-phase dynamic: *execution* entropy stabilizes (convergence is desired) while *planning* entropy rises as strategic exploration becomes critical (Emergent Hierarchical Reasoning In LLMs Through Reinforcement Learning).
- Graph reasoning replaces exhaustive reading with learned traversal policies (MCTS + RL); tree search maintains exploration spread at branch points to avoid premature path collapse (Graph-O1).
- Natural-language feedback breaks RL performance plateaus caused by reward saturation, injecting fresh signal to re-ignite exploration (~2506.03106).

Anchor papers (verify; mind their dates):
- 2505.22617 The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
- 2506.01939 Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning
- 2506.03106 Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
- 2605.22817 Vector Policy Optimization: Training for Diversity Improves Test-Time Search

Your task:
(1) RE-TEST EACH CONSTRAINT. For each claim above, assess whether newer model scales, training recipes (mixed reward signals, curriculum RL), inference methods (speculative decoding, adaptive compute), or evaluation harnesses have since *relaxed* the bottleneck or revealed it as regime-specific. Separate the durable structural insight (entropy dynamics at decision points are real) from perishable limitation (20% forking-point threshold, two-phase timing). Cite what has shifted it.
(2) Surface the strongest *reconciling* work from the last ~6 months that bridges entropy collapse (bad) and entropy maintenance (good) — or any paper that shows the two operate under different constraints (model scale, task structure, reward design) than the library suggested.
(3) Propose 2 research questions that assume the regime has moved: e.g., do *adaptive* entropy targets (rather than fixed preservation at forking points) unlock longer reasoning chains? Does multi-agent coordination change when entropy is distributed across sub-agents?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
