INQUIRING LINE

Does an AI's willingness to explore at deployment depend almost entirely on how much freedom it kept during training?

How does policy entropy during training affect search discipline during inference?

This line asks whether the entropy a policy keeps (or loses) during RL training shapes how disciplined, or how narrow, a model's search behavior is at inference time. Put differently, does training-time collapse make inference-time search narrow and brittle? The corpus suggests the link is direct: the exploration breadth you have at deployment is largely the breadth you protected during training, not something the model recovers on its own when given more compute.

The anchor is the finding that policy entropy collapse is the main bottleneck in RL scaling for reasoning (see 'Does policy entropy collapse limit reasoning performance in RL?'). There is a clean empirical law: performance saturates as entropy approaches zero, because the policy converges onto a few reward-maximizing trajectories and stops trying alternatives. Interventions like Clip-Cov and KL-Cov exist precisely to slow that collapse and keep exploratory capacity alive. The striking part is that the same mechanism shows up in search agents specifically: RL training squeezes exploration diversity in search just as it does in reasoning, with policies narrowing onto a single confident path, while SFT on diverse demonstrations preserves the breadth (see 'Does reinforcement learning squeeze exploration diversity in search agents?'). So 'search discipline' at inference is not only a decoding-time setting; it is inherited from how much entropy survived training.

Where the entropy lives matters as much as how much there is. Only about 20% of tokens are high-entropy 'forking points,' and RLVR does most of its useful work by adjusting exactly those pivotal decision tokens (see 'Do high-entropy tokens drive reasoning model improvements?'). Training on that minority alone matches or exceeds full-gradient updates. Read alongside the entropy-collapse law, this reframes the problem: collapse is not uniform. It is the flattening of these specific branch points, and once they flatten, the model stops exploring the alternatives that branch points exist to choose between. The two-phase view sharpens this further: RL training first consolidates execution (which stabilizes its entropy) and only later opens up strategic planning, where planning-token entropy actually *rises* and becomes the new bottleneck (see 'Does RL training follow a predictable two-phase learning sequence?'). Healthy search discipline, then, looks like low entropy on the mechanical steps and preserved entropy on the strategic forks, not entropy minimized everywhere.
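To make the 'forking point' idea concrete, here is a minimal sketch of how one might flag the highest-entropy positions in a sequence. It assumes you already have a next-token probability distribution for each position; the function names, the toy distributions, and the 20% cutoff applied per sequence are illustrative assumptions, not the procedure used in the cited work.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def forking_mask(entropies, top_fraction=0.2):
    """Mark the positions whose entropy falls in the top `top_fraction`.

    Ties at the threshold are all kept, so slightly more than
    `top_fraction` of positions can be marked.
    """
    k = max(1, round(len(entropies) * top_fraction))
    threshold = sorted(entropies, reverse=True)[k - 1]
    return [h >= threshold for h in entropies]

# Hypothetical per-position distributions over a 4-token vocabulary.
dists = [
    [0.97, 0.01, 0.01, 0.01],    # near-deterministic "execution" step
    [0.25, 0.25, 0.25, 0.25],    # genuine fork: four live options
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.70, 0.20, 0.05, 0.05],
]
entropies = [token_entropy(d) for d in dists]
mask = forking_mask(entropies)
for h, is_fork in zip(entropies, mask):
    print(f"H={h:.3f}  fork={is_fork}")
```

A mask like this could, in principle, be used to restrict a policy-gradient loss to the marked positions, which is the spirit of 'training on that minority'; how the threshold is actually chosen in practice is left open here.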

The payoff for the inference side is the finding that training regime beats inference compute budget: a model trained to keep productive exploration outperforms one that is simply given a larger token budget at deployment, because the extra tokens are only useful if the policy knows how to spend them exploring (see 'Can non-reasoning models catch up with more compute?'). A collapsed policy handed more inference compute just repeats its narrow path more confidently. This connects to the deeper claim that base models already contain latent exploratory reasoning that post-training selects rather than creates (see 'Do base models already contain hidden reasoning ability?'), which means entropy collapse during RL can actively *prune away* search behaviors the model was capable of before training narrowed it.
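A toy calculation illustrates why extra samples do little for a collapsed policy. Treat each complete search trajectory as one outcome of a categorical distribution and ask how many distinct trajectories to expect after N independent samples. The two distributions below are made up for illustration, and real policies are not this simple, so read it as intuition rather than evidence.

```python
def expected_distinct(probs, n_samples):
    """Expected number of distinct outcomes seen in n_samples i.i.d. draws.

    Outcome i is missed by every draw with probability (1 - p_i) ** n_samples,
    so it is seen at least once with probability 1 - (1 - p_i) ** n_samples.
    """
    return sum(1.0 - (1.0 - p) ** n_samples for p in probs)

# Hypothetical policies over 10 candidate search trajectories.
broad = [0.1] * 10                  # entropy preserved: all paths live
collapsed = [0.991] + [0.001] * 9   # entropy collapsed onto one path

for n in (1, 4, 16, 64):
    print(
        f"N={n:<3d} broad={expected_distinct(broad, n):.2f} "
        f"collapsed={expected_distinct(collapsed, n):.2f}"
    )
```

Under these assumptions the broad policy covers most of its options within a handful of samples, while the collapsed policy keeps returning essentially one path no matter how large N gets.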

The thing worth taking away: 'search discipline' is a double-edged phrase. You want the model disciplined enough to stop flailing on routine steps, but entropy collapse can over-discipline it — quietly deleting the branch-point diversity that makes inference-time search worth running at all. The corpus frames good training not as minimizing entropy but as managing *where* it collapses.

Sources (6 notes)

Does policy entropy collapse limit reasoning performance in RL?

Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.
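The law quoted above can be turned into a small worked example. The sketch below evaluates R = -a·exp(H) + b for a few entropy values and reports how much performance is still available before entropy reaches zero. The coefficients a and b are hypothetical placeholders; in practice they would be fitted per model and task, and the function names are introduced here only for illustration.

```python
import math

def predicted_performance(entropy: float, a: float, b: float) -> float:
    """Evaluate the fitted law R = -a * exp(H) + b at policy entropy H."""
    return -a * math.exp(entropy) + b

def remaining_headroom(entropy: float, a: float) -> float:
    """Gain still available if entropy fell from H to zero: a * (exp(H) - 1)."""
    return a * (math.exp(entropy) - 1.0)

# Hypothetical coefficients, chosen only for illustration.
A, B = 0.2, 0.9

for h in (1.0, 0.5, 0.1, 0.0):
    r = predicted_performance(h, A, B)
    print(f"H={h:.1f}  R={r:.3f}  headroom={remaining_headroom(h, A):.3f}")
```

Under this form the ceiling is R = b - a at H = 0, so the headroom shrinks to nothing as entropy is spent. That is the sense in which interventions that slow entropy reduction aim to get more out of each unit of entropy rather than to raise the ceiling directly.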

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR (3.37 match)
RAGEN-2: Reasoning Collapse in Agentic RL (2.53 match)
The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models (1.73 match)
On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models (1.70 match)
Emergent Hierarchical Reasoning In LLMs Through Reinforcement Learning (1.69 match)
The Invisible Leash: Why RLVR May Not Escape Its Origin (1.67 match)
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning (0.92 match)
Eliciting Reasoning in Language Models with Cognitive Tools (0.89 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher re-testing claims about policy entropy and inference-time search discipline. The question: does entropy *retained* during RL training determine search breadth at deployment, or have newer models, methods, or eval frameworks since relaxed this constraint?

What a curated library found — and when (findings span 2024–2026; treat as dated claims):
• Policy entropy collapse is the primary bottleneck in RL scaling for reasoning; performance saturates as entropy→0, forcing interventions like Clip-Cov and KL-Cov to preserve exploratory capacity (~2025).
• Only ~20% of tokens are high-entropy 'forking points'; RL does most useful work by adjusting precisely those pivotal decision tokens, and training on that minority matches full updates (~2025).
• RL training exhibits two-phase dynamics: procedural consolidation (low entropy) precedes strategic planning (entropy *rises* on planning tokens); collapse is not uniform (~2025).
• Training regime beats inference compute budget—a model trained to preserve productive exploration outperforms one given unlimited tokens, because collapsed policies cannot use extra tokens to explore (~2025).
• Base models already contain latent exploratory reasoning; post-training selects rather than creates, meaning entropy collapse can prune away capabilities the base model possessed (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2505.22617 (2025-05) — The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
• arXiv:2506.01939 (2025-06) — Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
• arXiv:2510.07364 (2025-10) — Base Models Know How to Reason, Thinking Models Learn When
• arXiv:2605.22817 (2026-05) — Vector Policy Optimization: Training for Diversity Improves Test-Time Search

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (o3, o4, scaling beyond 2025 limits), training methods (curriculum, multi-objective RL, hybrid SFT+RL), tooling (inference-time adaptive sampling, dynamic entropy budgets), or evals (adversarial search tasks, out-of-distribution reasoning) have since relaxed or overturned it. Separate the durable claim ('entropy during training shapes inference search') from perishable specifics (e.g., 'only 20% forking tokens'; 'two-phase order'). Where a constraint still holds, say plainly.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—e.g., evidence that inference-time compute or decoding tricks *do* overcome a collapsed policy, or that entropy collapse is not actually the bottleneck (cite arXiv ID).
(3) Propose 2 research questions that assume the regime may have moved: e.g., 'Can adaptive entropy budgeting per token-type during inference recover search diversity loss from training?' or 'Do mixture-of-experts or retrieval-augmented reasoning bypass the training-entropy constraint?'

Cite arXiv IDs; flag anything you cannot ground in a real paper.
