INQUIRING LINE

An AI can lock into rigid patterns during training, then give wildly inconsistent answers in use — and fixing one doesn't fix the other.

How does inference variance differ from training entropy collapse?

This explores why reasoning models break in two different ways — losing diversity during training (entropy collapse) versus producing wildly inconsistent answers at deployment (inference variance) — and why those are not the same problem.

The cleanest framing in the corpus treats the two as *dual* failures of the same underlying thing: the balance between exploring new options and exploiting known-good ones. They show up at opposite ends of the pipeline and on different clocks, with too little variety during training and too much during use, which is why fixing one doesn't fix the other (see: Why do reasoning models fail differently at training versus inference?).

Entropy collapse is a *training* pathology. As a model is optimized with reinforcement learning, its output distribution narrows: it grows confident and stops trying alternatives. There's even a predictable ceiling to it: performance saturates as policy entropy heads toward zero, and once the model stops exploring, it simply can't get better no matter how much more you train (see: Does policy entropy collapse limit reasoning performance in RL?). This collapse isn't random; it concentrates on the small minority of high-entropy "forking" tokens where reasoning decisions actually get made, so squeezing those squeezes the learning signal (see: Do high-entropy tokens drive reasoning model improvements?). It also shows up as format collapse, where RL quietly funnels the model onto a single dominant answer style and suppresses the others it knew from pretraining (see: Does RL training collapse format diversity in pretrained models?).
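To make "the output distribution narrows" concrete, here is a minimal sketch that computes Shannon entropy for a toy four-option next-token distribution as it sharpens. The logits and the `sharpness` multipliers are made-up values for illustration; they do not come from any of the cited notes.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical logits for four candidate tokens at one decision point.
logits = [2.0, 1.0, 0.5, 0.0]

# Scaling the logits up stands in for a policy that grows more confident.
sharpness_levels = (0.0, 1.0, 4.0, 16.0)
entropies = [entropy(softmax([s * x for x in logits])) for s in sharpness_levels]

for s, h in zip(sharpness_levels, entropies):
    print(f"sharpness={s:>4}: entropy={h:.6f} nats")
```

Entropy starts at its maximum (a uniform distribution over four options) and falls toward zero as the distribution concentrates on a single token, which is the regime the paragraph above describes as collapse.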

Inference variance is the opposite complaint, and it happens after training is done. Here the model produces *too much* spread: ask the same question a few times and get answers of wildly different quality. The important insight is that a training-time fix (entropy bonuses, keeping diverse critiques alive) does nothing for this, and an inference-time fix does nothing for collapse. They're separate control loops that each have to be managed on their own terms (see: Why do reasoning models fail differently at training versus inference?). So "add more entropy" is not a universal cure: depending on which failure you have, more entropy is either the medicine or the disease.
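One rough way to see inference variance is to sample the same prompt several times and measure how much the final answers disagree. The sketch below assumes the answers have already been collected as strings; the `answer_spread` helper and both sample lists are hypothetical, not a metric taken from the cited notes.

```python
import math
from collections import Counter

def answer_spread(answers):
    """Return (agreement rate, entropy in nats) over repeated answers."""
    counts = Counter(answers)
    n = len(answers)
    probs = [c / n for c in counts.values()]
    agreement = max(probs)  # share of samples giving the most common answer
    spread = -sum(p * math.log(p) for p in probs)
    return agreement, spread

# Eight hypothetical samples for the same question from two models.
stable = ["42"] * 8
unstable = ["42", "41", "42", "7", "40", "42", "13", "41"]

print(answer_spread(stable))
print(answer_spread(unstable))
```

A measurement like this is taken on a finished model at sampling time, which is why it sits in a different control loop from anything done to policy entropy during training.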

What ties the two together is that both are really about a model's relationship to its own uncertainty. Post-trained models actually track how surprising their input is and lower their output entropy on text they generated themselves, an implicit, unverbalized confidence signal baked into the distribution (see: Why do models produce less uncertain outputs on their own text?). That same machinery can be miscalibrated from either side. Train on impossible problems and the model collapses onto degenerate shortcuts that masquerade as confidence (see: Do overly hard RLVR samples actually harm model capabilities?). Distill from an over-informed teacher and the student inherits false certainty that looks fine in-domain but shatters into bad variance the moment it meets out-of-distribution problems (see: Does richer teacher context hurt student generalization?).

The thing worth carrying away: confidence and diversity aren't virtues or vices on their own. Collapse is the failure of a model that became too sure during training; variance is the failure of a model that stays unsure at the wrong moments during use — and because they live at different points in the lifecycle, you have to diagnose which one you're actually looking at before reaching for a fix.

Sources 7 notes

Why do reasoning models fail differently at training versus inference?

Both failures stem from failed exploration-exploitation balance but occur at different timescales requiring structurally distinct interventions. Training-time fixes (entropy bonuses, critique diversity) cannot prevent inference-time variance inflation, and vice versa; both loops must be managed independently.

Does policy entropy collapse limit reasoning performance in RL?

Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.
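The saturation claim follows directly from the form of the law: at H = 0 the exponential equals 1, so predicted performance tops out at b - a. The sketch below evaluates the formula for a few entropy values. The coefficients `a` and `b` are hypothetical placeholders (assuming a > 0), not fitted values from the paper.

```python
import math

def predicted_performance(policy_entropy, a, b):
    """Evaluate R = -a * exp(H) + b for a given policy entropy H."""
    return -a * math.exp(policy_entropy) + b

# Hypothetical coefficients, chosen only to make the shape visible.
a, b = 0.2, 0.8

# The ceiling is reached when entropy is fully spent (H = 0).
ceiling = predicted_performance(0.0, a, b)

for h in (1.0, 0.5, 0.1, 0.0):
    r = predicted_performance(h, a, b)
    print(f"H={h:.1f}: R={r:.4f}, headroom to ceiling={ceiling - r:.4f}")
```

Under this form, every gain is paid for with entropy, and the remaining headroom a * (exp(H) - 1) shrinks to nothing as H approaches zero.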

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
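A minimal sketch of what "training exclusively on them" could look like: rank tokens by entropy, keep the top 20%, and zero out the loss everywhere else. The per-sequence top-k rule, the `forking_token_mask` helper, and all the numbers are assumptions for illustration; the note does not specify how the threshold is computed.

```python
def forking_token_mask(token_entropies, keep_fraction=0.2):
    """Mark the highest-entropy tokens; True means the token keeps its gradient."""
    n = len(token_entropies)
    k = max(1, round(n * keep_fraction))
    # Indices sorted from highest to lowest entropy.
    order = sorted(range(n), key=lambda i: token_entropies[i], reverse=True)
    keep = set(order[:k])
    return [i in keep for i in range(n)]

# Hypothetical per-token entropies for a 10-token response.
entropies = [0.01, 0.02, 1.9, 0.05, 0.03, 0.0, 2.3, 0.04, 0.02, 0.01]
# Hypothetical per-token losses for the same response.
losses = [0.3, 0.1, 0.8, 0.2, 0.1, 0.05, 0.9, 0.2, 0.1, 0.1]

mask = forking_token_mask(entropies)
masked_losses = [loss if keep else 0.0 for loss, keep in zip(losses, mask)]

print([i for i, keep in enumerate(mask) if keep])
print(masked_losses)
```

Only the two tokens with clearly elevated entropy survive the mask here, so the update is driven entirely by the positions where the model was actually choosing between continuations.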

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Why do models produce less uncertain outputs on their own text?

Post-trained models produce 3-4x lower output entropy on their own generations, driven by an internal representation of input surprise that causally modulates confidence. This implicit self-recognition signal appears without being verbalized, encoded directly in the output distribution.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
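The arithmetic behind the "rare accidental success" effect can be sketched as follows, assuming a GRPO-style advantage of (reward - group mean) / group standard deviation. The exact normalization (population standard deviation, the `eps` term) and the group sizes are assumptions here, not details given in the note.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Nearly impossible problem: 1 accidental success out of 16 rollouts.
hard_group = [1.0] + [0.0] * 15
# Well-matched problem: half of the rollouts succeed.
balanced_group = [1.0] * 8 + [0.0] * 8

hard_adv = group_relative_advantages(hard_group)
balanced_adv = group_relative_advantages(balanced_group)

print(f"lucky success on hard problem: {hard_adv[0]:.3f}")
print(f"success on balanced problem:   {balanced_adv[0]:.3f}")
```

In this toy setup the lone lucky rollout gets an advantage close to four times that of a success on the balanced problem, so whatever shortcut produced it is reinforced strongly regardless of whether the reasoning was sound.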

Does richer teacher context hurt student generalization?

Teachers conditioned on correct answers and verifier output produce confident, concise traces that students inherit. This style suppresses uncertainty expression, optimizing in-domain performance while degrading generalization to out-of-distribution problems that require epistemic caution.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: **How do inference-time variance and training-time entropy collapse represent distinct failure modes, and what are the current constraints on controlling each independently?**

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026. The library identified these core tensions:

• Entropy collapse during RL training concentrates on high-entropy "forking" tokens (~20% of tokens); squeezing these kills learning signal; performance plateaus as policy entropy→0 (2025-06, 2025-05).
• Inference variance persists even after training fixes: same input yields wildly different answer quality; training-time entropy bonuses do NOT reduce test-time variance (2025-04).
• Post-trained models implicitly calibrate confidence via on-policy entropy: output entropy 3–4× lower on self-generated vs. off-policy text; miscalibration from hard RL samples or over-informed teachers propagates as false certainty (2026-03, 2025-01).
• RL post-training amplifies pretraining format modes; single dominant style suppresses learned alternatives; this locks in during training, not inference (2025-04).
• Both failures are dual aspects of uncertainty management, not a single lever: collapse = overtrained confidence; variance = undertrained calibration (2025-09).

Anchor papers (verify; mind their dates):
• arXiv:2505.22617 (2025-05) — The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
• arXiv:2506.01939 (2025-06) — Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning
• arXiv:2603.24472 (2026-03) — Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?
• arXiv:2509.23808 (2025-09) — Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For collapse: do newer RL methods (e.g., curriculum sampling, entropy-aware reward shaping, mixture-of-experts routing) still fail to prevent forking-token squeeze, or has per-token entropy control become tractable? For variance: do recent test-time methods (beam search variants, ensemble decoding, dynamic temperature scaling, in-context calibration) now reliably reduce variance without retraining? Separate the durable question (likely: how to maintain diversity in *high-leverage* tokens without sacrificing convergence) from perishable limits (e.g., "entropy bonuses don't help variance" — check if adaptive, online entropy adjustment changes this).

(2) **Surface strongest CONTRADICTING or SUPERSEDING work** from last ~6 months. If any paper shows that collapse and variance *are* unified by a single mechanism (e.g., a learned uncertainty model that governs both), or that one subsumes the other under new training paradigms, flag it explicitly and explain the disagreement.

(3) **Propose 2 research questions that ASSUME the regime may have moved:**
   - If high-entropy token control is now fine-grained (e.g., per-layer or per-attention-head entropy masking), can we *trade* training convergence speed for inference variance reduction in a principled way?
   - Do models that learn *explicit* confidence scores (vs. implicit entropy) during RL still exhibit the dual failure, or does metacognitive supervision decouple collapse from variance?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
