INQUIRING LINE

What distinguishes training-time entropy collapse from test-time variance inflation?

This explores why reasoning models break in two different ways — losing diversity during training (entropy collapse) versus producing wildly inconsistent answers at inference (variance inflation) — and why these are not the same problem with one fix.


Training-time entropy collapse narrows a model's outputs toward a single path; test-time variance inflation makes answers swing unreliably at inference. The corpus's clearest framing is that these are *dual* failures: two faces of a broken exploration-exploitation balance that live at different timescales and demand structurally separate fixes (see "Why do reasoning models fail differently at training versus inference?"). The key takeaway: an entropy bonus or critique-diversity trick during training does nothing to tame variance at inference, and inference-time fixes do nothing to prevent collapse during training. You have to manage both loops independently.

Entropy collapse is the better-documented half, and the corpus treats it almost as a law of nature for RL post-training. Policy entropy reliably drains toward zero, and performance saturates alongside it, as captured by the empirical curve in which reasoning gains flatten as exploratory capacity dies (see "Does policy entropy collapse limit reasoning performance in RL?"). The mechanism shows up everywhere: RL converges on a single dominant pretraining format within the first epoch while suppressing the alternatives (see "Does RL training collapse format diversity in pretrained models?"), and the same squeeze that hits reasoning models also compresses behavioral diversity in search agents (see "Does reinforcement learning squeeze exploration diversity in search agents?"). Collapse, in other words, is a *training* pathology: the model is sculpted into a narrow groove, and once it is there, no amount of inference compute digs it back out (see "Can non-reasoning models catch up with more compute?").
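The saturation curve mentioned above is quoted in the sources list as R = -a·exp(H) + b. The following is a minimal sketch of how that curve behaves, assuming R is downstream performance and H is policy entropy in nats; the coefficients `A` and `B` and both example distributions are hypothetical values chosen only for illustration, not fitted numbers from any paper.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def predicted_performance(entropy, a, b):
    """Evaluate the quoted curve R = -a * exp(H) + b."""
    return -a * math.exp(entropy) + b


# Hypothetical coefficients (assumption, for illustration only).
A, B = 0.1, 0.8

diverse = [0.25, 0.25, 0.25, 0.25]    # uniform over 4 tokens, H = ln 4
collapsed = [0.97, 0.01, 0.01, 0.01]  # nearly deterministic policy

for name, dist in [("diverse", diverse), ("collapsed", collapsed)]:
    h = token_entropy(dist)
    print(f"{name}: H={h:.3f}  R={predicted_performance(h, A, B):.3f}")

# As H approaches 0 the curve approaches its ceiling, b - a.
ceiling = predicted_performance(0.0, A, B)
print(f"ceiling: {ceiling:.3f}")
```

Under this reading, performance is bought by spending entropy, and once entropy is exhausted there is nothing left to trade: the collapsed policy already sits close to the ceiling `B - A`.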

Variance inflation is the opposite-feeling failure: instead of too little spread, you get too much of the wrong kind at deployment. The corpus connects this to *calibration*. Binary correctness rewards quietly teach models to make high-confidence guesses because nothing penalizes a confident wrong answer, so the spread you see at test time is noise rather than productive exploration (see "Does binary reward training hurt model calibration?"). It also surfaces in how overly hard training samples breed degenerate shortcuts that contaminate inference behavior (see "Do overly hard RLVR samples actually harm model capabilities?"). The fix lives at a different layer, such as adding a proper scoring rule like the Brier score or filtering degenerate rollouts, not in the training entropy knob.
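To make the calibration point concrete, here is a small sketch of one way a Brier term could be added to a binary correctness reward. It assumes the model emits a stated confidence in [0, 1] alongside its answer; the function names and the equal weighting of the two terms are illustrative assumptions, not the cited note's exact recipe.

```python
def binary_reward(correct):
    """Plain correctness reward: blind to how confident the model was."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct, confidence):
    """Correctness minus the Brier penalty (confidence - outcome)^2.

    Assumption: `confidence` is the model's stated probability of
    being correct, in [0, 1].
    """
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2


# A wrong answer earns the same binary reward at any confidence...
print(binary_reward(False))                 # 0.0
# ...but the Brier term separates a confident miss from a hedged one.
print(brier_augmented_reward(False, 0.9))   # about -0.81
print(brier_augmented_reward(False, 0.1))   # about -0.01
print(brier_augmented_reward(True, 0.9))    # about 0.99
```

With the plain binary reward, guessing at full confidence costs nothing when wrong. With the added term, a confident miss is the worst outcome available.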

What makes the distinction genuinely useful is that the *right* amount of diversity depends on how the model will be used. When a model feeds into test-time search (evolutionary algorithms, mode-combining procedures), you actually *want* high output diversity, and training should maximize it rather than optimize toward one scalar answer; an entropy-collapsed policy literally cannot reach solutions that a diverse one can (see "Should training maximize diversity when models feed into search?"). So the same property (output spread) is a virtue or a vice depending on the loop you're standing in. Cross-rollout variance is even repurposed as a *signal*, used simultaneously to weight tokens and filter bad queries during training (see "Can one statistical measure serve dual purposes in RL training?").

The quietly surprising thread underneath all this: many of these failures trace back to how far a model is pushed from its base distribution. Staying close to the base (low KL drift) preserves the plasticity that collapse destroys (see "Does staying close to the base model preserve learning ability?"), and decoding-time interventions that leave base weights untouched avoid corrupting stored knowledge entirely (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"). So the deeper distinction isn't just *when* the failure happens: collapse is something you do *to the weights*, while variance inflation is something you fail to *shape at the output*. Two timescales, two layers, two toolkits.
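The two ideas in this paragraph, shifting the output distribution at decoding time and measuring drift from the base, can be sketched together. The sketch assumes the proxy-tuning form in which a large base model's logits are offset by the difference between a small tuned model and its untuned counterpart; all logit values below are made up for illustration, and the three-token vocabulary is a toy.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def proxy_tuned_logits(base, small_tuned, small_untuned):
    """Offset base logits by (small tuned - small untuned).

    The base model's weights are never modified; only the logits
    used for sampling change.
    """
    return [b + (t - u) for b, t, u in zip(base, small_tuned, small_untuned)]


# Toy next-token logits over a 3-token vocabulary (hypothetical values).
base_logits = [2.0, 1.0, 0.5]
small_tuned = [1.5, 0.2, 0.1]
small_untuned = [1.0, 0.3, 0.1]

shifted = proxy_tuned_logits(base_logits, small_tuned, small_untuned)
drift = kl_divergence(softmax(shifted), softmax(base_logits))
print(f"shifted logits: {shifted}")
print(f"KL(shifted || base): {drift:.4f} nats")
```

The printed KL value is the kind of quantity "low KL drift" refers to: how far the sampling distribution has moved from the base, independent of whether the move came from a weight update or a decoding-time offset.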


Sources (11 notes)

Why do reasoning models fail differently at training versus inference?

Both failures stem from failed exploration-exploitation balance but occur at different timescales requiring structurally distinct interventions. Training-time fixes (entropy bonuses, critique diversity) cannot prevent inference-time variance inflation, and vice versa; both loops must be managed independently.

Does policy entropy collapse limit reasoning performance in RL?

The empirical law R = -a·exp(H) + b, where R is downstream performance, H is policy entropy, and a and b are fitted coefficients, shows that performance saturates as policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
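The group-relative effect described above can be illustrated numerically. This is a minimal sketch assuming GRPO-style normalization, where each rollout's advantage is its reward minus the group mean, divided by the group's population standard deviation; the group size of 16 and the binary rewards are hypothetical.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same query.

    Assumption: population standard deviation with a small epsilon,
    one common convention among several.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


# Nearly impossible query: 1 accidental success in 16 rollouts.
hard = [1.0] + [0.0] * 15
# Moderately hard query: 8 successes in 16 rollouts.
balanced = [1.0] * 8 + [0.0] * 8

hard_adv = group_relative_advantages(hard)[0]
balanced_adv = group_relative_advantages(balanced)[0]
print(f"advantage of the lone success: {hard_adv:.3f}")       # about 3.873
print(f"advantage of a typical success: {balanced_adv:.3f}")  # about 1.000
```

With binary rewards and success rate p, a successful rollout's advantage works out to sqrt((1 - p) / p), so the rarer the success, the larger the update it triggers, whether or not the trajectory behind it reasoned soundly.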

Should training maximize diversity when models feed into search?

Vector Policy Optimization trains models to emit varied competent solutions rather than converging to one answer. This unlocks search procedures like evolutionary algorithms to explore and combine modes, solving problems that entropy-collapsed policies cannot reach at all.

Can one statistical measure serve dual purposes in RL training?

DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-examining the claim that training-time entropy collapse and test-time variance inflation are dual, structurally separable failures in reasoning model RL post-training. A curated library (papers 2024–2026) framed them as failures at different timescales—collapse sculpts weights into narrow modes; variance inflation corrupts output calibration—each needing independent fixes. Is this distinction still holding, or have recent advances (new model architectures, training recipes, inference-time interventions, or evaluation methods) blurred or collapsed the boundary?

What a curated library found—and when (dated claims, not current truth):
• Policy entropy reliably drains toward zero during RL post-training, and performance saturates in lockstep; collapse is treated as a near-inevitable training pathology (~2025).
• RL converges on a single dominant pretraining format within the first epoch, suppressing alternatives; non-reasoning models do not catch up to reasoning-trained ones regardless of inference budget (~2025).
• Test-time variance inflation stems from binary reward RL degrading calibration; a proper scoring rule (Brier, log-loss) or filtering degenerate rollouts is the fix, separate from entropy management (~2024–2025).
• When a model runs inside test-time search (evolutionary, mode-combining), training *should* maximize diversity; entropy collapse becomes a hard constraint on reachability (~2026).
• Low KL drift from base weights preserves plasticity; decoding-time proxies (e.g., proxy-tuning) avoid weight corruption better than fine-tuning (~2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2505.22617 (2025-05): *The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models*
• arXiv:2509.23808 (2025-09): *Beyond the Exploration-Exploitation Trade-off: A Hidden State Approach for LLM Reasoning*
• arXiv:2605.28388 (2026-05): *Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs*
• arXiv:2605.12484 (2026-05): *Learning, Fast and Slow: Towards LLMs That Adapt Continually*

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For each claim (entropy collapse inevitability, single-mode convergence, calibration-via-scoring-rule, diversity-for-search, KL-drift-preserves-plasticity), determine whether newer model families, multi-stage training methods, mixture-of-experts routing, or speculative/cascade decoding have since RELAXED or OVERTURNED it. Distinguish the durable question ("How do we maintain exploratory capacity during RL?") from the perishable limitation ("Binary rewards always kill calibration"). Cite what specifically resolved each constraint, or state plainly where it still appears to hold.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last ~6 months. Look for papers that claim entropy collapse is NOT the bottleneck, or that variance inflation can be fixed *during* training rather than at inference, or that the dual-problem framing itself misses a unifying mechanism.
(3) **Propose 2 research questions** that assume the regime may have shifted: e.g., "Does mixture-of-experts routing naturally decouple weight collapse from output variance?" or "Can a single loss term—e.g., entropy + calibration—simultaneously solve both, making them non-dual?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
