What distinguishes training-time entropy collapse from test-time variance inflation?
This explores why reasoning models break in two different ways — losing diversity during training (entropy collapse) versus producing wildly inconsistent answers at inference (variance inflation) — and why these are not the same problem with one fix.
Training-time entropy collapse narrows a model's outputs toward a single path; test-time variance inflation makes answers swing unreliably at inference. The corpus's clearest framing is that these are *dual* failures: two faces of a broken exploration-exploitation balance that live at different timescales and demand structurally separate fixes (see "Why do reasoning models fail differently at training versus inference?"). The key takeaway: an entropy bonus or critique-diversity trick during training does nothing to tame variance at inference, and inference-time fixes cannot prevent entropy collapse during training. You have to manage both loops independently.
Entropy collapse is the better-documented half, and the corpus treats it almost as a law of nature for RL post-training. Policy entropy reliably drains toward zero, and performance saturates right alongside it, as captured by the empirical curve where reasoning gains flatten as exploratory capacity dies (see "Does policy entropy collapse limit reasoning performance in RL?"). The mechanism shows up everywhere: RL converges on a single dominant pretraining format within the first epoch while suppressing all the alternatives (see "Does RL training collapse format diversity in pretrained models?"), and the same squeeze that hits reasoning models also compresses behavioral diversity in search agents (see "Does reinforcement learning squeeze exploration diversity in search agents?"). Collapse, in other words, is a *training* pathology: the model is being sculpted into a narrow groove, and once it's there, no amount of inference compute digs it back out (see "Can non-reasoning models catch up with more compute?").
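The saturation described above can be made concrete with the empirical fit quoted in the source notes, R = -a·exp(H) + b. The following is a minimal sketch, not a reproduction of any fitted result: the coefficients `A` and `B` and the two toy distributions are hypothetical placeholders chosen only to show the shape of the curve.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def predicted_performance(entropy, a, b):
    """Empirical fit from the source notes: R = -a * exp(H) + b."""
    return -a * math.exp(entropy) + b


def performance_ceiling(a, b):
    """Value of the fit at H = 0, i.e. once policy entropy is fully spent."""
    return predicted_performance(0.0, a, b)


# Hypothetical coefficients, for illustration only (not fitted values).
A, B = 0.1, 0.8

uniform = [0.25, 0.25, 0.25, 0.25]  # exploratory policy over four tokens
peaked = [0.97, 0.01, 0.01, 0.01]   # nearly collapsed policy

if __name__ == "__main__":
    for name, dist in [("uniform", uniform), ("peaked", peaked)]:
        h = token_entropy(dist)
        print(name, round(h, 3), round(predicted_performance(h, A, B), 3))
    print("ceiling", round(performance_ceiling(A, B), 3))
```

Under this fit, the performance still available from spending entropy is a·(exp(H) − 1), which shrinks to zero as H does. That is the sense in which gains flatten once exploratory capacity is gone.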
Variance inflation is the opposite-feeling failure: instead of too little spread, you get too much of the wrong kind at deployment. The corpus connects this to *calibration*: binary correctness rewards quietly teach models to make high-confidence guesses because nothing penalizes a confident wrong answer, so the spread you see at test time is noise rather than productive exploration (see "Does binary reward training hurt model calibration?"). It also surfaces in how overly hard training samples breed degenerate shortcuts that contaminate inference behavior (see "Do overly hard RLVR samples actually harm model capabilities?"). The fix lives at a different layer, such as adding a proper scoring rule like the Brier score or filtering degenerate rollouts, not in the training entropy knob.
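To see why a binary reward is indifferent to overconfidence while a Brier term is not, consider the sketch below. It assumes the model emits a scalar confidence in [0, 1] alongside its answer and that the two terms are simply added; the function names and the unweighted sum are illustrative assumptions, not the cited note's exact formulation.

```python
def binary_reward(is_correct, confidence):
    """Correctness-only reward: the stated confidence is ignored entirely."""
    return 1.0 if is_correct else 0.0


def calibrated_reward(is_correct, confidence):
    """Correctness reward minus a Brier penalty on the stated confidence.

    Assumed form (illustrative): r = y - (q - y)^2, where y is 1 for a
    correct answer and 0 otherwise, and q is the stated confidence.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    y = 1.0 if is_correct else 0.0
    return y - (confidence - y) ** 2


if __name__ == "__main__":
    cases = [(False, 0.95), (False, 0.20), (True, 0.95)]
    for is_correct, q in cases:
        print(is_correct, q,
              binary_reward(is_correct, q),
              calibrated_reward(is_correct, q))
```

With the binary reward, a wrong answer scores the same at any confidence. With the Brier term, a confident wrong answer is penalized far more than a hedged one, which is the pressure toward calibration that the binary signal lacks.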
What makes the distinction genuinely useful is that the *right* amount of diversity depends on how the model will be used. When a model feeds into test-time search (evolutionary algorithms, mode-combining procedures), you actually *want* high output diversity, and training should maximize it rather than optimize toward one scalar answer; an entropy-collapsed policy literally cannot reach solutions that a diverse one can (see "Should training maximize diversity when models feed into search?"). So the same property (output spread) is a virtue or a vice depending on the loop you're standing in. Cross-rollout variance is even repurposed as a *signal*, used simultaneously to weight tokens and filter bad queries during training (see "Can one statistical measure serve dual purposes in RL training?").
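The dual use of cross-rollout variance can be illustrated with a toy sketch. This is not the cited method's actual statistic or pipeline: the per-token score matrix, the equal-length rollouts, the normalization, and the `min_mean_variance` threshold are all assumptions made for illustration. It only shows how one quantity can be aggregated at two levels.

```python
from statistics import pvariance


def position_variances(scores):
    """Variance across rollouts at each token position.

    `scores[i][t]` is a hypothetical per-token score for rollout i at
    position t; all rollouts are assumed padded to the same length.
    """
    return [pvariance(column) for column in zip(*scores)]


def token_weights(scores):
    """Token-level use: weight positions by their share of total variance."""
    variances = position_variances(scores)
    total = sum(variances)
    if total == 0.0:
        # No disagreement anywhere: fall back to uniform weights.
        return [1.0 / len(variances)] * len(variances)
    return [v / total for v in variances]


def keep_query(scores, min_mean_variance=1e-3):
    """Query-level use: drop queries whose rollouts barely disagree."""
    variances = position_variances(scores)
    return sum(variances) / len(variances) >= min_mean_variance


if __name__ == "__main__":
    informative = [[0.1, 0.9, 0.5], [0.1, 0.1, 0.5], [0.1, 0.5, 0.5]]
    degenerate = [[0.2, 0.2], [0.2, 0.2]]
    print(token_weights(informative), keep_query(informative))
    print(token_weights(degenerate), keep_query(degenerate))
```

The design point is that positions where rollouts disagree receive the weight, while queries where nothing disagrees carry no comparison signal and can be discarded.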
The quietly surprising thread underneath all this: many of these failures trace back to how far a model is pushed from its base distribution. Staying close to the base (low KL drift) preserves the plasticity that collapse destroys (see "Does staying close to the base model preserve learning ability?"), and decoding-time interventions that leave base weights untouched avoid corrupting stored knowledge entirely (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"). So the deeper distinction isn't just *when* the failure happens: collapse is something you do *to the weights*, while variance inflation is something you fail to *shape at the output*. Two timescales, two layers, two toolkits.
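A decoding-time intervention of this kind can be sketched as logit arithmetic. The version below follows the commonly described proxy-tuning recipe, in which a large base model's next-token logits are shifted by the difference between a small tuned model and its untuned counterpart; it assumes all three models share a vocabulary, and the toy logits are invented for illustration. The KL helper simply measures how far the shifted distribution drifts from the base one.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_logits(base, expert, anti_expert):
    """Shift base logits by the (tuned - untuned) offset of a smaller pair.

    The base model's weights are never modified; only its output logits
    are adjusted at decoding time.
    """
    if not (len(base) == len(expert) == len(anti_expert)):
        raise ValueError("all logit vectors must share one vocabulary")
    return [b + (e - a) for b, e, a in zip(base, expert, anti_expert)]


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


if __name__ == "__main__":
    base = [2.0, 1.0, 0.5]         # toy logits from the large base model
    expert = [1.5, 1.8, 0.2]       # toy logits from the small tuned model
    anti_expert = [1.5, 1.0, 0.2]  # toy logits from the small untuned model
    shifted = proxy_tuned_logits(base, expert, anti_expert)
    drift = kl_divergence(softmax(shifted), softmax(base))
    print(shifted, round(drift, 4))
```

When the tuned and untuned small models agree, the offset is zero and the base distribution is returned unchanged, so drift from the base is driven entirely by what tuning changed in the small model.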
Sources (11 notes)
Both failures stem from failed exploration-exploitation balance but occur at different timescales requiring structurally distinct interventions. Training-time fixes (entropy bonuses, critique diversity) cannot prevent inference-time variance inflation, and vice versa; both loops must be managed independently.
Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Vector Policy Optimization trains models to emit varied competent solutions rather than converging to one answer. This unlocks search procedures like evolutionary algorithms to explore and combine modes, solving problems that entropy-collapsed policies cannot reach at all.
DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.
FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.