SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

Why do reasoning models fail differently at training versus inference?

Reasoning models exhibit two distinct failure modes—entropy collapse during training and variance inflation during inference—that appear unrelated but may share underlying causes. Understanding these dual problems could reveal whether separate or unified solutions are needed.

Synthesis note · 2026-02-20 · sourced from Test Time Compute
How should we allocate compute budget at inference time?

Two failure modes in the test-time scaling literature look unrelated but share the same underlying mechanism: a failed balance between exploration and exploitation.

Policy entropy collapse (training time): When RL trains a reasoning model, policy entropy drops over time. The model converges to a narrow repertoire of reasoning paths, sacrificing diversity for short-term reward. The result is a model that is overfit to familiar problem types and struggles to explore novel solution strategies. The fix lives in training: entropy bonuses, diverse critique models, or curriculum design that maintains distributional breadth.
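As a minimal sketch of what an entropy bonus does, the snippet below computes the Shannon entropy of a toy policy over four reasoning paths and adds it to the reward objective. The function names, the four-path policies, and the coefficient `beta` are illustrative assumptions, not taken from any cited paper.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def regularized_objective(expected_reward, probs, beta=0.01):
    """Expected reward plus an entropy bonus.

    beta is a hypothetical coefficient; larger values reward broader policies.
    """
    return expected_reward + beta * entropy(probs)

# Toy policies over four candidate reasoning paths (made-up numbers).
early_policy = [0.25, 0.25, 0.25, 0.25]  # still spread across paths
late_policy = [0.97, 0.01, 0.01, 0.01]   # collapsed onto a single path

print(entropy(early_policy))  # equals ln(4), the maximum for four paths
print(entropy(late_policy))   # much lower: the policy has narrowed
```

At equal expected reward, the regularized objective prefers the broader policy, which is the pressure an entropy bonus applies against collapse.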

Variance inflation (inference time): When a reasoning model is given a thinking budget beyond its optimum, output variance grows while quality does not improve. The model doesn't converge on the right answer; it oscillates between candidates. Without the stabilizing feedback of a verifier, the exploration mechanism that training instilled turns into runaway oscillation. The fix lives in inference: parallel sampling instead of sequential extension, confidence-based filtering, or hard token budgets.
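The three inference-time fixes can be combined in one aggregation step. The sketch below is a rough illustration under stated assumptions: `Sample`, `aggregate`, the confidence scores, and the thresholds are all hypothetical, and a sample that ran past the token budget is simply treated as unfinished and dropped.

```python
from collections import Counter
from typing import NamedTuple, Optional, Sequence

class Sample(NamedTuple):
    answer: str
    confidence: float  # hypothetical score in [0, 1], e.g. from a verifier
    tokens_used: int

def aggregate(samples: Sequence[Sample],
              token_budget: int = 2048,
              min_confidence: float = 0.5) -> Optional[str]:
    """Majority vote over parallel samples that pass both filters."""
    kept = [s for s in samples
            if s.tokens_used <= token_budget       # hard token budget
            and s.confidence >= min_confidence]    # confidence-based filter
    if not kept:
        return None
    votes = Counter(s.answer for s in kept)
    return votes.most_common(1)[0][0]

# Five independent samples for one prompt (made-up values).
samples = [
    Sample("42", 0.9, 800),
    Sample("42", 0.7, 1200),
    Sample("17", 0.3, 900),    # dropped: low confidence
    Sample("17", 0.8, 5000),   # dropped: over the token budget
    Sample("13", 0.6, 1000),
]
print(aggregate(samples))
```

The samples are drawn independently rather than by extending one chain, so a single oscillating trajectory cannot dominate the final answer.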

Both failures are manifestations of the same underlying problem: the model is neither confidently right nor productively exploring. It is stuck in an uncertain middle state that wastes compute without generating signal. But because the two failures occur at different timescales, they call for different interventions:

| Failure | Timescale | Mechanism | Fix |
|---------|-----------|-----------|-----|
| Entropy collapse | Training | Policy over-narrows | Critique diversity, entropy bonuses |
| Variance inflation | Inference | Thinking over-extends | Parallel sampling, token limits |

The practical implication: optimizing inference alone (parallel vs sequential, budget allocation) cannot fix a training-time entropy problem. Conversely, training for exploration diversity cannot prevent inference-time variance inflation if the token budget is set too high. Both loops must be managed independently.

Historical vs batch exploration: The Outcome-based Exploration paper adds taxonomic precision to this dual problem. Historical exploration (visiting diverse states during training) improves pass@1 by expanding the training signal; this is the training-time fix. Batch exploration (producing diverse outputs at test time) improves pass@k through broader solution coverage; this is the test-time fix. The mechanisms are structurally different: UCB-style bonuses over outcome space for historical exploration, within-batch repetition penalties for batch exploration. This maps the training/test-time dual directly onto concrete algorithmic prescriptions. See "Does outcome-based RL diversity loss spread across unsolved problems?"
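To make the structural difference concrete, here is a hedged sketch of the two reward-shaping terms. This note does not give the paper's exact formulas, so the count-based bonus `c / sqrt(n + 1)`, the linear repetition penalty, and every name and coefficient below are illustrative assumptions rather than the paper's method.

```python
import math
from collections import Counter

def historical_bonus(outcome, outcome_counts, c=1.0):
    """UCB-style bonus: outcomes seen rarely across training so far earn more."""
    n = outcome_counts[outcome]  # Counter returns 0 for unseen outcomes
    return c / math.sqrt(n + 1)

def batch_penalty(outcomes, lam=0.5):
    """Within-batch penalty: scales with how many other samples share the outcome."""
    in_batch = Counter(outcomes)
    return [lam * (in_batch[o] - 1) / len(outcomes) for o in outcomes]

# Hypothetical state: outcome "A" is common in training history, "B" is rare.
history = Counter({"A": 99, "B": 3})
batch = ["A", "A", "A", "B"]        # final answers of one sampled batch
rewards = [1.0, 1.0, 1.0, 0.0]      # made-up task rewards

penalties = batch_penalty(batch)
shaped = [r + historical_bonus(o, history) - p
          for r, o, p in zip(rewards, batch, penalties)]
print(shaped)
```

The historical term depends on counts that persist across training steps, while the batch term depends only on the current batch, which is why the two target different metrics.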

Training data format as an upstream entropy variable: The note "Does training data format shape reasoning strategy more than domain?" adds a third factor upstream of both. Multiple-choice training produces BFS-like (breadth-first, parallel-path) reasoning; free-form training produces DFS-like (depth-first, sequential) reasoning. Format shapes the default exploration profile before any RL training begins. This means entropy collapse is not solely an RL-time problem: it can be seeded by format choices in the pre-RL training data. A model trained on free-form data before RL starts the RL phase with a depth-first, collapse-prone default strategy. A model trained on multiple-choice data before RL starts with a more diverse exploration strategy. The causal sequence is thus: format decisions → exploration profile → RL collapse rate. Managing entropy requires attending to all three.

Original note title

training-time entropy collapse and test-time variance inflation are dual problems requiring different solutions