SYNTHESIS NOTE

Why do reasoning models fail differently at training versus inference?

Reasoning models exhibit two distinct failure modes—entropy collapse during training and variance inflation during inference—that appear unrelated but may share underlying causes. Understanding these dual problems could reveal whether separate or unified solutions are needed.

Synthesis note · 2026-02-20 · sourced from Test Time Compute

Two failure modes in the test-time scaling literature look unrelated but share the same underlying mechanism: failed exploration-exploitation balance.

Policy entropy collapse (training time): When RL trains a reasoning model, policy entropy drops over time — the model converges to a narrow repertoire of reasoning paths, sacrificing diversity for short-term reward. The result is a model that's overfit to familiar problem types and struggles to explore novel solution strategies. The fix lives in training: entropy bonuses, diverse critique models, or curriculum design that maintains distributional breadth.
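A minimal sketch of what "policy entropy drops" and "entropy bonus" mean in practice, assuming a discrete distribution over four candidate reasoning paths. The function names, the example probabilities, and the coefficient `beta` are illustrative assumptions, not taken from any specific training recipe.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def loss_with_entropy_bonus(policy_loss, probs, beta=0.01):
    """Subtract a scaled entropy term so that higher entropy lowers the loss.

    `beta` is a hypothetical coefficient chosen for illustration.
    """
    return policy_loss - beta * entropy(probs)

# Hypothetical policies over four reasoning paths.
early = [0.25, 0.25, 0.25, 0.25]  # broad: all paths equally likely
late = [0.97, 0.01, 0.01, 0.01]   # collapsed: one path dominates

print(entropy(early), entropy(late))
print(loss_with_entropy_bonus(1.0, early), loss_with_entropy_bonus(1.0, late))
```

With the same base loss, the collapsed policy receives a smaller bonus than the broad one, which is the pressure an entropy term adds against narrowing.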

Variance inflation (inference time): When a reasoning model is given a thinking budget that extends past its optimum, output variance inflates while quality stops improving. The model doesn't converge on the right answer; it oscillates between candidates. Without the stabilizing feedback of a verifier, the exploration mechanism that training instilled becomes runaway oscillation. The fix lives in inference: parallel sampling instead of sequential extension, confidence-based filtering, or hard token budgets.
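One way to picture the inference-side fix is a sketch that combines parallel sampling with confidence-based filtering: several independent short samples are drawn, low-confidence ones are dropped, and the remaining answers are majority-voted. The `aggregate` helper, the `(answer, confidence)` pair format, and the `min_confidence` threshold are assumptions for illustration; how confidence is actually scored is left open here.

```python
from collections import Counter

def aggregate(samples, min_confidence=0.5):
    """Majority vote over (answer, confidence) pairs after a confidence filter.

    Returns None if no sample passes the filter.
    """
    kept = [answer for answer, conf in samples if conf >= min_confidence]
    if not kept:
        return None
    # most_common(1) returns the single most frequent answer among kept samples
    return Counter(kept).most_common(1)[0][0]

# Hypothetical outputs from five parallel samples, each under a hard token budget.
samples = [("42", 0.9), ("41", 0.3), ("42", 0.7), ("17", 0.6), ("41", 0.2)]
print(aggregate(samples))
```

The design choice is that no single chain is extended in the hope that it converges; the budget is spread across chains and the aggregation step supplies the stabilizing signal.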

Both failures are manifestations of the same underlying problem: the model is neither confidently right nor productively exploring — it's stuck in an uncertain middle state that wastes compute without generating signal. But because they occur at different timescales, the interventions are completely different:

| Failure | Timescale | Mechanism | Fix |
|---------|-----------|-----------|-----|
| Entropy collapse | Training | Policy over-narrows | Critique diversity, entropy bonuses |
| Variance inflation | Inference | Thinking over-extends | Parallel sampling, token limits |

The practical implication: optimizing inference alone (parallel vs sequential, budget allocation) cannot fix a training-time entropy problem. Conversely, training for exploration diversity cannot prevent inference-time variance inflation if the token budget is set too high. Both loops must be managed independently.

Historical vs batch exploration: The Outcome-based Exploration paper adds taxonomic precision to this dual problem. Historical exploration (visiting diverse states during training) improves pass@1 via expanded training signal; this is the training-time fix. Batch exploration (producing diverse outputs at test time) improves pass@k via broader solution coverage; this is the test-time fix. The mechanisms are structurally different: UCB-style bonuses over outcome space for historical exploration, within-batch repetition penalties for batch exploration. This maps the training/test-time dual directly onto concrete algorithmic prescriptions. See "Does outcome-based RL diversity loss spread across unsolved problems?"
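The two mechanisms can be contrasted in a small sketch. This is not the paper's formulation: the inverse-square-root count bonus is a generic UCB-style form, the linear repetition penalty is one simple choice, and the names `historical_bonus`, `batch_penalties`, `c`, and `lam` are all hypothetical.

```python
import math
from collections import Counter

def historical_bonus(outcome, history_counts, c=1.0):
    """UCB-style bonus: larger for outcomes seen rarely across training so far."""
    return c / math.sqrt(history_counts[outcome] + 1)

def batch_penalties(batch_outcomes, lam=0.5):
    """Within-batch repetition penalty.

    Each copy of an outcome that appears n times in the batch is
    penalised by lam * (n - 1) / len(batch); unique outcomes get 0.
    """
    counts = Counter(batch_outcomes)
    size = len(batch_outcomes)
    return [lam * (counts[o] - 1) / size for o in batch_outcomes]

# Hypothetical data: final answer "12" was reached 8 times earlier in training.
history = Counter({"12": 8})
print(historical_bonus("12", history), historical_bonus("7", history))
print(batch_penalties(["12", "12", "7", "12"]))
```

The bonus depends on counts accumulated over the whole training run, while the penalty depends only on the current batch, which is why the two target different metrics (pass@1 versus pass@k) in the note's framing.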

Training data format as an upstream entropy variable: The note "Does training data format shape reasoning strategy more than domain?" adds a third factor upstream of both. Multiple-choice training produces BFS-like (breadth-first, parallel-path) reasoning; free-form training produces DFS-like (depth-first, sequential) reasoning. Format shapes the default exploration profile before any RL training begins. This means entropy collapse is not solely a problem of RL training; it can be seeded by format choices in the pre-RL training data. A model trained on free-form data before RL starts the RL phase with a depth-first, collapse-prone default strategy. A model trained on multiple-choice data before RL starts with a more diverse exploration strategy. The causal sequence is thus: format decisions → exploration profile → RL collapse rate. Managing entropy requires attending to all three.

Inquiring lines that read this note 14

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How does policy entropy collapse constrain reasoning-focused reinforcement learning?

How can AI systems learn from failures without cascading errors?

Does domain specialization cause models to lose capabilities elsewhere?

What distinguishes domain-specific failure modes from general model limitations?

How should models express uncertainty rather than forced confident answers?

Do base models and reasoning models fail in opposite directions on uncertainty?

What are the consequences of models training on synthetic data?

Does model collapse occur across different architectures or only in specific conditions?

Why does training format shape reasoning strategy more than domain content?

Does training data format determine whether models collapse entropy or inflate variance?

How do policy learning algorithm choices affect multi-objective optimization stability?

Why does vanilla GRPO cause mode collapse in hybrid reasoning settings?

Related concepts in this collection 11

Does policy entropy collapse limit reasoning performance in RL? As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
the training-time failure
Does extended thinking actually improve reasoning or just increase variance? When models think longer, do they reason better, or do they simply sample from a wider distribution of outputs that happens to cover correct answers more often? This matters because it determines whether test-time compute is genuinely scaling reasoning capability.
the inference-time failure
Do critique models improve diversity during training itself? Explores whether critique integrated into the training loop, beyond test-time scoring, actively maintains solution diversity and prevents the model from converging too narrowly during iterative self-training.
the training-side intervention
Why does parallel reasoning outperform single chain thinking? Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
the inference-side intervention
Why do LLMs generate novel ideas from narrow ranges? LLM research agents produce individually novel ideas but cluster them in homogeneous sets. This explores why high average novelty coexists with poor diversity coverage and what it means for automated ideation.
third instance of the same mechanism at generation time: goal-directed optimization pressure narrows output diversity even when average quality is high; suggests the exploration-exploitation failure is not TTS-specific but a general property of optimization pressure on LLM outputs
Does RL training collapse format diversity in pretrained models? Exploring whether RL fine-tuning systematically selects one output format from pretraining while suppressing others, and how this selection mechanism drives performance gains.
adds a format-level selection mechanism upstream of entropy collapse: RL does not just narrow diversity within a distribution but selects which pretraining distribution survives, making format collapse a precondition for the within-distribution entropy collapse this note describes
Does self-consistency reliably reward correct answers during training? Self-consistency initially correlates with correctness, but as models train on this signal, do they eventually learn to maximize consistency itself rather than accuracy? When does this proxy reward stop working?
self-consistency as reward is a specific mechanism that drives training-time entropy collapse: optimizing for inter-sample agreement directly incentivizes the model to narrow its output distribution, making this reward signal an active cause of the collapse problem rather than merely vulnerable to it
How quickly do errors compound during model self-training? When LLMs train on their own outputs without verification, do small mistakes amplify exponentially? This matters because it determines whether unsupervised self-improvement is even feasible.
a third training-time failure mode distinct from both entropy collapse and variance inflation: entropy collapse is diversity loss, error avalanching is accuracy degradation from compounding self-training errors — both operate at training time but through different mechanisms and require different interventions (entropy bonuses vs external verification)
Why do reasoning LLMs fail at deeper problem solving? Explores whether current reasoning models systematically search solution spaces or merely wander through them, and how this affects their ability to solve increasingly complex problems.
inference-time instantiation: wandering is the behavioral consequence of the exploration-exploitation failure at test time; training-time entropy collapse narrows the strategy repertoire, inference-time wandering is the symptom
Do reasoning models switch between ideas too frequently? Research explores whether o1-like models abandon promising reasoning paths prematurely by switching to different approaches without sufficient depth, and whether penalizing such transitions could improve accuracy.
a third face of the exploration-exploitation problem: underthinking is insufficient exploitation (abandoning promising paths), entropy collapse is insufficient exploration (narrowing strategies), variance inflation is runaway exploration (oscillating without convergence)
Why do models produce less uncertain outputs on their own text? Post-trained language models show 3-4x lower output entropy when continuing their own generations versus prefilled text. This explores what mechanism drives that confidence gap and whether it reflects genuine self-recognition.
extends: adds a third entropy regime — on-policy vs off-policy recognition — beyond train/test

Original note title

training-time entropy collapse and test-time variance inflation are dual problems requiring different solutions