Does self-distillation harm mathematical reasoning performance?
Self-distillation usually improves models while shortening outputs, but mathematical reasoning shows a puzzling exception: performance drops by up to 40%. What mechanism explains this counterintuitive degradation?
Self-distillation has emerged as an effective post-training paradigm: it usually improves performance while shortening reasoning traces, a clean win on both axes. The paper "Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?" documents a counter-finding. In mathematical reasoning, self-distillation can shorten responses while degrading performance, with drops of up to 40% across Qwen3, DeepSeek-Distill-Qwen, and Olmo3.
The mechanism the paper identifies is the suppression of epistemic verbalization. Strong reasoning models like DeepSeek-R1 frequently express uncertainty mid-trace using tokens like "Wait" or "Hmm." These tokens look like noise: they do not directly advance the argument, and they add length without obvious content. The standard intuition is that distilling toward shorter, more confident traces should be an improvement: same answers, less verbosity, lower inference cost.
The empirical finding contradicts this. Removing the uncertainty tokens removes the signal that a reasoning path may be flawed. When the student model is distilled away from epistemic verbalization, it loses the ability to flag and self-correct its own faulty reasoning paths. The shorter, more confident traces are correlated with worse performance on out-of-distribution problems where the model would have benefited from pausing to verbalize doubt.
This reframes "Wait" and "Hmm" tokens. They are not stylistic noise to be optimized away; they are markers of a corrective mechanism, the surface signature of the model noticing that something is off and adjusting course. Compressing the trace by removing them removes an internal control structure.
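One way to make "epistemic verbalization" concrete is to count uncertainty markers in a reasoning trace before and after distillation. The sketch below is only an illustration, not the paper's measurement: the marker list (`EPISTEMIC_MARKERS`) contains just the two tokens this note names, and the function name `epistemic_marker_stats` and the word-level rate are assumptions made for the example.

```python
import re

# Assumption: only the two markers named in this note. A real study would
# likely use a broader, validated list.
EPISTEMIC_MARKERS = ("wait", "hmm")

# Whole-word, case-insensitive match, so "waiting" is not counted as "wait".
_MARKER_RE = re.compile(
    r"\b(" + "|".join(EPISTEMIC_MARKERS) + r")\b", re.IGNORECASE
)
_WORD_RE = re.compile(r"\w+")


def epistemic_marker_stats(trace: str) -> dict:
    """Return the word count, marker count, and markers-per-word for a trace."""
    n_words = len(_WORD_RE.findall(trace))
    n_markers = len(_MARKER_RE.findall(trace))
    return {
        "words": n_words,
        "markers": n_markers,
        "rate": n_markers / n_words if n_words else 0.0,
    }


if __name__ == "__main__":
    verbose = "Let x = 3. Wait, that gives 10, not 9. Hmm, try x = 2."
    terse = "Let x = 2. The answer is 4."
    print(epistemic_marker_stats(verbose))
    print(epistemic_marker_stats(terse))
```

Comparing the `rate` field for the same prompts across a base model and its distilled student would give a rough view of how much of the uncertainty signal was compressed away, under the assumption that these surface tokens are a usable proxy.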
The implication for self-distillation design is sharp. Distillation that uses richly conditioned teachers produces confident, concise students. Such students do well on in-distribution problems, where confidence is warranted, and fail on out-of-distribution problems, where uncertainty would have been the right response. The distillation regime needs to preserve the uncertainty channel, not just optimize for shorter correct outputs.
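As a rough illustration of what "preserving the uncertainty channel" could mean when choosing distillation targets, the sketch below contrasts a length-only selector with one that also requires a minimum number of uncertainty markers. This is a hypothetical heuristic, not the method of the paper: `Candidate`, `reference_markers` (assumed here to be the marker count of the student's own pre-distillation trace for the same prompt), and `keep_fraction` are all names and parameters invented for the example.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence

# Assumption: the same two markers named in this note.
MARKER_RE = re.compile(r"\b(wait|hmm)\b", re.IGNORECASE)


@dataclass
class Candidate:
    """A hypothetical teacher trace for one prompt."""
    text: str
    correct: bool


def marker_count(text: str) -> int:
    return len(MARKER_RE.findall(text))


def _length(c: Candidate) -> int:
    return len(c.text.split())


def shortest_correct(cands: Sequence[Candidate]) -> Optional[Candidate]:
    """Length-only selection: the shortest correct trace, or None."""
    correct = [c for c in cands if c.correct]
    return min(correct, key=_length, default=None)


def shortest_correct_keeping_markers(
    cands: Sequence[Candidate],
    reference_markers: int,
    keep_fraction: float = 0.5,
) -> Optional[Candidate]:
    """Shortest correct trace that keeps at least keep_fraction of the
    reference marker count; None if no candidate qualifies."""
    floor = keep_fraction * reference_markers
    kept = [c for c in cands if c.correct and marker_count(c.text) >= floor]
    return min(kept, key=_length, default=None)


if __name__ == "__main__":
    cands = [
        Candidate("x = 2. Answer: 4", True),
        Candidate("x = 3. Wait, that fails. x = 2. Answer: 4", True),
        Candidate("Hmm. Wait. Answer: 5", False),
    ]
    print(shortest_correct(cands).text)
    print(shortest_correct_keeping_markers(cands, reference_markers=2).text)
```

The length-only rule picks the terse trace, while the marker-aware rule picks the trace that still verbalizes doubt. Whether a surface-level floor like this actually preserves self-correction ability is an open question that the sketch does not answer.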
Inquiring lines that use this note as a source (7)
This note is a source for these synthesized inquiries.
- Can self-distillation reduce catastrophic forgetting in continual learning?
- How does self-distillation differ from standard fine-tuning approaches?
- Why does self-distillation suppress epistemic verbalization in student models?
- What makes policy self-distillation more effective than external teacher distillation?
- How do failure examples improve distillation compared to successful trajectories alone?
- How does self-distillation degrade reasoning by suppressing uncertainty signals?
- How can distillation preserve uncertainty expression instead of optimizing it away?
Related concepts in this collection (4)
- Does richer teacher context hurt student generalization?
  When teachers are given more information during distillation, they produce confident but brittle students. Does this trade-off between in-domain wins and out-of-distribution robustness hold across different task distributions?
  same paper, the mechanism that produces the degradation
- Can post-training objectives preserve reasoning style alongside correctness?
  Even mathematically sound training objectives may suppress reasoning behaviors like uncertainty expression without penalizing them. Does optimizing for answer correctness inadvertently degrade the stylistic features that enable generalization?
  same paper, the broader methodology implication
- Do reflection tokens carry more information about correct answers?
  Explores whether tokens expressing reflection and transitions concentrate information about reasoning outcomes disproportionately compared to other tokens, and what role they play in reasoning performance.
  directly supports: empirical evidence that Wait/Hmm/Therefore tokens carry disproportionate information; this paper shows what happens when they are suppressed
- Does chain-of-thought reasoning reveal genuine inference or pattern matching?
  Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
  adjacent: the broader CoT critique frame
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?
- Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
- The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
- Rethinking Thinking Tokens: LLMs as Improvement Operators
- LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters!
- LLMs can implicitly learn from mistakes in-context
- Can Large Reasoning Models Self-Train?
- SSRL: Self-Search Reinforcement Learning
Original note title
self-distillation can degrade reasoning by suppressing epistemic verbalization — Wait and Hmm tokens carry uncertainty signal not noise