SYNTHESIS NOTE

Does self-distillation harm mathematical reasoning performance?

Self-distillation usually improves models while shortening outputs, but mathematical reasoning shows a puzzling exception: performance drops by up to 40%. What mechanism explains this counter-intuitive degradation?

Synthesis note · 2026-05-18 · sourced from Training Fine Tuning

Self-distillation has emerged as an effective post-training paradigm: it usually improves performance while shortening reasoning traces, which is a clean win. The paper "Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?" documents a counter-finding: in mathematical reasoning, self-distillation can reduce response length while degrading performance, with drops of up to 40% on Qwen3, DeepSeek-Distill-Qwen, and Olmo3.

The mechanism is suppression of epistemic verbalization. Strong reasoning models like DeepSeek-R1 frequently express uncertainty mid-trace using tokens like "Wait" or "Hmm." These tokens look like noise: they do not directly advance the argument, and they add length without obvious content. The standard intuition is that distilling toward shorter, more confident traces should be an improvement: same answers, less verbosity, lower inference cost.

The empirical finding contradicts this. Removing the uncertainty tokens removes the signal that a reasoning path may be flawed. When the student model is distilled away from epistemic verbalization, it loses the ability to flag and self-correct its own faulty reasoning paths. The shorter, more confident traces are correlated with worse performance on out-of-distribution problems where the model would have benefited from pausing to verbalize doubt.

This reframes "Wait" and "Hmm" tokens. They are not stylistic noise to be optimized away; they are markers of a corrective mechanism, the surface signature of the model noticing that something is off and adjusting course. Compressing the trace by removing them removes an internal control structure.
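To make the idea of a surface signature concrete, the sketch below measures how often such markers appear in a reasoning trace. It is only an illustration, not the paper's measurement: the function name marker_rate and the example traces are hypothetical, and the marker list contains only the two tokens this note names ("Wait" and "Hmm").

```python
import re

# Markers named in this note. Extending the list is an assumption
# the reader would need to justify for their own model.
EPISTEMIC_MARKERS = ("wait", "hmm")


def marker_rate(trace: str, markers=EPISTEMIC_MARKERS) -> float:
    """Return the number of epistemic markers per 100 words of a trace."""
    # Lowercase the trace and keep alphabetic words only (digits are skipped).
    words = re.findall(r"[a-z']+", trace.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in markers)
    return 100.0 * hits / len(words)


# Hypothetical traces, written for illustration only.
verbose_trace = "Let x = 3. Wait, that gives 10, not 9. Hmm, try x = 2. Then the sum is 9."
concise_trace = "Let x = 3. Then the sum is 9."

print(marker_rate(verbose_trace))
print(marker_rate(concise_trace))  # 0.0
```

A rate that falls to zero after distillation would be one cheap, surface-level hint that the corrective markers were compressed away. It says nothing by itself about accuracy.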

The implication for self-distillation design is sharp. Distillation that uses richly conditioned teachers produces confident, concise students. Such students do well on in-distribution problems where confidence is warranted. They fail on out-of-distribution problems where uncertainty would have been the right response. The distillation regime needs to preserve the uncertainty channel, not just optimize for shorter correct outputs.
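One way to monitor whether a distillation run preserves the uncertainty channel is to compare teacher and student traces on the same prompts. The sketch below is an assumption-laden illustration, not a method from the paper: the names ChannelReport and uncertainty_channel_report, the 0.5 retention threshold, and the example traces are all hypothetical.

```python
import re
from dataclasses import dataclass

# Markers named in this note; any extension is an assumption.
MARKERS = frozenset({"wait", "hmm"})


def count_words_and_markers(trace: str):
    """Return (word count, marker count) for one trace."""
    words = re.findall(r"[a-z']+", trace.lower())
    return len(words), sum(1 for word in words if word in MARKERS)


@dataclass
class ChannelReport:
    length_ratio: float      # student words / teacher words
    marker_retention: float  # student markers / teacher markers
    suppressed: bool         # True when retention falls below the threshold


def uncertainty_channel_report(teacher_traces, student_traces, min_retention=0.5):
    """Compare paired traces; min_retention is a hypothetical threshold."""
    t_words = t_marks = s_words = s_marks = 0
    for teacher, student in zip(teacher_traces, student_traces):
        words, marks = count_words_and_markers(teacher)
        t_words += words
        t_marks += marks
        words, marks = count_words_and_markers(student)
        s_words += words
        s_marks += marks
    length_ratio = s_words / t_words if t_words else 0.0
    # If the teacher never verbalizes doubt, there is nothing to suppress.
    retention = s_marks / t_marks if t_marks else 1.0
    return ChannelReport(length_ratio, retention, retention < min_retention)


# Hypothetical paired traces, for illustration only.
teacher_traces = ["Wait, recheck the sum. Hmm, it is 9.", "The answer is 4."]
student_traces = ["It is 9.", "The answer is 4."]
report = uncertainty_channel_report(teacher_traces, student_traces)
print(report)
```

The design choice here is to report length and marker retention side by side, so a run that shortens traces mainly by dropping doubt markers is distinguishable from one that shortens them while keeping the markers.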

Inquiring lines that read this note 8

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Why do continual learning scenarios trigger catastrophic forgetting and interference?

Can self-distillation reduce catastrophic forgetting in continual learning?

What are the consequences of models training on synthetic data?

How should models express uncertainty rather than forced confident answers?

Why does self-distillation suppress epistemic verbalization in student models?

How can AI systems learn from failures without cascading errors?

How do failure examples improve distillation compared to successful trajectories alone?

Why does self-revision increase model confidence while degrading accuracy?

How does self-distillation degrade reasoning by suppressing uncertainty signals?

What makes weaker teacher models effective for stronger student training?

How can distillation preserve uncertainty expression instead of optimizing it away?

Can language model RL training avoid reward hacking and misalignment?

Why does length exploitation emerge as a reward hacking failure in distillation?

Related concepts in this collection 4

Does richer teacher context hurt student generalization? When teachers are given more information during distillation, they produce confident but brittle students. Does this trade-off between in-domain wins and out-of-distribution robustness hold across different task distributions?
same paper, the mechanism that produces the degradation
Can post-training objectives preserve reasoning style alongside correctness? Even mathematically sound training objectives may suppress reasoning behaviors like uncertainty expression without penalizing them. Does optimizing for answer correctness inadvertently degrade the stylistic features that enable generalization?
same paper, the broader methodology implication
Do reflection tokens carry more information about correct answers? Explores whether tokens expressing reflection and transitions concentrate information about reasoning outcomes disproportionately compared to other tokens, and what role they play in reasoning performance.
directly supports: empirical evidence that Wait/Hmm/Therefore tokens carry disproportionate information; this paper shows what happens when they are suppressed
Does chain-of-thought reasoning reveal genuine inference or pattern matching? Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
adjacent: the broader CoT critique frame
