INQUIRING LINE

How does self-distillation degrade reasoning by suppressing uncertainty signals?

This explores why training a model on its own polished outputs (self-distillation) can make it reason worse — specifically by erasing the hesitation markers and confidence cues that flag when reasoning is going wrong.


The corpus traces the degradation to a single mechanism: the loss of the uncertainty signals a model needs to catch its own mistakes. The core finding is that self-distillation strips out epistemic markers like "Wait" and "Hmm", the tokens where a model pauses and reconsiders a flawed path (see: Does self-distillation harm mathematical reasoning performance?). Those pauses look like noise if you optimize for confident, concise answers, but they're load-bearing: they enable self-correction on unfamiliar (out-of-distribution) problems. Remove them and you trade robustness for fluent overconfidence.
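To make the marker-loss claim concrete, one could measure how often hesitation tokens appear in a trace before and after distillation. The sketch below is a toy illustration, not the method of any cited paper: the marker list, the `marker_rate` helper, and both example traces are assumptions made up here.

```python
import re

# Assumed marker list for illustration; the source names only "Wait" and "Hmm".
EPISTEMIC_MARKERS = ("wait", "hmm")


def marker_rate(trace: str, markers=EPISTEMIC_MARKERS) -> float:
    """Return the number of epistemic markers per 100 words in a trace."""
    words = re.findall(r"[a-z']+", trace.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in markers)
    return 100.0 * hits / len(words)


# Hypothetical traces: one with visible hesitation, one polished.
original = "Let x = 3. Wait, that breaks the constraint. Hmm, try x = 2 instead."
distilled = "Let x = 2. The constraint holds, so the answer is 2."

print(f"original:  {marker_rate(original):.1f} markers per 100 words")
print(f"distilled: {marker_rate(distilled):.1f} markers per 100 words")
```

A falling rate between a model's original traces and its self-distilled traces would be consistent with the suppression described above, though a word count alone cannot show that self-correction was lost.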

What makes this more than a one-paper curiosity is that the same trade-off shows up under a different name in teacher–student distillation. When a teacher is conditioned on the correct answer and verifier output, it produces crisp, confident traces, and students inherit that confident style, gaining in-domain accuracy while losing the epistemic caution that generalization to hard, novel problems requires (see: Does richer teacher context hurt student generalization?). Self-distillation is essentially the model becoming its own over-confident teacher. The degradation isn't about losing knowledge; it's about losing the *expression* of doubt.

The deeper puzzle is that the uncertainty information doesn't fully disappear; it just stops being verbalized. Post-trained models produce 3–4× lower output entropy on their own generated text, driven by an internal surprise signal that quietly shapes the output distribution without ever surfacing as a word (see: Why do models produce less uncertain outputs on their own text?). So self-distillation pushes uncertainty from the visible reasoning trace down into silent internal states, where it can no longer trigger the explicit "wait, let me reconsider" behavior that self-correction depends on. There's a related architectural hint that models naturally *do* mark difficulty: hidden states sparsify under out-of-distribution load as an adaptive filter (see: Do language models sparsify their activations under difficult tasks?). That suggests the uncertainty machinery exists but gets muted rather than removed.
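The "lower entropy" claim above refers to the Shannon entropy of the model's next-token distribution. The following is a minimal sketch of how such a comparison could be computed; the helper names and the toy probability vectors are assumptions for illustration and are not meant to reproduce the reported 3–4× figure.

```python
import math


def shannon_entropy(probs):
    """Entropy in nats of a single next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(distributions):
    """Average per-token entropy over a sequence of distributions."""
    return sum(shannon_entropy(d) for d in distributions) / len(distributions)


# Toy distributions over a 4-token vocabulary (made-up numbers).
# "Foreign" text: the model is fairly unsure what comes next.
foreign_text = [[0.4, 0.3, 0.2, 0.1], [0.25, 0.25, 0.25, 0.25]]
# "Own" text: the model's predictions are sharply peaked.
own_text = [[0.9, 0.05, 0.03, 0.02], [0.85, 0.1, 0.03, 0.02]]

ratio = mean_entropy(foreign_text) / mean_entropy(own_text)
print(f"entropy ratio (foreign / own): {ratio:.2f}")
```

With real models the distributions would come from the softmax over logits at each position; the point of the sketch is only that a sharply peaked distribution carries its confidence in the numbers, with nothing in the text itself signalling doubt.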

The constructive flip side: if suppressing confidence signals breaks reasoning, surfacing them can repair it. Using a model's own answer-span confidence as a reward signal restores calibration while strengthening step-by-step reasoning, the inverse of the distillation pathology (see: Can model confidence work as a reward signal for reasoning?). Confidence variance can even be read live to steer between overthinking and underthinking without any retraining (see: Can confidence patterns reveal overthinking versus underthinking?), and small models explicitly trained to hold and act on uncertainty (by abstaining when unsure) can match models ten times their size on conversation forecasting (see: Can models learn to abstain when uncertain about predictions?). Calibration, in other words, is a trainable capability that standard pipelines leave underdeveloped, and that self-distillation actively erodes.
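The confidence-as-reward idea above can be sketched as ranking several sampled traces by how confident the model was on the answer tokens, then turning that ranking into preference pairs. This is a rough illustration under stated assumptions, not the RLSF implementation: the function names, the geometric-mean confidence score, and the input format are all hypothetical.

```python
import math
from itertools import combinations


def answer_span_confidence(token_logprobs, answer_start):
    """Geometric-mean probability of the tokens in the answer span.

    `token_logprobs` holds one log-probability per generated token and
    `answer_start` is the index where the final answer begins (assumed known).
    """
    span = token_logprobs[answer_start:]
    if not span:
        raise ValueError("answer span is empty")
    return math.exp(sum(span) / len(span))


def preference_pairs(traces):
    """Turn (trace_id, token_logprobs, answer_start) tuples into
    (chosen_id, rejected_id) pairs, preferring higher answer confidence."""
    scored = sorted(
        ((answer_span_confidence(lp, start), tid) for tid, lp, start in traces),
        reverse=True,
    )
    # After sorting in descending order, each combination is (higher, lower).
    return [(hi[1], lo[1]) for hi, lo in combinations(scored, 2) if hi[0] > lo[0]]


# Made-up log-probabilities for three sampled traces of the same problem.
samples = [
    ("a", [-0.1, -0.2, -0.05], 1),
    ("b", [-0.1, -1.0, -2.0], 1),
    ("c", [-0.3, -0.5, -0.4], 1),
]
pairs = preference_pairs(samples)
print(pairs)
```

The resulting pairs could feed any preference-based training objective. Whether a geometric mean is the right confidence score is a design choice made here for simplicity.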

The surprising takeaway for a curious reader: longer or more confident reasoning is not automatically better reasoning. Accuracy peaks at an intermediate chain length and declines past it (see: Why does chain of thought accuracy eventually decline with length?), and a small minority of high-entropy "forking" tokens, roughly 20% of tokens and the very moments of expressed uncertainty, carry most of the learning signal in reasoning models (see: Do high-entropy tokens drive reasoning model improvements?). Self-distillation's quiet harm is that it smooths away exactly those forks, leaving a model that sounds more certain and reasons less safely.
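One way to picture the "forking token" idea is as a mask that keeps only the highest-entropy positions in a sequence. The sketch below assumes per-token entropies are already available and uses a 20% cutoff to echo the figure in the source note; the `forking_mask` name and the example values are hypothetical.

```python
def forking_mask(entropies, fraction=0.2):
    """Return a list of booleans marking the top `fraction` of positions
    by entropy (at least one position for a non-empty input)."""
    if not entropies:
        return []
    k = max(1, round(len(entropies) * fraction))
    # Rank positions from highest to lowest entropy and keep the first k.
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    chosen = set(ranked[:k])
    return [i in chosen for i in range(len(entropies))]


# Made-up per-token entropies for a 10-token reasoning trace.
entropies = [0.1, 1.2, 0.05, 0.3, 0.9, 0.02, 0.4, 0.01, 0.2, 0.15]
mask = forking_mask(entropies)
print([i for i, keep in enumerate(mask) if keep])
```

In a training loop, a mask like this could restrict the loss to the selected positions. A pipeline that rewards uniformly confident text would tend to flatten the entropy profile that this mask relies on.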


Sources (9 notes)

Does self-distillation harm mathematical reasoning performance?

Self-distillation reduces performance in mathematical reasoning by eliminating epistemic markers like "Wait" and "Hmm" tokens that flag flawed reasoning paths. These tokens enable self-correction on out-of-distribution problems, so removing them sacrifices robustness for confident brevity.

Does richer teacher context hurt student generalization?

Teachers conditioned on correct answers and verifier output produce confident, concise traces that students inherit. This style suppresses uncertainty expression, optimizing in-domain performance while degrading generalization to out-of-distribution problems that require epistemic caution.

Why do models produce less uncertain outputs on their own text?

Post-trained models produce 3-4x lower output entropy on their own generations, driven by an internal representation of input surprise that causally modulates confidence. This implicit self-recognition signal appears without being verbalized, encoded directly in the output distribution.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Can confidence patterns reveal overthinking versus underthinking?

ReBalance uses confidence variance and overconfidence as diagnostic signals to apply training-free steering vectors that reduce overthinking redundancy while promoting exploration during underthinking, improving accuracy across models from 0.5B to 32B parameters.

Can models learn to abstain when uncertain about predictions?

Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-robustness researcher. The question remains open: *Does self-distillation necessarily degrade reasoning, or have newer training methods, eval practices, or architectural choices since ~early 2026 found ways to preserve uncertainty signals while keeping efficiency gains?*

What a curated library found — and when (dated claims, not current truth): Findings span 2024–2026.
• Self-distillation strips epistemic markers ("Wait", "Hmm") that enable self-correction on OOD problems, trading robustness for confident fluency (2026-03, arXiv:2603.24472).
• Post-trained models produce 3–4× lower output entropy on their own generations, so uncertainty persists internally without being verbalized; separately, students distilled from teachers conditioned on correct answers inherit a confident style but lose caution (2026-05).
• Optimal chain-of-thought length follows an inverted-U curve; accuracy declines past intermediate length, yet standard distillation pushes toward shorter, more confident traces (2025-02, arXiv:2502.07266).
• High-entropy "forking" tokens (moments of expressed uncertainty) carry disproportionate learning signal in RL-trained reasoning models; self-distillation smooths them away (2025-06, arXiv:2506.01939).
• Model confidence can be used as intrinsic reward to restore calibration and step-by-step reasoning, reversing the distillation pathology (2026-03, arXiv:2603.12372).

Anchor papers (verify; mind their dates):
• arXiv:2603.24472 (2026-03) — Direct investigation of self-distillation degradation.
• arXiv:2502.07266 (2025-02) — Chain-of-thought length and reasoning optimality.
• arXiv:2506.01939 (2025-06) — High-entropy tokens as learning drivers.
• arXiv:2605.25459 (2026-05) — Models recognizing and reacting to their own uncertainty.

Your task:
(1) RE-TEST THE CORE CONSTRAINT: Has post-training via RL from self-feedback, confidence-as-reward methods, or orchestration techniques (multi-pass verification, branching samplers, adaptive compute) since mid-2026 managed to preserve or recover epistemic signals *without* sacrificing efficiency? Does the inverted-U still hold for modern scaling? Where does uncertainty suppression still bite?
(2) Surface the strongest *contradicting or reconciling* work from the last ~6 months: papers showing either (a) self-distillation can work if uncertainty is explicitly preserved, or (b) the mechanism is something other than epistemic-signal loss.
(3) Propose two research questions assuming the regime may have shifted: (i) Can uncertainty be encoded as a *learned latent code* rather than verbalized, allowing distillation without loss? (ii) Do mixture-of-experts or sparse routing naturally preserve forking tokens better than dense distillation?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
