INQUIRING LINE

Why does self-distillation suppress epistemic verbalization in student models?

This explores why training a model on its own outputs (self-distillation) tends to strip out the verbal hesitation markers — the "Wait," "Hmm," second-guessing — that signal uncertainty, and why losing those words costs the model something real.


The short answer from the corpus: distillation optimizes for confident, concise traces, and the tokens that express doubt are exactly what get smoothed away. In math reasoning, removing those epistemic markers measurably degrades performance, because they aren't filler; they're the model flagging a flawed reasoning path so it can self-correct. Cut them, and you trade robustness on hard or out-of-distribution problems for fluent brevity on easy ones (see "Does self-distillation harm mathematical reasoning performance?").
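One rough way to make "the doubt tokens get smoothed away" concrete is to count hesitation markers per 100 words in a trace before and after distillation. The sketch below is illustrative only: the marker list contains just the two words the source names ("Wait" and "Hmm"), the function name `marker_density` is hypothetical, and both example traces are invented for the demonstration rather than taken from any model.

```python
import re

# Assumption: only the two markers named in the text; a real study would use a longer list.
EPISTEMIC_MARKERS = ("wait", "hmm")


def marker_density(trace: str, markers=EPISTEMIC_MARKERS) -> float:
    """Return the number of epistemic markers per 100 words in a reasoning trace."""
    # Lowercase and keep alphabetic words only, so "Wait," matches "wait".
    words = re.findall(r"[a-z']+", trace.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in markers)
    return 100.0 * hits / len(words)


# Invented example traces (not real model output).
hesitant_trace = (
    "Let x = 3. Wait, that fails the second equation. "
    "Hmm, try x = 2. Yes, x = 2 works."
)
distilled_trace = "Let x = 2. Then both equations hold, so x = 2."

print("hesitant: ", marker_density(hesitant_trace))
print("distilled:", marker_density(distilled_trace))
```

A drop in this number between a model's pre-distillation and post-distillation traces would be a surface symptom of the suppression described above. It says nothing on its own about whether reasoning quality changed, which is why the source pairs it with performance on hard and out-of-distribution problems.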

The deeper mechanism shows up when you look at where this confident style comes from. Teachers that are conditioned on the correct answer (or on a verifier's output) produce traces that are short and sure of themselves: there was never any doubt to express, because the answer was known upfront. Students inherit that style wholesale, including its absence of caution. So suppression isn't a bug in self-distillation so much as a faithful copy of a teacher who had no reason to hesitate (see "Does richer teacher context hurt student generalization?"). The student learns the *surface form* of confidence without the underlying knowledge state that would justify it.

Here's the thing you might not expect: the verbalized doubt and the actual self-knowledge are separable. Models carry internal mechanisms, entity-recognition features that track whether they actually know a fact, that causally steer hallucination and refusal, and these persist from base models into fine-tuned chat versions (see "Do models know what they don't know?"). But other work suggests reasoning itself can run in latent space without being spoken aloud at all, implying verbalization is partly a *training artifact* rather than a hard requirement of thinking (see "Can models reason without generating visible thinking tokens?"). Put those together and self-distillation looks like it's pruning the externalized trace while leaving the internal signal stranded: the model may still "know" it's unsure, but it no longer says so, and saying so was what enabled mid-stream correction.

That matters because a model's spoken self-reports are already a shaky proxy for its real state. LLM self-reports largely echo training distributions rather than genuine introspection (see "Can language models actually introspect about their own states?"), and models lack stable self-knowledge: they shift beliefs under conversational pressure, and users over-trust their confident outputs regardless of accuracy (see "How well do language models understand their own knowledge?"). Self-distillation pushes hard in the dangerous direction here: it makes the output *more* confident-sounding while making it a *worse* indicator of actual certainty. The reader who came in worried about a niche math-reasoning result should leave seeing the broader hazard: every training step that rewards confident brevity is quietly widening the gap between how sure a model sounds and how sure it has any right to be.
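The gap between how sure a model sounds and how often it is right can be put on a number line with a standard calibration measure. The sketch below computes expected calibration error (ECE): answers are grouped into equal-width confidence bins, and the absolute difference between average confidence and accuracy in each bin is averaged, weighted by bin size. This is a generic metric, not something the source notes report using. The confidence values are assumed to be elicited from the model in some way (for example, a stated probability), and all numbers below are invented for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean of |accuracy - mean confidence| over equal-width confidence bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Assign each answer to a bin by its confidence; 1.0 falls into the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Invented data: both hypothetical models get the same two of four answers right.
outcomes = [True, False, True, False]
confident_sounding = [0.95, 0.95, 0.95, 0.95]  # sounds sure every time
hedged = [0.5, 0.5, 0.5, 0.5]                  # sounds as unsure as it should

print("confident-sounding ECE:", expected_calibration_error(confident_sounding, outcomes))
print("hedged ECE:            ", expected_calibration_error(hedged, outcomes))
```

With identical accuracy, the confident-sounding answers score a much larger error than the hedged ones. Under these assumptions, that is the shape of the hazard described above: training that raises stated confidence without raising accuracy makes the output a worse indicator of certainty.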


Sources (6 notes)

Does self-distillation harm mathematical reasoning performance?

Self-distillation reduces performance in mathematical reasoning by eliminating epistemic markers like "Wait" and "Hmm" tokens that flag flawed reasoning paths. These tokens enable self-correction on out-of-distribution problems, so removing them sacrifices robustness for confident brevity.

Does richer teacher context hurt student generalization?

Teachers conditioned on correct answers and verifier output produce confident, concise traces that students inherit. This style suppresses uncertainty expression, optimizing in-domain performance while degrading generalization to out-of-distribution problems that require epistemic caution.

Do models know what they don't know?

Sparse autoencoders revealed that language models develop causal mechanisms for detecting whether they know facts about entities. These mechanisms actively steer both hallucination and refusal behavior, and persist from base models into finetuned chat versions.

Can models reason without generating visible thinking tokens?

Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.

Can language models actually introspect about their own states?

LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.

How well do language models understand their own knowledge?

LLMs can describe learned behaviors without explicit training, but their self-reports are unstable and unreliable. Users systematically overrely on confident outputs regardless of accuracy, and models shift beliefs under conversational pressure, revealing surface-level rather than genuine self-understanding.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher re-evaluating claims about self-distillation's effect on epistemic verbalization in LLMs. The question remains: why does training on a model's own outputs suppress uncertainty markers, and does this actually degrade reasoning robustness?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026 and include:
- Self-distillation removes epistemic markers ("Wait," "Hmm") that flag flawed reasoning paths; this measurably degrades performance on hard/OOD problems by trading robustness for fluent brevity (~2026, arXiv:2603.24472).
- Teachers conditioned on correct answers produce confident, short traces with no hesitation; students copy this surface form without underlying knowledge justification (~2026).
- Internal entity-recognition mechanisms that track genuine self-knowledge persist through fine-tuning and causally steer hallucination, but verbalization is partly a training artifact; reasoning can run latent without being externalized (~2025, arXiv:2502.05171).
- LLM self-reports echo training distributions rather than introspection; models lack stable self-knowledge and shift beliefs under conversational pressure (~2025–2026, arXiv:2506.05068, 2510.24797).
- Self-distillation makes outputs *more* confident-sounding while making them *worse* indicators of actual certainty.

Anchor papers (verify; mind their dates):
- arXiv:2603.24472 (2026) — Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?
- arXiv:2502.05171 (2025) — Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
- arXiv:2506.05068 (2025) — Does It Make Sense to Speak of Introspection in Large Language Models?
- arXiv:2510.24797 (2025) — Large Language Models Report Subjective Experience Under Self-Referential Processing

Your task:
(1) RE-TEST EACH CONSTRAINT. Does the gap between confident *sound* and actual *certainty* persist in post-2026 models, or have techniques (e.g., supervised RL on step-wise reasoning, arXiv:2510.25992; post-completion learning, arXiv:2507.20252; context-to-skills transfer, arXiv:2604.27660) now decoupled verbalization from training artifacts? Does latent reasoning fully obviate the need for epistemic markers, or do markers still improve interpretability and mid-stream correction?
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Has any recent study shown self-distillation *preserves* epistemic signals under certain training regimes, or demonstrated that confident-sounding outputs now reliably correlate with actual certainty?
(3) Propose 2 research questions that assume the regime may have moved:
   - Can models be distilled while *selectively preserving* epistemic tokens via contrastive losses on uncertainty markers?
   - Does grounding external knowledge (arXiv:2506.08952) or multi-step reasoning scaffolds (arXiv:2510.25992) decouple distillation's brevity gain from reasoning degradation?

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Next inquiring lines