Does richer teacher context hurt student generalization?
When teachers are given more information during distillation, they produce confident but brittle students. Does this trade-off between in-domain wins and out-of-distribution robustness hold across different task distributions?
The self-distillation degradation finding has a clean causal story. When the teacher model is conditioned on richer information (the correct solution, access to a verifier, or other context that the student will not have at inference time), the reasoning trajectories it produces become more confident and more concise. The teacher knows the answer, so it does not bother to express uncertainty mid-trace. The student, distilling toward these traces, inherits the confident style.
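To make the conditioning difference concrete, here is a minimal sketch of how a teacher prompt might be assembled with and without privileged context. The `Problem` class, the `build_teacher_prompt` function, and the prompt wording are all hypothetical names chosen for illustration; nothing here is taken from the paper's actual pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Problem:
    """A training problem plus optional privileged fields (hypothetical schema)."""
    question: str
    solution: Optional[str] = None       # privileged: not available at inference time
    verifier_hint: Optional[str] = None  # privileged: e.g. feedback from a verifier


def build_teacher_prompt(problem: Problem, privileged: bool) -> str:
    """Assemble the teacher prompt, adding privileged fields only when asked."""
    parts = [f"Problem: {problem.question}"]
    if privileged:
        if problem.solution is not None:
            parts.append(f"Reference solution: {problem.solution}")
        if problem.verifier_hint is not None:
            parts.append(f"Verifier feedback: {problem.verifier_hint}")
    parts.append("Reason step by step, then state the final answer.")
    return "\n".join(parts)


problem = Problem(question="What is 17 * 24?", solution="408")
rich_prompt = build_teacher_prompt(problem, privileged=True)   # teacher sees the answer
lean_prompt = build_teacher_prompt(problem, privileged=False)  # teacher must work it out
```

The only difference between the two prompts is the privileged block, which is the variable the note argues shapes the style of the resulting traces.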
The pattern depends on two factors: information richness and task coverage. Richer teacher context → confident traces → suppressed epistemic verbalization → faster in-domain optimization. Under limited task coverage, the in-domain wins are real and visible: the model gets better at the narrow distribution it was trained on. As task coverage broadens, the missing uncertainty channel becomes a liability. Out-of-distribution problems benefit from expressing uncertainty and adjusting accordingly, and the confident-style student no longer has access to that adjustment mechanism.
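One rough way to observe suppressed epistemic verbalization is to count uncertainty markers per 100 words of a trace. The sketch below assumes a small hand-picked marker list (`UNCERTAINTY_MARKERS`) and a hypothetical `marker_density` helper; the markers are illustrative, not a validated lexicon, and a real study would need a more careful definition.

```python
import re

# Hypothetical marker list, chosen for illustration only.
UNCERTAINTY_MARKERS = (
    "wait", "hmm", "not sure", "let me check", "actually", "maybe", "perhaps",
)


def marker_density(trace: str) -> float:
    """Return uncertainty markers per 100 alphabetic words in the trace."""
    words = re.findall(r"[a-z']+", trace.lower())
    if not words:
        return 0.0
    # Re-join the normalized words so multi-word markers match across punctuation.
    text = " ".join(words)
    hits = sum(
        len(re.findall(rf"\b{re.escape(marker)}\b", text))
        for marker in UNCERTAINTY_MARKERS
    )
    return 100.0 * hits / len(words)


confident = "The answer is 408 because 17 times 24 equals 408."
hedged = "Maybe 400? Wait, let me check: 17 times 24 is 340 plus 68, so actually 408."

confident_density = marker_density(confident)
hedged_density = marker_density(hedged)
```

Comparing the average density of traces from differently conditioned teachers, or of a student before and after distillation, gives a crude proxy for how much of the uncertainty channel survives.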
This produces a counter-intuitive recommendation for distillation pipeline design. Standard intuition: give the teacher as much information as possible so it produces high-quality traces. The finding inverts this: the teacher's traces become too clean, optimized for cases where confidence is warranted, missing the uncertainty markers that help the student handle cases where confidence is not warranted.
A more robust approach lets the teacher operate with less privileged context, producing traces that include the natural pauses and self-corrections of reasoning under uncertainty. The resulting traces are messier, longer, less obviously "polished" — but they preserve the corrective signal that helps OOD performance.
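One way a pipeline might act on this is to cap the share of privileged-teacher traces in the distillation set. This is a sketch under stated assumptions: the `Trace` fields, the `build_distillation_set` function, and the default cap of 0.2 are hypothetical, and the note itself does not prescribe a particular mixing ratio.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Trace:
    text: str
    correct: bool     # did the trace reach the right final answer?
    privileged: bool  # True if the teacher saw the solution or a verifier


def build_distillation_set(
    traces: List[Trace],
    max_privileged_fraction: float = 0.2,  # assumed value, not from the source
    seed: int = 0,
) -> List[Trace]:
    """Keep correct traces, capping the share that came from a privileged teacher."""
    if not 0.0 <= max_privileged_fraction < 1.0:
        raise ValueError("max_privileged_fraction must be in [0, 1)")
    correct = [t for t in traces if t.correct]
    lean = [t for t in correct if not t.privileged]
    rich = [t for t in correct if t.privileged]
    # Solve rich / (rich + lean) <= f for the number of rich traces allowed.
    # With no lean traces the cap is 0, so the result is empty.
    cap = int(max_privileged_fraction * len(lean) / (1.0 - max_privileged_fraction))
    rng = random.Random(seed)
    rng.shuffle(rich)  # sample privileged traces without order bias
    return lean + rich[:cap]
```

All correct traces from the less-privileged teacher are kept, so the messier, self-correcting style dominates the mix while a limited number of polished traces remain.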
The deeper observation is that style transfer is part of distillation, not just correctness transfer. The student inherits the teacher's reasoning style, including how the teacher handles or hides uncertainty. Teacher conditioning shapes style, and style shapes generalization. Distillation pipelines that optimize teacher conditioning for correctness alone optimize against generalization without realizing it.
Inquiring lines that use this note as a source (37)
- Does extended exoskeleton use eventually produce meaningful skill transfer?
- Can self-distillation reduce catastrophic forgetting in continual learning?
- How does distributional distance from pre-training relate to model difficulty?
- Why do easy training examples contribute less to model generalization than hard ones?
- What makes asymmetric distillation effective for converting pretrained diffusion models?
- When does knowledge distillation produce student models superior to teachers?
- What conditions make training diversity better than individual expert quality?
- How does self-distillation differ from standard fine-tuning approaches?
- Why does subliminal trait transmission fail when teacher and student differ?
- How does training data distribution create asymmetric competence across relation types?
- How does training data distribution determine what models can learn?
- Why does mixing reasoning traces from different teachers destabilize learning?
- What makes certain bond distributions more learnable than others?
- How does inference variance differ from training entropy collapse?
- Why is offline knowledge distillation preferred when in-session signals matter?
- Why do student models learn better from internal pruning versus external compression?
- Why do weaker teacher models sometimes produce better training signals than stronger ones?
- What filtering criteria best identify student-compatible refinements from teacher models?
- How does distributional shift toward rare inputs change memorization reliance?
- Does training data format determine whether models collapse entropy or inflate variance?
- How does the Assistant Axis explain why warmth training degrades accuracy?
- How does information asymmetry between teacher and student create the learning signal?
- Why does teacher forcing fail to capture long-range dependencies?
- How should training data be constructed to preserve teacher-student information gaps?
- Why does self-distillation suppress epistemic verbalization in student models?
- How do complementary learning systems explain the need for fast and slow consolidation?
- What makes policy self-distillation more effective than external teacher distillation?
- How does uncertainty verbalization change student robustness across domains?
- Can teachers trained under uncertainty constraints distill better generalizing students?
- How do failure examples improve distillation compared to successful trajectories alone?
- How does self-distillation degrade reasoning by suppressing uncertainty signals?
- How can distillation preserve uncertainty expression instead of optimizing it away?
- Why does information asymmetry between teacher and student enable effective feedback learning?
- How does the pretraining distribution shape what LLMs find hard?
- How does upward distillation transfer knowledge from smaller to larger networks?
- Why does negative experience transfer better than positive examples alone?
- What makes some training data teach brittle answers versus robust reasoning?
Related concepts in this collection (4)
- Does self-distillation harm mathematical reasoning performance?
  Self-distillation usually improves models while shortening outputs, but mathematical reasoning shows a puzzling exception: performance drops up to 40%. What mechanism explains this counter-intuitive degradation?
  same paper, the mechanism this trade-off produces
- Can post-training objectives preserve reasoning style alongside correctness?
  Even mathematically sound training objectives may suppress reasoning behaviors like uncertainty expression without penalizing them. Does optimizing for answer correctness inadvertently degrade the stylistic features that enable generalization?
  same paper, the methodology implication
- What do models actually learn from chain-of-thought training?
  When models train on reasoning demonstrations, do they memorize content details or absorb reasoning structure? Testing with corrupted data reveals which aspects of CoT samples actually drive learning.
  adjacent: structural coherence drives learning more than content; here, structural uncertainty signals matter
- Can agents learn better from their failures than successes?
  Does storing reasoning strategies extracted from both successful and failed experiences improve agent learning compared to tracking only successes or raw trajectories? This matters because failures offer preventative lessons that successes alone cannot teach.
  partial tension: failures provide useful distillation signal; richer context may suppress visible failure modes
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?
- Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
- Post-Training Large Language Models via Reinforcement Learning from Self-Feedback
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
- Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning
- Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs
- LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!
- Automatic Prompt Augmentation and Selection with Chain-of-Thought from Labeled Data
Original note title
richer teacher context produces more confident shorter student traces — fast in-domain optimization at the cost of OOD robustness