INQUIRING LINE

Can self-distillation reduce catastrophic forgetting in continual learning?

This reads the question as: does training a model on its own outputs (self-distillation) help it learn new tasks without overwriting old ones — and the corpus actually pushes back on the premise.


The most direct evidence in the collection on self-distillation, meaning training a model on its own generated outputs, is a caution, not an endorsement. The one note that studies self-distillation head-on finds it can quietly degrade a model: by training on its own confident traces, the model stops producing the uncertainty markers ('Wait', 'Hmm') that flag flawed reasoning, trading robustness for confident brevity and losing the ability to self-correct on unfamiliar problems (see "Does self-distillation harm mathematical reasoning performance?"). A closely related finding shows the same trap from the teacher side: when the teacher is conditioned on correct answers, the distilled student inherits confident, concise traces that perform well in-domain but generalize worse out-of-distribution (see "Does richer teacher context hurt student generalization?"). So self-distillation isn't a neutral memory-preservation move; it actively reshapes what the model keeps and discards.
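One way to make the marker-suppression effect observable is to count uncertainty markers in reasoning traces before and after distillation. The sketch below is a minimal diagnostic, not the method used in the cited note: the function name `marker_rate`, the marker list (only 'Wait' and 'Hmm' are named in the source), and the toy traces are all assumptions made for illustration.

```python
import re

# Assumed marker list: the source names only "Wait" and "Hmm".
EPISTEMIC_MARKERS = ("wait", "hmm")


def marker_rate(traces, markers=EPISTEMIC_MARKERS):
    """Return epistemic markers per 100 whitespace-separated words."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(m) for m in markers) + r")\b",
        re.IGNORECASE,
    )
    total_words = sum(len(trace.split()) for trace in traces)
    if total_words == 0:
        return 0.0
    total_hits = sum(len(pattern.findall(trace)) for trace in traces)
    return 100.0 * total_hits / total_words


# Toy traces invented for illustration; real traces would come from the
# model before and after self-distillation.
before = ["Wait, that sign looks wrong. Hmm, let me redo the sum. The answer is 12."]
after = ["The sum is 12. The answer is 12."]

print(f"before: {marker_rate(before):.2f} markers per 100 words")
print(f"after:  {marker_rate(after):.2f} markers per 100 words")
```

A falling rate does not prove that reasoning got worse; it only signals that the style the notes associate with self-correction is disappearing, which is worth checking against out-of-distribution accuracy.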

What makes this interesting is that the corpus reframes catastrophic forgetting itself. Forgetting isn't an inherent cost of learning new things; it's a misallocation problem. Fast-Slow Training (FST) shows that if you route task-specific lessons into the prompt (fast, textual context) and keep parameter updates minimal (slow weights), you reach equivalent performance 1.4–3x faster with substantially less forgetting (see "Can splitting adaptation into two channels reduce forgetting?"). The mechanism behind why this works also clarifies the self-distillation risk: staying close to the base model's distribution, that is, keeping KL drift low, is what preserves the model's plasticity to keep learning. FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and models that drift far stall when the task domain shifts (see "Does staying close to the base model preserve learning ability?"). Self-distillation, by sharpening the model toward its own confident outputs, tends to increase drift in exactly the direction that costs you future adaptability.
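To make "KL drift" concrete, the sketch below averages the KL divergence between a tuned model's next-token distributions and the base model's over aligned positions. It is an illustration under stated assumptions: the direction KL(tuned || base), the function names, and the three-token toy distributions are choices made here, not details taken from the notes.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions on the same vocabulary."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute nothing; q_i is floored to avoid log(0).
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_kl_drift(tuned_dists, base_dists):
    """Average KL(tuned || base) over aligned token positions."""
    pairs = list(zip(tuned_dists, base_dists))
    if not pairs:
        raise ValueError("need at least one position")
    return sum(kl_divergence(t, b) for t, b in pairs) / len(pairs)


# Toy next-token distributions over a 3-token vocabulary (invented values).
base = [[0.5, 0.3, 0.2]]
mild = [[0.55, 0.27, 0.18]]       # small shift away from the base model
sharpened = [[0.9, 0.07, 0.03]]   # confident, peaked output

print("mild drift:     ", mean_kl_drift(mild, base))
print("sharpened drift:", mean_kl_drift(sharpened, base))
```

In this toy case the sharpened distribution drifts much further from the base than the mild one, which is the sense in which confidence-sharpening and low drift pull in opposite directions.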

The collection's strongest anti-forgetting strategies all share a different instinct: don't update the weights you want to protect. SoftCoT freezes the main LLM entirely and delegates new reasoning to a small auxiliary model, so pre-trained knowledge can't be overwritten (see "Can continuous reasoning avoid forgetting in instruction-tuned models?"). VOYAGER stores new skills in an external, executable library and composes them rather than baking them into weights (see "Can agents learn new skills without forgetting old ones?"). AgentFly goes furthest, achieving continual adaptation entirely through episodic memory operations with zero parameter updates (see "Can agents learn continuously from experience without updating weights?"). The pattern is consistent: the safest place to put new knowledge is somewhere other than the weights that already hold the old knowledge.
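The external-library idea can be sketched as a store where learning a skill means adding an entry, so earlier entries are never overwritten. This is a toy illustration, not VOYAGER's implementation: the `SkillLibrary` class and its methods are hypothetical, and bag-of-words cosine similarity stands in for the embedding index the note describes.

```python
import math
from collections import Counter


class SkillLibrary:
    """Toy external skill store: skills are added as entries, not weight updates."""

    def __init__(self):
        self._skills = {}  # name -> (description, callable)

    def add(self, name, description, fn):
        self._skills[name] = (description, fn)

    @staticmethod
    def _vec(text):
        # Bag-of-words stand-in for a learned embedding (an assumption here).
        return Counter(text.lower().split())

    @staticmethod
    def _cosine(a, b):
        dot = sum(a[w] * b[w] for w in a)
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def retrieve(self, query):
        """Return (name, callable) of the skill whose description best matches."""
        if not self._skills:
            raise LookupError("library is empty")
        q = self._vec(query)
        best = max(
            self._skills,
            key=lambda n: self._cosine(q, self._vec(self._skills[n][0])),
        )
        return best, self._skills[best][1]


lib = SkillLibrary()
lib.add("add_numbers", "add two numbers together", lambda x, y: x + y)
lib.add("reverse_text", "reverse a text string", lambda s: s[::-1])

# A later skill composed from an earlier one; nothing stored before is modified.
_, add_fn = lib.retrieve("add two numbers")
lib.add("double_number", "double a number by adding it to itself",
        lambda x: add_fn(x, x))

name, fn = lib.retrieve("add two numbers")
print(name, fn(2, 3))
```

Because each new skill is a new key, the old behaviour is still retrievable after later additions, which is the property the notes contrast with weight-update-based learning.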

There's one nuance worth pulling out, because it's where self-distillation could plausibly earn its keep. The failure of self-correction training isn't that models learn from themselves; it's that they learn from the *wrong* distribution. Supervised fine-tuning on offline correction traces fails because the errors in those traces don't match the errors the model actually makes at test time, whereas multi-turn online RL on the model's own live mistakes succeeds (see "Why does self-correction training on offline data fail?"). The lesson transfers: self-generated data is dangerous when it's confident, sanitized, and off-distribution, but useful when it's grounded in the model's real behavior. That is also why bidirectional RAG only writes self-generated answers back into its corpus after entailment and novelty checks (see "Can RAG systems safely learn from their own generated answers?").
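The write-back gate described above can be sketched as two checks that must both pass before a generated answer joins the corpus. Everything here is a placeholder: token overlap stands in for a real entailment model and a real novelty detector, and the function names and thresholds (`support_min`, `novelty_max`) are assumptions rather than values from the cited work.

```python
def token_set(text):
    return set(text.lower().split())


def support(answer, source):
    """Fraction of answer tokens found in the source (crude entailment stand-in)."""
    answer_tokens = token_set(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & token_set(source)) / len(answer_tokens)


def jaccard(a, b):
    """Jaccard overlap of token sets (crude near-duplicate detector)."""
    ta, tb = token_set(a), token_set(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def maybe_write_back(answer, sources, corpus, support_min=0.8, novelty_max=0.9):
    """Append answer to corpus only if it is supported by a source and is novel."""
    supported = any(support(answer, s) >= support_min for s in sources)
    novel = all(jaccard(answer, doc) < novelty_max for doc in corpus)
    if supported and novel:
        corpus.append(answer)
        return True
    return False


corpus = ["paris is the capital of france"]
sources = ["paris is the capital of france and its largest city"]

# Supported and new: written back.
print(maybe_write_back("paris is the largest city of france", sources, corpus))
# Not supported by any source: rejected.
print(maybe_write_back("berlin is the capital of germany", sources, corpus))
# Supported but a duplicate of an existing entry: rejected.
print(maybe_write_back("paris is the capital of france", sources, corpus))
```

The design point is that the gate sits between generation and storage, so an unsupported or redundant answer never becomes retrievable context for later queries.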

So the honest answer the corpus supports: self-distillation, as studied here, is more likely to *cause* a subtle form of forgetting — the loss of self-correction and out-of-distribution robustness — than to cure catastrophic forgetting. If your goal is continual learning without forgetting, the collection points you toward separating the storage of new knowledge from old (frozen backbones, fast context channels, external memory) and toward keeping KL drift low, rather than toward distilling the model into a more confident version of itself.


Sources (9 notes)

Does self-distillation harm mathematical reasoning performance?

Self-distillation reduces performance in mathematical reasoning by eliminating epistemic markers like "Wait" and "Hmm" tokens that flag flawed reasoning paths. These tokens enable self-correction on out-of-distribution problems, so removing them sacrifices robustness for confident brevity.

Does richer teacher context hurt student generalization?

Teachers conditioned on correct answers and verifier output produce confident, concise traces that students inherit. This style suppresses uncertainty expression, optimizing in-domain performance while degrading generalization to out-of-distribution problems that require epistemic caution.

Can splitting adaptation into two channels reduce forgetting?

Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Can continuous reasoning avoid forgetting in instruction-tuned models?

SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Why does self-correction training on offline data fail?

SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.

Can RAG systems safely learn from their own generated answers?

Systems can add generated answers to their retrieval corpus when outputs pass entailment verification, source attribution checks, and novelty detection. This prevents hallucinations from polluting future retrievals while allowing genuine knowledge accumulation.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a continual learning researcher. The question: Does self-distillation meaningfully reduce catastrophic forgetting in LLMs, or does it introduce hidden costs that outweigh the benefit?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026. The collection reports:
• Self-distillation degrades reasoning robustness by suppressing epistemic markers ('Wait', 'Hmm'), trading uncertainty awareness for confident brevity (2026).
• Distilled students inherit confident, concise traces that perform well in-domain but fail out-of-distribution; the teacher's conditioning shapes student generalization (2026).
• Catastrophic forgetting is a KL-drift problem: models staying close to base distribution retain plasticity; high drift stalls future task adaptation (2026).
• Anti-forgetting strategies converge on avoiding weight updates: frozen backbones (SoftCoT, 2025), external skill libraries (VOYAGER), and episodic memory with zero parameter updates (AgentFly) are all reported to avoid the forgetting seen in weight-update-based continual learning (2025–2026).
• Self-generated data helps only when grounded in live model behavior (RL on real errors) and validated (entailment checks); offline distillation on sanitized traces fails due to distribution mismatch (2024–2025).

Anchor papers (verify; mind their dates):
• arXiv:2603.24472 (2026) — Why Does Self-Distillation (Sometimes) Degrade Reasoning?
• arXiv:2605.12484 (2026) — Learning, Fast and Slow: Towards LLMs That Adapt Continually
• arXiv:2502.12134 (2025) — SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs
• arXiv:2409.12917 (2024) — Training Language Models to Self-Correct via Reinforcement Learning

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, determine whether (a) newer models, larger scale, or improved training have since relaxed the epistemic-marker suppression or out-of-distribution brittleness of self-distilled models; (b) whether continual learning benchmarks now show frozen-weight or external-memory methods hitting saturation or overfitting; (c) whether hybrid approaches (self-distillation *with* grounding checks, or distillation *into* external memory) have emerged to reconcile the tradeoff. Separate the durable insight (low KL drift matters for continual learning) from the perishable finding (self-distillation is net harmful).
(2) Surface the strongest CONTRADICTING work from the last ~6 months — any paper claiming self-distillation *does* solve continual forgetting, or showing frozen weights fail where adaptation does succeed.
(3) Propose 2 research questions that assume the regime may have moved: (a) Can self-distillation be salvaged if restricted to *uncertainty-preserving* objectives or to *auxiliary* models rather than the main weights? (b) Do multi-task or meta-learned distillation targets reduce the distribution mismatch that currently makes self-distillation harmful?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
