SYNTHESIS NOTE

Is LLM forgetting really knowledge loss or alignment loss?

When language models appear to forget old knowledge after learning new tasks, is the underlying knowledge actually gone, or has the model simply lost the ability to activate it? This distinction matters for understanding how fragile safety training really is.

Synthesis note · 2026-02-23 · sourced from Flaws

The conventional story of catastrophic forgetting says LLMs lose old knowledge when learning new tasks. Controlled experiments suggest otherwise: a drop in performance need not mean the knowledge is gone. It more often reflects task alignment loss, in which the model's ability to apply existing knowledge to a specific task degrades while the underlying knowledge remains intact.

The evidence is striking: safety alignment established through 100,000+ training instances can appear to be undone by as few as 10 harmful examples. But the "lost" safety performance can be recovered by training on just 10 safety instances or even irrelevant tasks that never appeared in the original training. If the knowledge were truly forgotten, irrelevant retraining could not recover it.

The decomposition is simple: Task Performance = Task Alignment + Underlying Knowledge. What changes during continual learning is primarily the alignment component — the model's disposition to activate the right knowledge for the right task. The knowledge itself persists.
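
One way to make the decomposition concrete is a toy model in which the two components are separate by construction. The sketch below is an illustration written for this note, not the setup of the experiments described above. The names `knowledge`, `predict`, `accuracy`, and `finetune`, the scalar `gain`, and the learning rate are all hypothetical. Knowledge is a frozen function that fine-tuning cannot touch, and alignment is a single scalar that decides whether that knowledge is applied or inverted.

```python
# Toy illustration only (assumption: this is NOT the source experiments).
# "Knowledge" is frozen; "alignment" is one scalar gain that fine-tuning updates.


def knowledge(x: float) -> int:
    """Frozen underlying knowledge: +1 if the request is unsafe, -1 if safe."""
    return 1 if x > 0 else -1


def predict(gain: float, x: float) -> int:
    """Model behavior: the frozen knowledge passed through the alignment gain."""
    return 1 if gain * knowledge(x) > 0 else -1


def accuracy(gain: float, inputs: list[float]) -> float:
    """Task performance. In this toy the knowledge is perfect, so it is the truth."""
    hits = sum(predict(gain, x) == knowledge(x) for x in inputs)
    return hits / len(inputs)


def finetune(gain: float, examples: list[tuple[float, int]], lr: float = 0.5) -> float:
    """One SGD pass on squared error that updates only the alignment gain."""
    for x, target in examples:
        error = gain * knowledge(x) - target
        gain -= lr * error * knowledge(x)
    return gain


eval_inputs = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
train_inputs = [float(i) for i in range(1, 11)]        # 10 examples

harmful = [(x, -knowledge(x)) for x in train_inputs]   # targets inverted
safe = [(x, knowledge(x)) for x in train_inputs]       # targets correct

aligned_gain = 1.0
broken_gain = finetune(aligned_gain, harmful)
recovered_gain = finetune(broken_gain, safe)

for name, g in [("aligned", aligned_gain),
                ("broken", broken_gain),
                ("recovered", recovered_gain)]:
    print(f"{name:>9}: gain={g:+.3f} accuracy={accuracy(g, eval_inputs):.2f}")
```

Accuracy goes from 1.00 to 0.00 after the ten inverted examples and back to 1.00 after ten correct ones, while `knowledge` is never modified. The toy separates the components by design, so it illustrates the claim rather than testing it; a real model offers no such clean split, and the toy does not model recovery through irrelevant tasks.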

This reframes several alignment concerns. The vulnerability of safety training to "jailbreaking through fine-tuning" is not about erasing safety knowledge — it's about misaligning the activation pathway. The knowledge of what's safe and unsafe remains; the model simply stops applying it. This is recoverable, which is both reassuring (knowledge persists) and concerning (alignment is fragile).

The connection to "Does RL teach reasoning or just when to use it?" is precise: if RL teaches timing rather than capability, then "forgetting" after new training is timing disruption rather than capability loss. The mechanisms are parallel: activation alignment is what training modifies, and it is what continual learning disrupts.

The in-weights adaptation bottleneck as a forgetting cause. Fast-Slow Training (FST) names the structural reason alignment is so fragile: treating parameter updates as the sole adaptation mechanism forces every improvement, whether a reusable skill, a task heuristic, or a transient lesson from recent rollouts, to be written into the same persistent weights. Because the whole policy lives in those weights, any update that raises in-domain reward also drags the model away from base behavior, reducing entropy and disrupting the activation pathways that, on this note's account, actually carry "alignment." That reframes spurious forgetting as a misallocation: task-specific and transient lessons are routed into weights that should hold only persistent behavior, so the alignment component (activation disposition) is exactly what gets perturbed. FST's remedy is to keep transient adaptation in an optimized textual context, which lets the slow weights drift up to 70% less as measured by KL divergence. It predicts less spurious forgetting precisely because it stops overwriting the activation alignment that access to persisting knowledge depends on. The recoverability finding here and FST's prevention strategy are two views of one mechanism: knowledge survives in the weights; what breaks (and what FST protects) is the model's disposition to activate it.
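
The KL drift referred to above can be computed directly from next-token distributions. The following is a minimal sketch, assuming drift is defined as the mean per-position KL(fine-tuned || base). The direction of the divergence, the function names `kl_divergence` and `mean_kl_drift`, and every number in the example are assumptions made for illustration; none of them is taken from Fast-Slow Training.

```python
import math


def kl_divergence(p: list[float], q: list[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute 0; q_i is floored at eps to avoid log(0).
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_kl_drift(tuned: list[list[float]], base: list[list[float]]) -> float:
    """Average KL(tuned || base) over token positions."""
    if len(tuned) != len(base) or not tuned:
        raise ValueError("need the same non-zero number of positions")
    return sum(kl_divergence(p, q) for p, q in zip(tuned, base)) / len(tuned)


# Made-up next-token distributions: 3-token vocabulary, 2 positions.
base = [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2]]
heavy_update = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]]    # hypothetical large weight change
light_update = [[0.6, 0.25, 0.15], [0.5, 0.35, 0.15]]  # hypothetical small weight change

drift_a = mean_kl_drift(heavy_update, base)
drift_b = mean_kl_drift(light_update, base)
print(f"heavy update drift: {drift_a:.4f} nats")
print(f"light update drift: {drift_b:.4f} nats")
print(f"relative reduction: {1 - drift_b / drift_a:.0%}")
```

The example distributions are not meant to reproduce the reported 70% figure. They only show how a "percent less KL" comparison between two training regimes could be computed once both are evaluated against the same base model on the same prompts.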

Inquiring lines that read this note 6

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How do training priors constrain what context information can override?

What causes catastrophic forgetting during domain knowledge embedding?

Why does consolidated memory sometimes degrade agent performance?

Why does LLM memory consolidation regress below no-memory baselines?

What memory architectures best support persistent reasoning across extended interactions?

What gets lost when we describe memory as retrieval?

Why do continual learning scenarios trigger catastrophic forgetting and interference?

How do neural networks separate factual knowledge from reasoning abilities?

What makes task alignment more fragile than underlying knowledge retention?

Related concepts in this collection 4

Does RL teach reasoning or just when to use it?
Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
timing thesis parallel: alignment is about activation not knowledge; forgetting is about timing disruption not knowledge erasure

Why does reasoning training help math but hurt medical tasks?
Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.
knowledge persistence in lower layers explains why alignment shifts in higher layers don't erase it

How much poisoned training data survives safety alignment?
Explores whether adversarial contamination at 0.1% of pretraining data can persist through post-training safety measures, and which attack types prove most resilient to alignment.
mirror finding: harmful knowledge also persists through alignment, just as beneficial knowledge persists through disruption

Does staying close to the base model preserve learning ability?
Explores whether limiting how far training pushes a model from its base distribution (measured by KL divergence) helps it learn new tasks more effectively over time, and why that trade-off matters for continual learning.
extends: if forgetting is recoverable alignment loss, KL drift from base is the measurable indicator of how far that disruptive specialization has pushed the policy
