SYNTHESIS NOTE

Do networks recover from forgetting before re-encountering documents?

When language models train cyclically on repeated documents, do they anticipate upcoming material and recover from forgetting in advance? An affirmative answer challenges the standard catastrophic-interference account of sequential training.

Synthesis note · 2026-06-03 · sourced from Knowledge Graphs

The default story of sequential training is catastrophic interference: as a network trains on a sequence of different documents, forgetting of the earlier ones increases monotonically. This paper studies a structured non-IID setting, in which documents are presented cyclically in a fixed, repeated order, and finds the opposite: anticipatory recovery. Networks recover from forgetting a document before they encounter it again in the cycle, as if pre-positioning themselves for what is coming. The effect emerges and becomes more robust as parameter count grows, and it appears only when each document is well fitted before training moves on to the next. Visualizations of weights, activations, and gradients show clear temporal structure.
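To make the setup concrete, here is a minimal Python sketch of the cyclic schedule and of one way to quantify recovery on a single probe document. The names `cyclic_schedule` and `recovered_fraction`, and the metric itself, are assumptions introduced for illustration; they are not taken from the paper. The loss values in the usage example are made up and are not results.

```python
from typing import List


def cyclic_schedule(num_docs: int, num_cycles: int) -> List[int]:
    """Document indices in a fixed order, repeated once per cycle."""
    return [doc for _ in range(num_cycles) for doc in range(num_docs)]


def recovered_fraction(probe_losses: List[float]) -> float:
    """Share of peak forgetting undone before the probe document returns.

    probe_losses holds the loss on one probe document between two visits:
    index 0 is measured right after training on it, and the last entry is
    measured right before it is encountered again. This is a hypothetical
    metric for illustration, not the paper's definition.
    """
    if len(probe_losses) < 2:
        raise ValueError("need at least two loss measurements")
    start = probe_losses[0]
    peak = max(probe_losses)
    final = probe_losses[-1]
    if peak == start:
        # The loss never rose above its starting value, so there is
        # no forgetting to recover from.
        return 0.0
    return (peak - final) / (peak - start)


if __name__ == "__main__":
    # Three documents, two passes through the same fixed order.
    print(cyclic_schedule(3, 2))
    # Made-up loss trace: rises after the visit, then falls before the next.
    print(recovered_fraction([0.5, 2.5, 1.5]))
    # Made-up monotonic trace, as the catastrophic-interference story predicts.
    print(recovered_fraction([0.5, 1.5, 2.5]))
```

Under this sketch, a monotonically rising trace scores 0.0, while any drop in loss before the document is revisited gives a positive score, which is the signature the note calls anticipatory recovery.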

The keeper is that over-parameterized networks in structured, repeating environments do not follow the catastrophic-interference picture: they exploit the temporal regularity of the training schedule and organize their weights in anticipation of what comes next. This is closer to how humans learn from structured, repeating material than the random-sampling default of LLM pretraining is.

This adds a training-dynamics surprise to the vault. It connects to the broader theme that structure in the learning process matters, alongside "Does teaching question patterns before document training improve knowledge access?" (order of encoding shapes outcomes) and "Is LLM forgetting really knowledge loss or alignment loss?" (forgetting is often recoverable rather than destructive). Both complicate the simple catastrophic-forgetting narrative.

Inquiring lines that read this note (9)

This note is a source for the following research framings, grouped by the broader line of inquiry each explores.

What memory architectures best support persistent reasoning across extended interactions?

How do training priors constrain what context information can override?

How does training order affect knowledge acquisition in language models?

Why do continual learning scenarios trigger catastrophic forgetting and interference?

What limits mechanistic interpretability's ability to characterize models?

How do weight visualizations reveal temporal structure in cyclic training?

Why does finetuning cause catastrophic forgetting of model capabilities?

Can we unlearn memorized text by finetuning only high-gradient weights?

How does memorization interact with learning and generalization?

Can document repetition accidentally memorize sensitive information instead of learning?

Related concepts in this collection (2)

Does teaching question patterns before document training improve knowledge access?
Standard LLM training encodes documents first, then teaches QA patterns. But does this order matter? This note explores whether reversing the sequence, teaching how knowledge gets queried before encoding it, could unlock better factual recall.
Link to this note: both show training *structure/order* shapes what the network learns and retains

Is LLM forgetting really knowledge loss or alignment loss?
When language models appear to forget old knowledge after learning new tasks, is the underlying knowledge actually gone, or has the model simply lost the ability to activate it? This distinction matters for understanding how fragile safety training really is.
Link to this note: both complicate the catastrophic-forgetting narrative; forgetting is structured and often recoverable
