SYNTHESIS NOTE

Can splitting adaptation into two channels reduce forgetting?

When language models adapt to new tasks, does separating task-specific learning (via prompt context) from persistent parameter updates help preserve both generalization ability and the model's original capabilities?

Synthesis note · 2026-05-28 · sourced from Training Fine Tuning

Treating parameter updates as the sole mechanism of adaptation creates a bottleneck: every improvement, whether a reusable reasoning skill, a task heuristic, or a transient lesson from recent rollouts, has to be written into the same persistent weights. Because the whole policy lives in those weights, any update that raises in-domain reward also pulls the model away from its base behavior. That drift reduces entropy, hurts out-of-distribution generalization, and erodes the model's ability to adapt to future tasks (plasticity loss).
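One way to make the pull away from base behavior concrete is to compare a tuned policy's output distribution with the base model's. The sketch below is illustrative only: the two distributions are invented numbers over four hypothetical actions, and using KL divergence as the drift measure is an assumption borrowed from a related note in this collection, not something this paragraph specifies.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q is positive wherever p is positive."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

# Invented example distributions over four hypothetical actions.
base = [0.40, 0.30, 0.20, 0.10]    # base model: fairly spread out
tuned = [0.85, 0.10, 0.04, 0.01]   # after reward-driven updates: sharply peaked

print("entropy(base)  =", entropy(base))
print("entropy(tuned) =", entropy(tuned))
print("KL(tuned||base) =", kl_divergence(tuned, base))
```

In this made-up case the tuned distribution has lower entropy than the base and a positive KL divergence from it, which is the pattern the paragraph describes.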

Fast-Slow Training resolves this by refusing to make weights carry everything. It splits adaptation into a slow parametric component (model weights, expensive to update, holding long-lived behavior) and a fast textual component (prompts, instructions, and task context, optimized via reflective prompt evolution with GEPA). The fast channel absorbs task-specific and rapidly changing information from textual feedback; the slow channel consolidates only persistent behavior and stays closer to the base model. Interleaving the two (RL updates plus context optimization) reaches matched performance with 1.4–3x fewer optimizer steps and a higher asymptote, while leaving the weights far closer to the base model.
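The interleaving schedule can be sketched with a toy problem. Everything below is a hypothetical stand-in, not the method's implementation: one float plays the role of the weights, another plays the role of the optimized prompt, a gradient step stands in for the RL update, and a three-candidate search stands in for reflective prompt evolution (it is not GEPA). The names State, reward, slow_step, fast_step, and train are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class State:
    weight: float   # slow channel: stand-in for model parameters
    context: float  # fast channel: stand-in for an optimized prompt

def reward(state, target):
    # Toy reward: behavior is weight + context, best when it equals target.
    return -(state.weight + state.context - target) ** 2

def slow_step(state, target, lr=0.05):
    # Gradient ascent on the reward with respect to the weight only.
    grad = -2.0 * (state.weight + state.context - target)
    return State(state.weight + lr * grad, state.context)

def fast_step(state, target, step=0.5):
    # Candidate search over contexts: keep the best of three nearby options.
    candidates = [state.context - step, state.context, state.context + step]
    best = max(candidates, key=lambda c: reward(State(state.weight, c), target))
    return State(state.weight, best)

def train(target, rounds, use_fast):
    start = 0.0  # the "base model" weight
    state = State(weight=start, context=0.0)
    for _ in range(rounds):
        if use_fast:
            state = fast_step(state, target)  # fast channel moves first
        state = slow_step(state, target)      # slow channel consolidates
    drift = abs(state.weight - start)         # distance from the base weight
    return state, drift

if __name__ == "__main__":
    for use_fast in (False, True):
        final, drift = train(target=3.0, rounds=20, use_fast=use_fast)
        label = "fast+slow" if use_fast else "slow only"
        print(label, "reward:", reward(final, 3.0), "weight drift:", drift)
```

With target 3.0 and 20 rounds, the interleaved run ends with higher toy reward and with the weight much closer to its starting value than the weights-only run. The toy context has unlimited range, which a real prompt does not.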

Why it matters: it reframes catastrophic forgetting as a misallocation problem rather than an inherent cost of learning. Forgetting happens because we force weights to store things that do not belong there. Route transient and task-specific information into context, and the weights stay general, so there is less to forget. This is a division-of-labor argument: the two channels operate at different timescales (an echo of System 1 vs System 2), and each does what it is suited for. The counterpoint is that the fast channel's capacity is bounded by context length and prompt-optimization quality, so genuinely large bodies of new knowledge still have to land in weights eventually.

Inquiring lines that read this note 53

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Why do continual learning scenarios trigger catastrophic forgetting and interference?

What memory architectures best support persistent reasoning across extended interactions?

What articulatory information do speech signals carry that text cannot?

What makes multimodal conditioning effective when features are decomposed to the right granularity?

How does AI adoption affect human skill development and labor equality?

Does narrow reallocation to remaining tasks constitute genuine adaptation?

Can prompting inject entirely new knowledge into language models?

How does sequence length affect sparsity tolerance in models?

Do task-relevant parameter changes naturally concentrate in sparse regions?

What determines success in training models on multiple tasks?

How should inference compute be adaptively allocated based on prompt difficulty?

How does memorization interact with learning and generalization?

How much does memorization capacity limit a model's ability to learn new information?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

How do training priors constrain what context information can override?

Do language models learn genuine linguistic structure or just surface patterns?

Can fast-slow separation improve both memory and generation in language models?

When should retrieval-augmented systems decide to fetch new information?

Can context windows and RAG actually change what language models generate?

How do transformer attention mechanisms implement memory and algorithmic functions?

How do neural memory modules extend context length beyond attention limits?

Why does finetuning cause catastrophic forgetting of model capabilities?

How should retrieval systems optimize for multi-step reasoning during inference?

Can the same description-then-retrieve pattern work for domain adaptation without target data?

Why does consolidated memory sometimes degrade agent performance?

Why does uniform memory consolidation sometimes degrade below the no-memory baseline?

Do base models contain latent reasoning that training can unlock?

Can auxiliary modules preserve reasoning without catastrophic forgetting?

Why does reinforcement learning suppress output diversity compared to supervised fine-tuning?

Can decoding-time prompting strategies fully replace diversity-focused training methods?

Which computational strategies best support reasoning in language models?

Can a trained decoder replace both search and parameter updates?

Can next-token prediction alone produce genuine language understanding?

What makes token selection more important than adaptation strategy?

Related concepts in this collection 6

Can prompt optimization teach models knowledge they lack? Explores whether sophisticated prompting techniques can inject new domain knowledge into language models, or if they're limited to activating existing training knowledge.
the boundary condition on the fast channel: context optimization activates and steers but cannot store genuinely new knowledge, which is why slow weights remain necessary
Does prompt optimization without inference strategy fail? Standard practice optimizes prompts and inference strategies separately. But do prompts optimized for single-shot evaluation actually perform worse when deployed at scale with aggregation methods like majority voting?
reinforces the interleaving design: the fast (prompt) channel and the slow (weight) channel must be co-optimized, not optimized separately
Can agents adapt without pausing service to users? Can deployed LLM agents continuously improve their capabilities while serving users without interruption? This explores whether fast behavioral updates and slow policy learning can coexist across different timescales.
same fast/slow dual-timescale architecture in the agent setting; convergent design from a different angle
Can continuous reasoning avoid forgetting in instruction-tuned models? Full fine-tuning for continuous-space reasoning degrades performance in capable instruction-tuned models. Why does this happen, and can architectural changes prevent it?
alternative forgetting-avoidance strategy: offload to an auxiliary module rather than to textual context, but the same principle of keeping the base weights untouched
Can agents learn new skills without forgetting old ones? Explores whether externalized skill libraries—storing learned behaviors as retrievable code rather than parameter updates—can solve the catastrophic forgetting problem that plagues continual learning systems.
another non-weight store for accumulating skills, supporting the general claim that adaptation should not all flow through parameters
Does staying close to the base model preserve learning ability? Explores whether limiting how far training pushes a model from its base distribution (measured by KL divergence) helps it learn new tasks more effectively over time, and why that trade-off matters for continual learning.
grounds: the mechanism behind the slow channel's payoff — keeping weights near the base (low KL drift) is precisely what preserves plasticity and reduces forgetting
