Can splitting adaptation into two channels reduce forgetting?
When language models adapt to new tasks, does separating task-specific learning (via prompt context) from persistent parameter updates help preserve both generalization ability and the model's original capabilities?
Treating parameter updates as the sole mechanism of adaptation creates a bottleneck: every improvement — a reusable reasoning skill, a task heuristic, even a transient lesson from recent rollouts — has to be written into the same persistent weights. Because the whole policy lives in those weights, any update that raises in-domain reward simultaneously drags the model away from its base behavior, reducing entropy, hurting out-of-distribution generalization, and eroding the model's ability to adapt to future tasks (plasticity loss).
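As a rough illustration of what "dragged away from its base behavior" and "reducing entropy" can mean in measurable terms, the sketch below compares two hypothetical next-token distributions. The four-token vocabulary and the probabilities are invented for illustration, and KL divergence is assumed here as the drift measure; none of these numbers come from the note.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

# Hypothetical distributions over a 4-token vocabulary (illustrative only).
base_policy = [0.40, 0.30, 0.20, 0.10]    # before any task-specific updates
tuned_policy = [0.85, 0.10, 0.04, 0.01]   # sharpened by in-domain reward

# Drift of the tuned policy away from the base policy.
drift = kl_divergence(tuned_policy, base_policy)

# Entropy lost while specializing.
entropy_drop = entropy(base_policy) - entropy(tuned_policy)

print(f"KL(tuned || base) = {drift:.3f} nats")
print(f"entropy drop      = {entropy_drop:.3f} nats")
```

In this toy case both quantities are positive: the sharpened policy has moved away from the base and has less entropy left to explore with.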
Fast-Slow Training resolves this by refusing to make weights carry everything. It splits adaptation into a slow parametric component (model weights, expensive to update, holding long-lived behavior) and a fast textual component (prompts, instructions, and task context, optimized via reflective prompt evolution with GEPA). The fast channel absorbs task-specific and rapidly changing information from textual feedback; the slow channel consolidates only persistent behavior and stays closer to the base model. Interleaving the two, with RL updates alongside context optimization, reaches matched performance with 1.4–3x fewer optimizer steps and a higher asymptote, while leaving the model far closer to its origin.
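The interleaving idea can be sketched with a deliberately tiny toy. This is not GEPA, not RL, and not the method's actual procedure: two scalars stand in for the channels, `w` for the slow weights and `c` for the fast context, and plain gradient steps on a squared error stand in for both optimizers. The function name `train`, the task targets, and the learning rates are all hypothetical. The toy only illustrates one half of the claim, that weights move less when a fast channel absorbs task-specific error; it says nothing about the step-count or asymptote results.

```python
def train(tasks, steps_per_task, slow_lr, fast_lr=None):
    """Toy model: prediction = w + c, loss = (w + c - target) ** 2."""
    w = 0.0  # slow channel ("weights"); 0.0 plays the role of the base model
    c = 0.0
    for target in tasks:
        c = 0.0  # fast channel ("context") starts fresh for every task
        for _ in range(steps_per_task):
            if fast_lr is not None:
                # Fast step: large update to the task-local context.
                c -= fast_lr * 2 * ((w + c) - target)
            # Slow step: small update to the persistent weights.
            w -= slow_lr * 2 * ((w + c) - target)
    return w, c

# Hypothetical sequence of task targets (illustrative values only).
TASKS = [2.0, -1.0, 3.0]

# Weights-only baseline: no fast channel, so every task is written into w.
w_only, c_only = train(TASKS, 50, slow_lr=0.05)

# Fast-slow: the context absorbs most of each task's error before w moves.
w_fs, c_fs = train(TASKS, 50, slow_lr=0.05, fast_lr=0.4)

drift_weights_only = abs(w_only - 0.0)
drift_fast_slow = abs(w_fs - 0.0)
final_error_weights_only = abs(w_only + c_only - TASKS[-1])
final_error_fast_slow = abs(w_fs + c_fs - TASKS[-1])

print(f"drift from base, weights only: {drift_weights_only:.3f}")
print(f"drift from base, fast-slow:    {drift_fast_slow:.3f}")
```

Both runs fit the final task, but in the weights-only run `w` ends up near the last target, while in the fast-slow run it stays close to its starting value because the per-task context carried the task-specific part.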
Why it matters: it reframes catastrophic forgetting as a misallocation problem rather than an inherent cost of learning. Forgetting happens because we force weights to store things that do not belong in weights. Route the transient and task-specific into context, and the weights stay general, so there is less to forget. This is a division-of-labor argument: the two channels operate at different timescales (an echo of System 1 vs System 2), and each does what it is suited for. The counterpoint is that the fast channel's capacity is bounded by context length and prompt-optimization quality, so genuinely large bodies of new knowledge still have to land in weights eventually.
Inquiring lines that use this note as a source (51)
This note is a source for these synthesized inquiries.
- Why does fine-tuning for continuous space cause catastrophic forgetting?
- Can continuum memory systems prevent catastrophic forgetting in neural networks?
- How should memory consolidation timing differ across multiple timescales?
- What makes multimodal conditioning effective when features are decomposed to the right granularity?
- Does narrow reallocation to remaining tasks constitute genuine adaptation?
- Can self-distillation reduce catastrophic forgetting in continual learning?
- How does prompt optimization differ from building persistent activation context?
- Do task-relevant parameter changes naturally concentrate in sparse regions?
- Why does full multi-task fine-tuning perform worse than sequential training?
- Can dynamic instance-specific prompt selection solve the generalization problem across tasks?
- How much does memorization capacity limit a model's ability to learn new information?
- What access constraints allow description-based adaptation but block conventional techniques?
- How does memorization capacity saturation trigger the grokking transition?
- Why does context information fail to override prior training associations?
- Why does fine-tuning change how models process retrieved context?
- How much can mitigation techniques like augmentation reduce priming without harming learning?
- How does dual-rate learning separate episodic and procedural memory in neural networks?
- Can fast-slow separation improve both memory and generation in language models?
- Can context windows and RAG actually change what language models generate?
- How do neural memory modules extend context length beyond attention limits?
- How does explicit exploratory prompting compare to fine-tuned reinforcement learning for in-context adaptation?
- When should full-parameter post-training be used instead of LoRA adaptation?
- Can episodic memory alone enable learning without parameter updates?
- How do layer-wise versus parameter-wise merging strategies affect information retention?
- How do retention gates regularize forgetting across different sequence model architectures?
- How does the functional separation of knowledge and reasoning affect adaptation methods?
- What deployment tradeoffs emerge between single-pass and multi-pass inference adaptation?
- How does pretrained knowledge constrain what adaptation strategies can achieve?
- Does parameter isolation per task enable online updates without retraining?
- How do prompting and activation steering relate as compression strategies?
- Can the same description-then-retrieve pattern work for domain adaptation without target data?
- How does context budget create tradeoffs between memory and skills?
- Can AI models retain knowledge across changing environments without catastrophic forgetting?
- What distinguishes data that generalizes broadly from task-specific memorization?
- What mechanism transfers explicit memories into parametric model weights?
- How do training associations override context information in language models?
- Can training on diverse related tasks be more efficient than task-specific training?
- What gets lost when we describe memory as retrieval?
- Why does specializing to one task make future task learning harder?
- How does KL regularization prevent both forgetting and adaptation loss?
- Can memory-based adaptation and gradient fine-tuning operate on complementary timescales?
- Why does uniform memory consolidation sometimes degrade below the no-memory baseline?
- Can auxiliary modules preserve reasoning without catastrophic forgetting?
- What limits the capacity of context-based fast adaptation channels?
- How does in-weights adaptation create spurious forgetting in models?
- How can a forgetting policy preserve rare knowledge while preventing over-generalization?
- How do adaptive memory modules compare to feedback-based working memory for long context?
- Can decoding-time prompting strategies fully replace diversity-focused training methods?
- Is forgetting in language models reversible or permanent knowledge loss?
- Can adaptive memory modules combine long-term filtering with short-term attention benefits?
- Why does adaptation concentrate in low-dimensional subspaces of weights or representations?
Related concepts in this collection (6)
- Can prompt optimization teach models knowledge they lack?
  Explores whether sophisticated prompting techniques can inject new domain knowledge into language models, or if they're limited to activating existing training knowledge.
  the boundary condition on the fast channel: context optimization activates and steers but cannot store genuinely new knowledge, which is why slow weights remain necessary
- Does prompt optimization without inference strategy fail?
  Standard practice optimizes prompts and inference strategies separately. But do prompts optimized for single-shot evaluation actually perform worse when deployed at scale with aggregation methods like majority voting?
  reinforces the interleaving design: the fast (prompt) channel must be co-optimized with whatever it is paired with, not tuned in isolation
- Can agents adapt without pausing service to users?
  Can deployed LLM agents continuously improve their capabilities while serving users without interruption? This explores whether fast behavioral updates and slow policy learning can coexist across different timescales.
  same fast/slow dual-timescale architecture in the agent setting; convergent design from a different angle
- Can continuous reasoning avoid forgetting in instruction-tuned models?
  Full fine-tuning for continuous-space reasoning degrades performance in capable instruction-tuned models. Why does this happen, and can architectural changes prevent it?
  alternative forgetting-avoidance strategy: offload to an auxiliary module rather than to textual context, but the same principle of keeping the base weights untouched
- Can agents learn new skills without forgetting old ones?
  Explores whether externalized skill libraries (storing learned behaviors as retrievable code rather than parameter updates) can solve the catastrophic forgetting problem that plagues continual learning systems.
  another non-weight store for accumulating skills, supporting the general claim that adaptation should not all flow through parameters
- Does staying close to the base model preserve learning ability?
  Explores whether limiting how far training pushes a model from its base distribution (measured by KL divergence) helps it learn new tasks more effectively over time, and why that trade-off matters for continual learning.
  grounds: the mechanism behind the slow channel's payoff; keeping weights near the base (low KL drift) is precisely what preserves plasticity and reduces forgetting
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Learning, Fast and Slow: Towards LLMs That Adapt Continually
- Spurious Forgetting in Continual Learning of Language Models
- A Survey on Post-training of Large Language Models
- Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
- SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs
- AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs
- Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs
- The AI Hippocampus: How Far are We From Human Memory?
Original note title
splitting adaptation into slow weights and fast textual context avoids catastrophic forgetting and plasticity loss