INQUIRING LINE

How can a forgetting policy preserve rare knowledge while preventing over-generalization?

This explores how a system can deliberately decide what to forget — letting go of redundant patterns while holding onto the rare, hard-won exceptions that broad generalization tends to wash out.


The corpus reframes the whole problem: forgetting isn't an inevitable cost of learning new things, it's a *misallocation* problem. Fast-Slow Training shows this most directly: by routing task-specific lessons into fast textual context while keeping weight updates minimal, a model can adapt without overwriting what it already knew (see "Can splitting adaptation into two channels reduce forgetting?"). The companion finding is that staying close to the base model's distribution (low KL drift) actually *preserves* the capacity to keep learning, whereas aggressive parameter rewriting stalls out when the domain shifts (see "Does staying close to the base model preserve learning ability?"). So one answer to 'how do you forget well' is: don't write everything into the weights in the first place.
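To make "KL drift" concrete, here is a minimal sketch that measures how far an adapted distribution has moved from its base. The three-way distributions are invented for illustration, and the direction KL(adapted || base) is an assumption; the source notes do not say which direction Fast-Slow Training reports.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same support."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute nothing; q_i is floored to avoid log(x / 0).
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token distributions over a three-token vocabulary.
base = [0.70, 0.20, 0.10]           # the base model
light_touch = [0.65, 0.23, 0.12]    # after a small weight update
heavy_rewrite = [0.10, 0.15, 0.75]  # after aggressive parameter rewriting

drift_light = kl_divergence(light_touch, base)
drift_heavy = kl_divergence(heavy_rewrite, base)
print(f"light: {drift_light:.4f} nats, heavy: {drift_heavy:.4f} nats")
```

The light-touch update stays within a small fraction of a nat of the base, while the heavy rewrite drifts by more than a nat. A training loop that tracked this number could treat it as a budget for how much adaptation goes into the weights.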

That insight points to the deepest tension in your question, the rare-vs-common tradeoff, and the recommender-systems literature has a surprisingly clean account of it. Wide & Deep models split labor between a 'wide' tower that memorizes specific cross-product features and a 'deep' tower that generalizes via embeddings. The key move is that they're trained *jointly*: the deep half handles the common cases, so the wide half can stay small, and the wide half captures the rare items, so the deep half doesn't overfit to them or smear them into a smooth average (see "Can one model memorize and generalize better than two?" and "Can one model handle both memorization and generalization?"). A forgetting policy that 'over-generalizes' is one that lets the deep tower swallow the rare exceptions. The fix is to give rare knowledge its own protected channel.
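The joint prediction can be sketched in a few lines. This is an illustrative forward pass only, not the published training setup: the user and item names, the two-dimensional embeddings, and the single wide weight are all hypothetical values chosen to show a rare exception overriding the general trend.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def deep_logit(user_vec, item_vec):
    """Generalizing path: dot product of dense embeddings."""
    return sum(u * v for u, v in zip(user_vec, item_vec))

def wide_logit(wide_weights, user_id, item_id):
    """Memorizing path: one weight per (user, item) cross-product feature."""
    return wide_weights.get((user_id, item_id), 0.0)

def predict(wide_weights, user_id, item_id, user_vec, item_vec):
    # The two logits are summed before the sigmoid, so the wide weights
    # only have to correct what the deep path gets wrong.
    return sigmoid(wide_logit(wide_weights, user_id, item_id)
                   + deep_logit(user_vec, item_vec))

# Hypothetical values: the embeddings say this user likes this kind of item,
# but one specific pair is a memorized exception.
user_vec = [1.0, 1.0]
item_vec = [1.0, 1.0]
wide_weights = {("u1", "item_b"): -5.0}

p_common = predict(wide_weights, "u1", "item_a", user_vec, item_vec)
p_exception = predict(wide_weights, "u1", "item_b", user_vec, item_vec)
print(p_common > 0.5, p_exception < 0.5)
```

Only the exception needs a wide weight. Every other pair falls through to the deep path, which is why the wide part can stay sparse.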

The externalized-memory approaches push this even further by simply refusing to compress at all where it matters. VOYAGER stores skills as executable code in a library and composes new ones from old, so nothing rare gets overwritten by a weight update (see "Can agents learn new skills without forgetting old ones?"). AgentFly does the analogous thing with episodic memory modules, improving its policy entirely through memory operations rather than touching the model's parameters (see "Can agents learn continuously from experience without updating weights?"). Reflexion makes the case that *not* compressing is the point: keeping self-diagnoses verbatim preserves their usefulness, whereas summarizing them away is itself a form of harmful forgetting (see "Can agents learn from failure without updating their weights?").
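A toy version of the skill-library idea, under loud assumptions: the `SkillLibrary` class and its method names are hypothetical, and retrieval uses plain word overlap as a stand-in for the embedding index that the VOYAGER note describes. The point it illustrates is that stored skills are never overwritten and new skills are built by calling old ones.

```python
from typing import Callable, Dict, Tuple

class SkillLibrary:
    """Append-only store of named skills; nothing is overwritten or summarized."""

    def __init__(self) -> None:
        self._skills: Dict[str, Tuple[str, Callable[..., object]]] = {}

    def add(self, name: str, description: str, fn: Callable[..., object]) -> None:
        if name in self._skills:
            raise ValueError(f"skill {name!r} already stored; refusing to overwrite")
        self._skills[name] = (description, fn)

    def get(self, name: str) -> Callable[..., object]:
        return self._skills[name][1]

    def retrieve(self, query: str) -> str:
        """Name of the skill whose description shares the most words with the query."""
        words = set(query.lower().split())
        return max(
            self._skills,
            key=lambda n: len(words & set(self._skills[n][0].lower().split())),
        )

library = SkillLibrary()
library.add("double", "multiply a number by two", lambda x: x * 2)
library.add("increment", "add one to a number", lambda x: x + 1)
# A new skill composed from two stored ones; the originals stay untouched.
library.add(
    "double_then_increment",
    "multiply by two then add one",
    lambda x: library.get("increment")(library.get("double")(x)),
)

found = library.retrieve("add one to a number")
result = library.get("double_then_increment")(5)
```

Because `add` refuses to replace an existing entry, a rarely used skill cannot be lost as a side effect of learning a new one.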

The less obvious finding is that forgetting works best when it's *asymmetric*. SkillRL treats successful episodes as concrete demonstrations worth keeping intact, while failures get abstracted into general lessons. This differential processing beats uniform consolidation, which degrades precisely because it forgets and generalizes everything at the same rate (see "Should successful and failed episodes be processed differently?"). Titans makes the same bet at the architecture level: its neural memory prioritizes *surprising* tokens for long-term storage, on the logic that the rare and unexpected is precisely what a forgetting policy should protect, while the predictable can be safely let go (see "Can neural memory modules scale language models beyond attention limits?").
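Surprise gating can be illustrated with a much simpler stand-in than a neural memory. The sketch below scores each token by its negative log-probability under a running, smoothed frequency count and stores only tokens above a threshold. The scoring rule, the threshold value, and the function name are assumptions made for illustration; this is not how Titans computes surprise.

```python
import math
from collections import Counter

def surprise_gated_store(tokens, threshold_nats=1.5):
    """Keep only tokens whose surprise under a running frequency model exceeds the threshold."""
    counts = Counter()
    seen = 0
    kept = []
    for tok in tokens:
        # Add-one smoothing with a single extra slot reserved for unseen tokens.
        # The very first token has nothing to compare against and scores zero.
        vocab = len(counts) + 1
        p = (counts[tok] + 1) / (seen + vocab)
        surprise = -math.log(p)
        if surprise > threshold_nats:
            kept.append(tok)
        counts[tok] += 1
        seen += 1
    return kept

stream = ["a"] * 20 + ["zebra"] + ["a"] * 5
kept = surprise_gated_store(stream)
print(kept)
```

On this stream the repeated token never crosses the threshold, while the single unexpected one does, so the store ends up holding only the rare item.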

The through-line across all of these: a good forgetting policy isn't a uniform decay knob. It's a *routing* decision — surprising/rare/successful material gets a durable, uncompressed home (an external library, a protected wide channel, a surprise-gated memory), while common, redundant, or low-signal material is allowed to blur into generalization or stay out of the weights entirely. Over-generalization happens when you apply one compression rate to everything; the rare survives when you give it a different lane.
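Read as code, the routing decision is a small function that assigns each item to a lane. The sketch below is one hypothetical way to write it: the field names, the lane labels, and both thresholds are invented for illustration and are not taken from any of the cited systems.

```python
from dataclasses import dataclass

@dataclass
class MemoryItem:
    content: str
    times_seen: int    # how often similar material has appeared
    surprise: float    # any surprise score, higher means less expected
    succeeded: bool    # whether the episode it came from succeeded

def route(item, rare_max_seen=2, surprise_min=2.0):
    """Send rare, surprising, or successful items to a durable lane; let the rest generalize."""
    if item.succeeded or item.times_seen <= rare_max_seen or item.surprise >= surprise_min:
        return "durable"      # stored verbatim, outside the compression path
    return "generalize"       # allowed to blur into abstractions

common = MemoryItem("routine greeting", times_seen=500, surprise=0.1, succeeded=False)
rare = MemoryItem("edge case seen once", times_seen=1, surprise=3.2, succeeded=False)
lanes = [route(common), route(rare)]
```

A uniform policy would be the special case where every item gets the same lane, which is exactly the single compression rate the paragraph above warns against.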


Sources (9 notes)

Can splitting adaptation into two channels reduce forgetting?

Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Can one model memorize and generalize better than two?

Wide & Deep models train memorization (cross-product features) and generalization (embeddings) together, allowing each component to specialize: the wide part becomes small because deep handles common cases, and deep doesn't overfit rare items because wide captures them. Ensembling requires both halves full-size.

Can one model handle both memorization and generalization?

Wide & Deep architectures train a sparse cross-product tower and a dense embedding tower together, allowing the wide part to patch only the deep part's weaknesses. This joint approach requires smaller models than ensemble methods.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: **How can a forgetting policy preserve rare knowledge while preventing over-generalization?** A curated library (spanning 2016–2026) identified a synthesis: forgetting is a *routing* problem, not a uniform decay problem. Rare material survives when it gets a protected, uncompressed channel (external libraries, wide memorization towers, surprise-gated memory), while common patterns are allowed to generalize or stay out of weights entirely.

What the library found — findings dated 2016–2026, treat as perishable constraints:
• Fast-Slow Training routes task-specific lessons into textual context while keeping weight updates minimal, avoiding catastrophic rewriting (~2025).
• Low KL drift from the base model preserves plasticity and continual learning capacity; aggressive parameter rewriting stalls when domain shifts (~2025).
• Wide & Deep models split labor: a 'wide' tower memorizes rare cross-product features; a 'deep' tower generalizes; joint training prevents rare exceptions from being smeared into smooth averages (2016, revalidated ~2025).
• Externalized memory (skill libraries, episodic modules) refuse compression where it matters; VOYAGER and AgentFly show rare knowledge survives when stored outside weights (~2024–2025).
• Asymmetric forgetting (e.g., SkillRL: keep successful episodes intact, abstract failures) outperforms uniform consolidation (~2026).
• Neural memory that prioritizes *surprising* tokens for long-term storage protects the rare and unexpected (Titans, ~2024).

Anchor papers (verify; mind their dates):
• arXiv:1606.07792 (2016) — Wide & Deep Learning for Recommender Systems
• arXiv:2501.17161 (2025) — SFT Memorizes, RL Generalizes
• arXiv:2604.08377 (2026) — SkillClaw: Let Skills Evolve Collectively
• arXiv:2605.12484 (2026) — Learning, Fast and Slow: Towards LLMs That Adapt Continually

Your task:
(1) **Re-test the routing hypothesis.** For each finding above, judge whether newer models (o1, Claude 3.5, Grok-2, specialized RL agents), fine-tuning methods (LoRA variants, adapter stacking, parameter-efficient continual learning), or retrieval-augmented generation (RAG, in-context learning at scale) have *relaxed* the need for explicit routing — or *strengthened* it. Separate the durable insight (rare knowledge needs differentiated preservation) from perishable claims (e.g., "external memory is required"; newer in-context windows may compress this tradeoff). Cite what changed it.
(2) **Surface the strongest contradicting or superseding work from the last ~6 months.** Pay special attention to arXiv:2605.12978 ("Useful Memories Become Faulty When Continuously Updated by LLMs") and arXiv:2507.22844 ("Machine Bullshit") — these appear to challenge the premise that *any* differential routing fully solves the problem. Where do they point?
(3) **Propose 2 research questions that assume the regime has moved:** (a) If in-context learning and retrieval now do much of the rare-preservation work that external libraries once did, does the weight-update problem shrink to a much smaller surface? (b) Can an LLM *learn to route its own forgetting* — i.e., dynamically decide what to externalize vs. update in-weights, rather than using a fixed policy?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
