INQUIRING LINE

How does dual-rate learning separate episodic and procedural memory in neural networks?

This explores how 'dual-rate' learning — pairing a fast-changing memory channel with a slow-changing one — lets a network keep specific experiences (episodic) separate from general skills (procedural), and why that separation matters.


The cleanest example in the corpus is Latent-Thought Language Models, which couple fast local variational learning (per-input 'thought' vectors that adapt quickly) with slow global decoder learning (shared weights that change gradually). The fast channel captures what is specific to the moment; the slow channel accumulates what generalizes. Because the two update at different speeds, latent size becomes a scaling dimension independent of parameter count (Can latent thought vectors scale language models beyond parameters?).
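The two-speed scheme described above can be illustrated with a toy sketch. This is not the Latent-Thought training procedure (which uses variational inference over latent vectors and a learned decoder); it is a minimal stand-in under stated assumptions: a hypothetical scalar model `y ≈ w * x + z_i`, where `w` is the slow shared parameter and each `z_i` is a fast per-example latent held near zero by an L2 penalty that plays the role of a prior. The function name `dual_rate_fit` and every hyperparameter are illustrative, not taken from the source.

```python
def dual_rate_fit(xs, ys, fast_lr=0.4, slow_lr=0.01,
                  inner_steps=3, outer_steps=200, prior=1.0):
    """Toy dual-rate fit: fast per-example latents, slow shared weight.

    Per-example loss (an assumption for this sketch):
        0.5 * (w * x + z - y) ** 2 + 0.5 * prior * z ** 2
    """
    w = 0.0                 # slow channel: shared, changes gradually
    zs = [0.0] * len(xs)    # fast channel: one latent per example

    for _ in range(outer_steps):
        # Fast channel: several large steps per example on its own latent.
        for i, (x, y) in enumerate(zip(xs, ys)):
            z = zs[i]
            for _ in range(inner_steps):
                err = w * x + z - y
                z -= fast_lr * (err + prior * z)
            zs[i] = z

        # Slow channel: one small step on the shared weight, using the
        # residual that the latents did not absorb.
        grad_w = sum((w * x + z - y) * x
                     for x, y, z in zip(xs, ys, zs)) / len(xs)
        w -= slow_lr * grad_w

    return w, zs


if __name__ == "__main__":
    # Data roughly follows y = 2x, with small example-specific offsets.
    w, zs = dual_rate_fit([1.0, 2.0, 3.0, 4.0], [2.3, 3.8, 6.1, 7.9])
    print("shared weight:", round(w, 3))
    print("per-example latents:", [round(z, 3) for z in zs])
```

In this toy, the shared weight settles near the general trend while each latent soaks up part of its own example's deviation. The penalty on `z` matters: without it the fast channel could absorb everything and leave the slow channel nothing to learn.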

The two-speed split earns its keep once you look at what happens without it. When a single set of weights has to absorb both fast specifics and slow generalities, you get catastrophic forgetting: new learning overwrites old. Fast-Slow Training reframes that as a misallocation problem. Route the task-specific lessons into fast textual context (optimized prompts), keep the slow parameter updates minimal, and the model reaches equivalent performance 1.4–3x faster with substantially less forgetting (Can splitting adaptation into two channels reduce forgetting?). Same principle, different substrate: separate the rates and the fast material stops corrupting the slow material.

What makes this more than an engineering trick is that the brain appears to do something similar, and the corpus draws the map explicitly. The Complementary Learning Systems (CLS) framing lines up transformer weights with the slow-consolidating neocortex (procedural, distributed knowledge), retrieval/RAG stores with the fast-encoding hippocampus (episodic, rapid capture), and agentic state with prefrontal control (Can brain memory systems explain how LLMs should store knowledge?). Dual-rate learning is the computational echo of a fast 'remember this episode' system feeding a slow 'distill the pattern' system. Titans makes the architectural version concrete: short-term attention plus a separate long-term neural memory that selectively stores surprising tokens (Can neural memory modules scale language models beyond attention limits?).
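The idea of selectively storing surprising inputs can be sketched in a few lines. This is only a loose illustration, not the Titans update rule: here 'surprise' is assumed to be the absolute gap between what the memory would have predicted and what actually arrived, and the class name `SurpriseGatedMemory`, its dictionary store, and the threshold are all hypothetical.

```python
class SurpriseGatedMemory:
    """Toy long-term store that only writes when an input is surprising."""

    def __init__(self, threshold=1.0):
        self.threshold = threshold  # minimum surprise needed to write
        self.store = {}             # key -> last stored value

    def predict(self, key, default=0.0):
        # What the memory currently expects for this key.
        return self.store.get(key, default)

    def observe(self, key, value):
        """Write the value only if it deviates enough from the prediction.

        Returns True when a write happened, False otherwise.
        """
        surprise = abs(value - self.predict(key))
        if surprise > self.threshold:
            self.store[key] = value
            return True
        return False


if __name__ == "__main__":
    mem = SurpriseGatedMemory(threshold=1.0)
    for key, value in [("a", 0.5), ("a", 3.0), ("a", 3.4)]:
        print(key, value, "stored" if mem.observe(key, value) else "skipped")
```

The gate keeps routine inputs out of the long-term store, so capacity goes to the observations that the memory could not already predict.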

The episodic-versus-procedural distinction named in the question also shows up cleanly in how agents learn. Reflexion stores verbal self-reflections as episodic memory, concrete records of specific trials, and improves without ever touching the weights (Can agents learn from failure without updating their weights?). Meanwhile, procedural knowledge turns out to be what actually generalizes in reasoning: an analysis of 5 million pretraining documents shows reasoning draws on broad, transferable procedures, while factual recall leans on narrow, document-specific memorization (Does procedural knowledge drive reasoning more than factual retrieval?). These are the two memory types in different guises: episodes are concrete and local, procedures are abstract and shared.

The most striking wrinkle is that the separation works best when the two streams are treated *asymmetrically*, not just at different speeds. SkillRL keeps successful episodes as concrete demonstrations but compresses failures into abstracted lessons, so the win is stored episodically and the loss procedurally, and that asymmetry beats uniform consolidation (Should successful and failed episodes be processed differently?). RL training itself also moves through the two registers in sequence: a first phase that masters procedural execution, then a second in which strategic planning becomes the bottleneck (Does RL training follow a predictable two-phase learning sequence?). So 'dual-rate' is not only about speed. It is about giving specifics and generalities different jobs, different storage, and sometimes different processing entirely.
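The asymmetric treatment of wins and losses can be sketched as a small data structure. This is an illustration of the storage pattern only, not SkillRL's implementation: the class `AsymmetricMemory`, its fields, and the `summarize` callable (which in practice would likely be a model-generated abstraction) are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Set


@dataclass
class AsymmetricMemory:
    """Toy store: successes kept verbatim, failures compressed to lessons."""

    demonstrations: List[List[str]] = field(default_factory=list)
    lessons: Set[str] = field(default_factory=set)

    def record(self, trajectory: Sequence[str], succeeded: bool,
               summarize: Callable[[Sequence[str]], str]) -> None:
        if succeeded:
            # Episodic path: keep the full, concrete trajectory.
            self.demonstrations.append(list(trajectory))
        else:
            # Procedural path: keep only an abstracted lesson; the set
            # collapses repeated failures that yield the same lesson.
            self.lessons.add(summarize(trajectory))


def last_step_lesson(trajectory: Sequence[str]) -> str:
    # Hypothetical stand-in summarizer: blame the final action.
    return f"avoid ending on '{trajectory[-1]}'"


if __name__ == "__main__":
    mem = AsymmetricMemory()
    mem.record(["find key", "unlock door", "open door"], True, last_step_lesson)
    mem.record(["open door"], False, last_step_lesson)
    mem.record(["look around", "open door"], False, last_step_lesson)
    print(len(mem.demonstrations), "demonstration(s)")
    print(sorted(mem.lessons))
```

Storing failures as deduplicated lessons is one way to keep context small: many failed trajectories can collapse into a single reusable rule, while each success stays available as a concrete example.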


Sources (8 notes)

Can latent thought vectors scale language models beyond parameters?

Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.

Can splitting adaptation into two channels reduce forgetting?

Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.

Can brain memory systems explain how LLMs should store knowledge?

Research shows transformer weights function as a distributed neocortex for consolidated knowledge, RAG stores as hippocampal indexing for rapid encoding, and agentic state as prefrontal executive control. The CLS framework predicts why hybrid systems outperform single-tier approaches and identifies missing consolidation mechanisms that prevent memory integration.

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI research analyst. The question remains open: How do neural networks separate episodic (specific, transient) from procedural (general, durable) memory, and can we engineer that separation deliberately?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat each as a snapshot, not current ground truth.
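- Dual-rate learning couples fast local variational learning (per-input latents) with slow global decoder learning, creating scaling dimensions beyond parameter count (~2025, arXiv:2502.01567).
- Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3× faster with substantially less catastrophic forgetting (~2026; likely arXiv:2605.12484, verify).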
- Complementary Learning Systems maps transformer weights to slow neocortex (procedural, distributed), retrieval stores to fast hippocampus (episodic, rapid), and agentic state to prefrontal control (~2026, arXiv:2601.09113).
- Procedural knowledge in pretraining drives reasoning generalization; factual recall remains document-specific and narrow (~2024, arXiv:2411.12580).
- Asymmetric consolidation (storing wins as concrete demonstrations, losses as abstracted lessons) outperforms uniform consolidation (~2025, RL post-training work).
- RL training exhibits two-phase dynamics: procedural execution first, then strategic planning bottleneck (~2025, arXiv:2507.22844).

Anchor papers (verify; mind their dates):
- arXiv:2502.01567 (Latent Thought; 2025-02)
- arXiv:2411.12580 (Procedural Knowledge; 2024-11)
- arXiv:2601.09113 (AI Hippocampus; 2026-01)
- arXiv:2605.12484 (Learning Fast and Slow; 2026-05)

Your task:
(1) RE-TEST EACH CONSTRAINT. Does newer training (constitutional AI, process reward models, chain-of-thought scaling) relax the forgetting problem or the speed–stability tradeoff? Check whether meta-learning, adapter modules, or mixture-of-experts have superseded the dual-rate framing. Separate the durable question (how to engineer memory separation) from perishable claims (speed gains, specific parameter ratios).
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months—especially any showing unified learning surpasses dual-rate, or any that questions the hippocampus–neocortex analogy in transformers.
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) Under continual pretraining, does the episodic–procedural split remain stable or blur? (b) Can memory separation emerge *without* architectural dualism—via purely algorithmic routing?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
