Do long-term memory modules outperform consolidation into fast weights?
This explores whether keeping what a model learns in a separate, retrievable memory module beats folding that experience back into the model's weights (fine-tuning or 'consolidation').
The corpus leans, with caveats, toward keeping memory outside the weights. The strongest pattern across these notes is that agents can learn continuously without touching their parameters at all. Reflexion has agents write verbal self-diagnoses into episodic memory after each failure, improving across episodes with zero weight updates; notably, keeping those reflections uncompressed preserves their usefulness (see: Can agents learn from failure without updating their weights?). AgentFly pushes this further, formalizing the whole learning loop as memory operations over case, subtask, and tool modules, and reaching 87.88% on GAIA validation without modifying a single parameter (see: Can agents learn continuously from experience without updating weights?).
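The Reflexion-style loop described above can be sketched in a few lines. This is a minimal illustration, not Reflexion's actual implementation: `EpisodicMemory`, `run_trials`, and the `toy_*` functions are hypothetical names, and the toy functions stand in for what would be LLM calls in a real agent. The point it shows is that all improvement lives in an append-only memory of verbatim reflections, with no weight update anywhere.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class EpisodicMemory:
    """Append-only store: reflections are kept verbatim, never summarized."""
    reflections: List[str] = field(default_factory=list)

    def add(self, reflection: str) -> None:
        self.reflections.append(reflection)


def run_trials(
    act: Callable[[List[str]], str],
    evaluate: Callable[[str], bool],
    reflect: Callable[[str], str],
    max_trials: int = 5,
) -> Tuple[Optional[int], EpisodicMemory]:
    """Retry a task, writing a reflection to memory after each failure.

    Nothing here updates model weights: all learning lives in `memory`.
    """
    memory = EpisodicMemory()
    for trial in range(1, max_trials + 1):
        action = act(memory.reflections)  # policy conditions on past reflections
        if evaluate(action):              # binary success/failure signal
            return trial, memory
        memory.add(reflect(action))       # self-diagnosis, stored uncompressed
    return None, memory


# Toy stand-ins for the LLM calls (hypothetical, for illustration only).
CANDIDATES = ["plan A", "plan B", "plan C"]


def toy_act(reflections: List[str]) -> str:
    # Pick the first candidate that no stored reflection rules out.
    for candidate in CANDIDATES:
        if not any(f"'{candidate}' failed" in r for r in reflections):
            return candidate
    return CANDIDATES[-1]


def toy_evaluate(action: str) -> bool:
    return action == "plan C"


def toy_reflect(action: str) -> str:
    return f"'{action}' failed; try a different plan next time."


trials_needed, memory = run_trials(toy_act, toy_evaluate, toy_reflect)
```

In this toy run the agent succeeds on its third trial, after two failed plans have each left one reflection in memory. Swapping `toy_reflect` for a summarizer that merges reflections would be the compression step the note warns about.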
The case against consolidation gets sharper when you look at what folding things into weights actually costs. Proxy-tuning shows that direct fine-tuning corrupts knowledge stored in a model's lower layers, whereas leaving the base weights untouched and steering at decoding time closes 88–91% of the alignment gap while preserving what the model knew (see: Can decoding-time tuning preserve knowledge better than weight fine-tuning?). In other words, 'consolidation into weights' is not free: it can overwrite the very knowledge you wanted to keep. That's the central tension the question is poking at.
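The note does not spell out how decoding-time steering works, so the following is a sketch under a stated assumption: that proxy-tuning adds the logit difference between a small tuned "expert" and its untuned "anti-expert" to the large base model's logits before sampling. The function names and the three-token logits are made up for illustration; in practice the three logit vectors would come from three separate forward passes.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_probs(
    base: List[float], expert: List[float], anti_expert: List[float]
) -> List[float]:
    """Next-token distribution from base logits plus the (expert - anti-expert) offset.

    The base model's weights are never modified; only its output logits are shifted.
    """
    shifted = [b + (e - a) for b, e, a in zip(base, expert, anti_expert)]
    return softmax(shifted)


# Made-up logits over a 3-token vocabulary (assumption, illustration only).
base_logits = [2.0, 1.0, 0.0]
expert_logits = [0.5, 0.5, 2.0]  # the tuned small model prefers token 2
anti_logits = [0.5, 0.5, 0.5]    # the untuned small model has no preference

p_base = softmax(base_logits)
p_proxy = proxy_tuned_probs(base_logits, expert_logits, anti_logits)
unchanged = proxy_tuned_probs(base_logits, anti_logits, anti_logits)
```

Here `p_proxy` puts more mass on token 2 than `p_base` does, while `unchanged` equals `p_base` because an expert identical to its anti-expert contributes no offset. That second property is the sense in which the base model's stored knowledge is left alone.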
But the answer isn't a clean win for modules. The most interesting note here is the failure mode. COMEDY tries to replace retrieval entirely by having one model continuously compress conversation history into its own running memory, and empirical work shows that this continuous reprocessing follows an inverted-U curve: past a point it degrades *below* having no memory at all, through misgrouping, context loss, and overfitting (see: Can a single model replace retrieval for long-term conversation memory?). So aggressive consolidation, even outside the weights, can actively hurt. The principle that makes Reflexion work, keeping memory uncompressed, is exactly what COMEDY violates.
The more sophisticated framing the corpus offers is that this may be a false binary. Titans architecturally *separates* the two: fast attention as short-term memory and a neural memory module that learns at inference time to store surprising tokens, scaling past 2M tokens without the quadratic cost (see: Can neural memory modules scale language models beyond attention limits?). Latent-Thought models go further with explicit dual-rate learning, in which fast local variational updates are coupled to slow global decoder learning, treating fast and slow memory as complementary scaling dimensions rather than competitors (see: Can latent thought vectors scale language models beyond parameters?).
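To make "a memory module that learns at inference time to store surprising tokens" concrete, here is a toy sketch. It assumes surprise is measured as the gradient of an associative recall loss and that writes use momentum plus a forgetting term; the linear memory, the fixed scalar rates, and the class name `SurpriseMemory` are simplifying assumptions for illustration, not the Titans architecture itself.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


class SurpriseMemory:
    """Toy linear associative memory that is updated at inference time."""

    def __init__(self, dim: int, lr: float = 0.1,
                 momentum: float = 0.5, decay: float = 0.0) -> None:
        self.M: Matrix = [[0.0] * dim for _ in range(dim)]  # memory contents
        self.S: Matrix = [[0.0] * dim for _ in range(dim)]  # running surprise
        self.lr, self.momentum, self.decay = lr, momentum, decay

    def read(self, key: Vector) -> Vector:
        """Recall the value associated with `key` (matrix-vector product)."""
        return [sum(m * k for m, k in zip(row, key)) for row in self.M]

    def write(self, key: Vector, value: Vector) -> float:
        """Store (key, value); return the surprise measured before the update."""
        err = [p - v for p, v in zip(self.read(key), value)]
        # Gradient of ||M k - v||^2 with respect to M is 2 * err * key^T.
        grad = [[2.0 * e * k for k in key] for e in err]
        surprise = sum(g * g for row in grad for g in row) ** 0.5
        for i in range(len(err)):
            for j in range(len(key)):
                # Momentum carries past surprise; decay forgets old content.
                self.S[i][j] = self.momentum * self.S[i][j] - self.lr * grad[i][j]
                self.M[i][j] = (1.0 - self.decay) * self.M[i][j] + self.S[i][j]
        return surprise


memory = SurpriseMemory(dim=2)
key, value = [1.0, 0.0], [0.5, -1.0]
first_surprise = memory.write(key, value)
for _ in range(100):
    last_surprise = memory.write(key, value)
recalled = memory.read(key)
```

The first write of an unseen pair produces a large surprise and a large update; once the pair is stored, repeated writes produce almost none, so familiar input barely changes the memory. Each write costs the same regardless of how much history came before, which is the intuition behind avoiding attention's quadratic growth.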
So the live research question isn't 'modules vs. fast weights' but *what deserves to be consolidated and what should stay retrievable*. Memory modules win on continual learning and knowledge preservation; consolidation wins on compression and speed, until the compression itself becomes the failure. The best systems are building the fast/slow boundary in on purpose.
Sources (6 notes)
Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.
AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.
Proxy-tuning closes 88–91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
COMEDY merges memory generation, compression, and response into one operation, tracking event recaps, user portraits, and relationship dynamics without vector-DB retrieval. However, empirical work shows that continuous reprocessing follows an inverted-U curve, degrading below the no-memory baseline due to misgrouping, context loss, and overfitting.
The Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.
Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.