INQUIRING LINE


Should AI store new lessons in a separate memory bank, or permanently rewrite its own brain to absorb them?

Do long-term memory modules outperform consolidation into fast weights?

This explores whether keeping what a model learns in a separate, retrievable memory module beats folding that experience back into the model's weights (fine-tuning or 'consolidation').

The corpus leans, with caveats, toward keeping memory outside the weights. The strongest pattern across these notes is that agents can learn continuously without touching their parameters at all. Reflexion has agents write verbal self-diagnoses into episodic memory after each failure, improving across episodes with zero weight updates; notably, keeping those reflections uncompressed preserves their usefulness (see "Can agents learn from failure without updating their weights?"). AgentFly pushes this further, formalizing the whole learning loop as memory operations over case, subtask, and tool modules, and reaching 87.88% on GAIA validation without modifying a single parameter (see "Can agents learn continuously from experience without updating weights?").
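The loop described above can be sketched in a few lines. This is a minimal illustration of the general idea, not Reflexion's actual implementation: the names `EpisodicMemory`, `run_trials`, the `max_items` cap, and the three toy functions are all hypothetical stand-ins for model and environment calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class EpisodicMemory:
    """Stores verbal reflections verbatim; nothing is summarized or merged."""
    reflections: List[str] = field(default_factory=list)
    max_items: int = 3  # hypothetical cap: keep only the most recent reflections

    def add(self, reflection: str) -> None:
        self.reflections.append(reflection)
        self.reflections = self.reflections[-self.max_items:]

    def as_context(self) -> str:
        return "\n".join(f"- {r}" for r in self.reflections)


def run_trials(
    attempt: Callable[[str, str], str],
    evaluate: Callable[[str], bool],
    reflect: Callable[[str, str], str],
    task: str,
    max_trials: int = 5,
) -> Tuple[bool, int, EpisodicMemory]:
    """Retry a task, feeding earlier reflections back in as plain text.

    attempt(task, context) -> answer; evaluate(answer) -> binary success;
    reflect(task, answer) -> verbal self-diagnosis. All three stand in for
    model or environment calls. No parameters are updated anywhere.
    """
    memory = EpisodicMemory()
    for trial in range(1, max_trials + 1):
        answer = attempt(task, memory.as_context())
        if evaluate(answer):
            return True, trial, memory
        memory.add(reflect(task, answer))
    return False, max_trials, memory


# Toy stand-ins so the sketch runs without a model.
def toy_attempt(task: str, context: str) -> str:
    return "4" if "add, not concatenate" in context else "22"


def toy_evaluate(answer: str) -> bool:
    return answer == "4"


def toy_reflect(task: str, answer: str) -> str:
    return f"Answered {answer!r} for {task!r}; add, not concatenate."


solved, trials, memory = run_trials(toy_attempt, toy_evaluate, toy_reflect, "2 + 2")
print(solved, trials, memory.reflections)
```

The only thing that changes between trials is the text held in `memory`, which is the sense in which the agent improves without weight updates.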

The case against consolidation gets sharper when you look at what folding things into weights actually costs. Proxy-tuning shows that direct fine-tuning corrupts knowledge stored in a model's lower layers, whereas leaving the base weights untouched and steering at decoding time closes 88–91% of the alignment gap while preserving what the model knew (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"). In other words, 'consolidation into weights' is not free: it can overwrite the very knowledge you wanted to keep. That is the central tension the question is poking at.
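To make "steering at decoding time" concrete, here is a small sketch of the logit arithmetic usually associated with proxy-tuning: the large base model's next-token logits are shifted by the difference between a small tuned model and its untuned counterpart. The note above only says that proxy-tuning "applies distributional shifts", so treat the exact formula as an assumption; the function names and the three-token logits are made up for illustration.

```python
import math
from typing import Dict

Logits = Dict[str, float]


def softmax(logits: Logits) -> Dict[str, float]:
    """Numerically stable softmax over a token -> logit mapping."""
    peak = max(logits.values())
    exps = {tok: math.exp(v - peak) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def proxy_tuned_distribution(
    base: Logits, expert: Logits, anti_expert: Logits
) -> Dict[str, float]:
    """Shift the base model's logits by (small tuned - small untuned).

    The base model's weights are never touched; only its output
    distribution is adjusted at each decoding step.
    """
    shifted = {tok: base[tok] + expert[tok] - anti_expert[tok] for tok in base}
    return softmax(shifted)


# Made-up logits over a three-token vocabulary, for illustration only.
base = {"Paris": 3.0, "Lyon": 1.0, "idk": 0.5}
expert = {"Paris": 1.0, "Lyon": 0.2, "idk": -1.0}
anti_expert = {"Paris": 0.8, "Lyon": 0.4, "idk": 0.6}

probs = proxy_tuned_distribution(base, expert, anti_expert)
print(max(probs, key=probs.get))
```

Because the shift is applied to outputs rather than parameters, whatever the base model stores in its layers stays exactly as it was, which is the property the paragraph above is pointing at.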

But the answer isn't a clean win for modules. The most interesting note here describes a failure mode. COMEDY tries to replace retrieval entirely by having one model continuously compress conversation history into its own running memory. Separate empirical work finds that this kind of continuous reprocessing follows an inverted-U: past a point it degrades *below* having no memory at all, through misgrouping, context loss, and overfitting (see "Can a single model replace retrieval for long-term conversation memory?"). So aggressive consolidation, even outside the weights, can actively hurt. The mechanism that makes Reflexion work, keeping reflections uncompressed, is exactly what COMEDY gives up.

The more sophisticated framing the corpus offers is that this may be a false binary. Titans architecturally *separates* the two: attention serves as short-term memory, while a neural memory module learns at inference time to store surprising tokens, scaling past 2M tokens without attention's quadratic cost (see "Can neural memory modules scale language models beyond attention limits?"). Latent-Thought models go further with explicit dual-rate learning, in which fast local variational updates are coupled to slow global decoder learning, treating fast and slow memory as complementary scaling dimensions rather than competitors (see "Can latent thought vectors scale language models beyond parameters?").
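One way to picture a memory that "learns at inference time to store surprising tokens" is a small associative store updated by gradient steps, where the size of the prediction error plays the role of surprise. The sketch below is a toy under stated assumptions, not the Titans architecture: it uses a single linear map instead of a deep memory network, and the class name `SurpriseMemory` and its `lr`, `momentum`, and `decay` parameters are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class SurpriseMemory:
    """Linear associative memory updated at inference time by gradient steps."""

    def __init__(self, dim: int, lr: float = 0.4,
                 momentum: float = 0.0, decay: float = 0.0):
        self.weights: Matrix = [[0.0] * dim for _ in range(dim)]
        self.velocity: Matrix = [[0.0] * dim for _ in range(dim)]
        self.lr, self.momentum, self.decay = lr, momentum, decay

    def read(self, key: Vector) -> Vector:
        return matvec(self.weights, key)

    def write(self, key: Vector, value: Vector) -> float:
        """One gradient step on ||M k - v||^2; returns the pre-update error."""
        error = [p - v for p, v in zip(self.read(key), value)]
        surprise = sum(e * e for e in error)
        for i, e in enumerate(error):
            for j, k in enumerate(key):
                grad = 2.0 * e * k  # d/dM[i][j] of the squared error
                self.velocity[i][j] = (
                    self.momentum * self.velocity[i][j] - self.lr * grad
                )
                self.weights[i][j] = (
                    (1.0 - self.decay) * self.weights[i][j] + self.velocity[i][j]
                )
        return surprise


memory = SurpriseMemory(dim=2)
key, value = [1.0, 0.0], [0.5, -0.5]
first = memory.write(key, value)   # novel association: large error
second = memory.write(key, value)  # same association again: smaller error
print(first > second)
```

Inputs the memory already predicts well produce small gradients and barely change it, while novel inputs produce large ones, so storage is spent where the error is. The `decay` term is a simple stand-in for forgetting.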

So the thing you didn't know you wanted to know: the live research question isn't 'modules vs. fast weights' but *what deserves to be consolidated and what should stay retrievable*. Memory modules win on continual learning and knowledge preservation; consolidation wins on compression and speed — until the compression itself becomes the failure. The best systems are building the fast/slow boundary in on purpose.

Sources 6 notes

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Can a single model replace retrieval for long-term conversation memory?

COMEDY merges memory generation, compression, and response into one operation, tracking event recaps, user portraits, and relationship dynamics without vector-DB retrieval. However, empirical work shows continuous reprocessing follows an inverted-U curve, degrading below no-memory baseline due to misgrouping, context loss, and overfitting.

Can neural memory modules scale language models beyond attention limits?

The Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.


Can latent thought vectors scale language models beyond parameters?

Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs · 1.77 match · arXiv
Useful Memories Become Faulty When Continuously Updated by LLMs · 1.77 match · arXiv
Scalable Language Models with Posterior Inference of Latent Thought Vectors · 0.93 match · arXiv
Titans: Learning to Memorize at Test Time · 0.91 match · arXiv
ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory · 0.90 match · arXiv
SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning · 0.89 match · arXiv
Compress to Impress: Unleashing the Potential of Compressive Memory in Real-World Long-Term Conversations · 0.88 match · arXiv
Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments · 0.88 match · arXiv

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether long-term memory modules outperform consolidation into fast weights in large language models. This question spans 2023–2026 research; treat the findings below as dated claims to be re-validated, not current truth.

What a curated library found — and when (findings span 2023–2026, marked where tied to specific papers):
- Agents learn continuously without weight updates via episodic memory; Reflexion shows uncompressed verbal reflection preserves usefulness across episodes (~2023–24).
- Memory-based learning (case, subtask, tool modules) hits 87.88% on GAIA without parameter modification; AgentFly formalizes the loop (~2024).
- Direct consolidation corrupts lower-layer knowledge; proxy-tuning at decoding time recovers 88–91% alignment benefit while preserving pretrained knowledge (~2024).
- Aggressive consolidation (even outside weights) fails: COMEDY's continuous compression degrades below zero-memory baseline through misgrouping and overfitting (~2024).
- Dual fast/slow boundaries work: Titans uses fast attention + learned neural memory module scaling past 2M tokens; Latent-Thought models use fast local variational updates paired with slow global decoder learning (~2024–25).

Anchor papers (verify; mind their dates):
- arXiv:2501.00663 (Titans, 2024-12): neural memory at inference time.
- arXiv:2502.01567 (Latent-Thought, 2025-02): explicit dual-rate learning.
- arXiv:2402.11975 (Compress to Impress, 2024-02): continual memory trade-offs.
- arXiv:2605.12978 (Useful Memories Become Faulty, 2026-05): continuous update failures.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (o1, o3, Gemini 2.5, Claude 4), methods (in-context adaptation, chain-of-thought scaling), or training (RL post-training, test-time optimization) have since RELAXED or OVERTURNED it. Separate the durable question (what *should* be consolidated vs. retrieved?) from perishable limitations (does compression still degrade?); cite what resolved or confirmed each.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — especially anything showing consolidation wins under new conditions or uncompressed retrieval breaks at scale.
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., "Does test-time scaling of memory read-heads make consolidation unnecessary?" or "Can selective consolidation (lossy for low-salience tokens, lossless for surprising ones) match both compression and fidelity?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
