INQUIRING LINE

The more you try to hardwire knowledge into an AI, the more it damages what the model already knows.

What mechanism transfers explicit memories into parametric model weights?

This explores how knowledge stored as explicit, readable memory (text, episodes, retrieved context) gets baked into a model's internal weights — and the corpus's surprising answer is mostly about why you often shouldn't, and what breaks when you try.

The honest synthesis is that the collection approaches this question from the opposite direction: most of its strongest work argues that the transfer into weights is the lossy, fragile step, and that keeping memory explicit often works better. So the most useful lesson here is not a clean 'memory-to-weights' pipeline but why that pipeline tends to corrupt what it absorbs.

Start with what goes wrong when you do force explicit knowledge into weights. Direct fine-tuning, the obvious mechanism, turns out to damage the model's existing knowledge storage in its lower layers. That is why decoding-time proxy-tuning, which never touches the base weights and instead shifts the output distribution, beats fine-tuning on knowledge tasks (see 'Can decoding-time tuning preserve knowledge better than weight fine-tuning?'). The same lesson shows up as 'catastrophic forgetting': writing new lessons into weights overwrites old ones. Several notes treat this as a misallocation problem rather than an inherent cost: route the fast, task-specific lessons into optimized prompts and let the slow weights barely move (see 'Can splitting adaptation into two channels reduce forgetting?'), or freeze the backbone entirely and attach a small auxiliary model (see 'Can continuous reasoning avoid forgetting in instruction-tuned models?').
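To make 'shifts the output distribution' concrete, here is a minimal sketch of decoding-time logit arithmetic. It assumes the commonly described proxy-tuning recipe, in which the frozen base model's next-token logits are offset by the difference between a small tuned model and its untuned counterpart. The notes above do not spell out the formula, so the function names, the three-token vocabulary, and the numbers are all illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_distribution(base_logits, expert_logits, anti_expert_logits):
    """Offset the frozen base logits by (tuned small model - untuned small model).

    Assumption: all three models share one vocabulary, so the vectors align
    position by position. No model weights are read or written here.
    """
    if not (len(base_logits) == len(expert_logits) == len(anti_expert_logits)):
        raise ValueError("all logit vectors must share one vocabulary")
    shifted = [
        b + (e - a)
        for b, e, a in zip(base_logits, expert_logits, anti_expert_logits)
    ]
    return softmax(shifted)


# Toy next-token logits over a 3-token vocabulary (made-up numbers).
base = [2.0, 1.0, 0.0]      # large frozen base model
expert = [0.0, 1.5, 0.0]    # small tuned model (hypothetical)
anti = [0.0, 0.0, 0.0]      # same small model before tuning (hypothetical)

probs = proxy_tuned_distribution(base, expert, anti)
```

In this toy case the offset moves the most likely token from index 0 to index 1, while `base` itself is left exactly as it was. That is the property the paragraph credits for preserving stored knowledge: the adjustment lives entirely in the decoding step.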

If you do want the transfer to be clean, the corpus hints at where the real mechanism lives: not the whole network, but a sparse, structured subnetwork. Reinforcement learning, it turns out, updates only 5–30% of parameters, and those updates are nearly full-rank and nearly identical across random seeds. That suggests the model has structural 'slots' where new behavior gets written, rather than arbitrary smearing across all weights (see 'Does reinforcement learning update only a small fraction of parameters?'). Push that further and you can deliberately isolate task-specific parameter regions and freeze them so new learning doesn't trample old (see 'Can isolating task-specific parameters prevent multi-task fine-tuning interference?'), or train with sparse weights from the start so knowledge lands in interpretable, modular circuits (see 'Can sparse weight training make neural networks interpretable by design?'). That's the closest thing to an actual mechanism: explicit knowledge consolidates into a small, localized set of weights, not the whole model.
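The sparsity claim is the kind of thing one can check by comparing checkpoints. The sketch below shows one plausible way to count which parameters moved and how much two runs agree. It is an assumption-laden illustration, not the measurement used in the cited work: parameters are plain lists of floats, the tolerance `tol` and the Jaccard overlap are choices made here, and the rank of the updates is not examined at all.

```python
def changed_indices(before, after, tol=0.0):
    """Return the set of (tensor_name, position) pairs that moved by more than tol."""
    changed = set()
    for name, old_values in before.items():
        new_values = after[name]
        if len(old_values) != len(new_values):
            raise ValueError(f"shape mismatch for {name}")
        for i, (old, new) in enumerate(zip(old_values, new_values)):
            if abs(new - old) > tol:
                changed.add((name, i))
    return changed


def fraction_updated(before, after, tol=0.0):
    """Fraction of all parameters whose value changed between two checkpoints."""
    total = sum(len(values) for values in before.values())
    if total == 0:
        return 0.0
    return len(changed_indices(before, after, tol)) / total


def jaccard_overlap(a, b):
    """Overlap of two changed-parameter sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy checkpoints with 10 parameters each (made-up values).
base = {"layer0": [0.1, 0.2, 0.3, 0.4, 0.5], "layer1": [1.0, 1.1, 1.2, 1.3, 1.4]}
seed_a = {"layer0": [0.1, 0.25, 0.3, 0.4, 0.5], "layer1": [1.0, 1.1, 1.2, 1.35, 1.4]}
seed_b = {"layer0": [0.1, 0.22, 0.3, 0.4, 0.5], "layer1": [1.0, 1.1, 1.2, 1.3, 1.45]}

share_a = fraction_updated(base, seed_a)
overlap = jaccard_overlap(changed_indices(base, seed_a), changed_indices(base, seed_b))
```

Here each toy run touches 2 of 10 parameters, and the two runs share one of the three positions touched overall. The 'nearly identical across random seeds' finding would correspond to an overlap close to 1 on real checkpoints.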

But the most interesting counter-current is that a whole branch of the collection refuses the transfer altogether. Agents learn continuously by writing experience into an external memory store and never updating a single weight. Examples include episodic reflections from trial and error (see 'Can agents learn from failure without updating their weights?'), formal memory-augmented reinforcement learning that reached 87.88% on GAIA validation with frozen parameters (see 'Can agents learn continuously from experience without updating weights?'), and causal-structured memory that even transfers to new environments because the memory's shape carries the applicability conditions weights would lose (see 'Can frozen language models continually improve through memory structure alone?'). And architectures like Titans build a separate neural-memory module that compresses 'surprising' tokens into long-term storage alongside attention: a learned mechanism for deciding what's worth keeping, sitting next to the weights rather than dissolved into them (see 'Can neural memory modules scale language models beyond attention limits?').
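The pattern of learning through memory writes rather than weight updates can be sketched in a few lines. This is a toy illustration in the spirit of the reflection-based agents described above, not a reproduction of any of them: `frozen_policy` is a hypothetical stand-in for a frozen LLM, retrieval is naive word overlap, and the task and actions are invented.

```python
from dataclasses import dataclass, field


@dataclass
class EpisodicMemory:
    """External store of (task, reflection) pairs; the only thing that changes."""
    entries: list = field(default_factory=list)

    def write(self, task, reflection):
        self.entries.append((task, reflection))

    def retrieve(self, task, k=2):
        """Return up to k reflections whose task shares words with the query."""
        query = set(task.lower().split())
        scored = []
        for position, (stored_task, reflection) in enumerate(self.entries):
            shared = len(query & set(stored_task.lower().split()))
            if shared > 0:
                scored.append((shared, position, reflection))
        # Prefer more shared words, then more recent entries.
        scored.sort(key=lambda item: (-item[0], -item[1]))
        return [reflection for _, _, reflection in scored[:k]]


def frozen_policy(task, reflections):
    """Hypothetical frozen model: a pure function of its prompt, no trainable state."""
    if any("avoid left" in reflection for reflection in reflections):
        return "right"
    return "left"


def run_episode(task, memory, correct_action):
    """Act, observe binary success or failure, and record a reflection on failure."""
    action = frozen_policy(task, memory.retrieve(task))
    success = action == correct_action
    if not success:
        memory.write(task, f"avoid {action}: it failed on '{task}'")
    return success


memory = EpisodicMemory()
first = run_episode("open the door", memory, correct_action="right")
second = run_episode("open the door", memory, correct_action="right")
```

The first attempt fails and leaves a reflection behind; the second succeeds because that reflection is retrieved into the policy's input. The policy function is identical on both attempts, so all of the improvement is attributable to the memory store.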

The thing you didn't know you wanted to know: there may be no faithful mechanism for transferring explicit memory into weights. Identical model behavior can hide radically different internal structures, and gains in one property, such as helpfulness or accuracy, reliably degrade another, such as faithfulness or calibration (see 'What really happens inside a language model?'). The corpus's working answer is that the cleanest 'memory' is often the memory you leave explicit.

Sources 11 notes

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Can splitting adaptation into two channels reduce forgetting?

Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.

Can continuous reasoning avoid forgetting in instruction-tuned models?

SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Can isolating task-specific parameters prevent multi-task fine-tuning interference?

Research shows that identifying core parameter regions per task, clustering overlapping tasks, and freezing core parameters while geometrically merging non-core parameters consistently outperforms standard multi-task fine-tuning. Temporal task scheduling alone proves insufficient without explicit structural parameter isolation.

Can sparse weight training make neural networks interpretable by design?

Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Can frozen language models continually improve through memory structure alone?

Agents using causal-form memory (preserving applicability conditions) outperform generic reflection by 23 points on repeated trials and gain 4-17 points transferring to new environments, showing memory shape matters more than parameter updates.

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

What really happens inside a language model?

Research into mechanistic interpretability, cognitive models, and training dynamics shows that identical benchmark performance conceals radically different internal structures. Improving one capability (helpfulness, accuracy) reliably degrades others (faithfulness, calibration, diversity).

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs (match score 3.44)
Useful Memories Become Faulty When Continuously Updated by LLMs (match score 2.64)
The AI Hippocampus: How Far are We From Human Memory? (match score 2.53)
CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization (match score 1.76)
SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning (match score 1.73)
SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs (match score 1.72)
Rethinking Memory as Continuously Evolving Connectivity (match score 1.72)
Are We Ready For An Agent-Native Memory System? (match score 1.71)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about memory consolidation in LLMs. The question remains open: what mechanism (if any) faithfully transfers explicit memories into parametric weights?

What a curated library found — and when (findings span 2023–2026; these are dated claims, not current truth):
• Direct fine-tuning damages lower-layer knowledge; decoding-time proxy-tuning avoids weight updates entirely and outperforms on knowledge tasks (2024–2025).
• RL updates only 5–30% of parameters, in sparse subnetworks whose updates are nearly full-rank and nearly identical across random seeds, suggesting structural 'slots' for new behavior rather than distributed smearing (2025).
• Freezing backbones and bolting auxiliary modules, or isolating core parameters and blocking cross-task interference, prevent catastrophic forgetting (2025–2026).
• Memory-augmented agents that write experience to external stores and never update weights hit 87.88% on hard benchmarks; causal-structured memory transfers to new environments because structure lives outside weights (2024–2026).
• Neural memory modules (e.g., Titans) compress surprising tokens separately from attention, embodying a learned mechanism for what to memorize without dissolving it into weights (2024).

Anchor papers (verify; mind their dates):
• arXiv:2505.11711 (2025): RL finetunes sparse subnetworks
• arXiv:2508.21741 (2025): Parameter isolation prevents interference
• arXiv:2511.13653 (2026): Weight sparsity yields interpretable circuits
• arXiv:2605.12978 (2026): Continuously updated memories become faulty

Your task:
(1) RE-TEST EACH CONSTRAINT. For proxy-tuning, parameter sparsity, and frozen-backbone + auxiliary-module designs: do newer training methods, architectural innovations (e.g., mixture-of-experts, dynamic routing), or post-hoc editing tools now *enable* weight updates without knowledge loss? Where does catastrophic forgetting still occur? Separate the durable question (how to consolidate without erasing) from perishable limitations (e.g., fine-tuning tech from 2025 may be outmoded).
(2) Surface the strongest CONTRADICTING work from the last 6 months—any paper showing faithful weight-based consolidation *does* work, or that external memory architectures fail where weights succeed.
(3) Propose 2 questions assuming the regime has shifted: (a) if parameter isolation is now standard, what organizes which parameters learn which memories? (b) if frozen-weight + episodic memory is now dominant, how do we audit whether the explicit store is sufficient for the task's causal structure?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
