What mechanism transfers explicit memories into parametric model weights?
This explores how knowledge stored as explicit, readable memory (text, episodes, retrieved context) gets baked into a model's internal weights — and the corpus's surprising answer is mostly about why you often shouldn't, and what breaks when you try.
The honest synthesis is that the collection circles this question from the opposite direction. Most of its strongest work argues that the transfer into weights is the lossy, fragile step, and that keeping memory explicit often works better. So the most useful thing to learn here isn't a clean 'memory-to-weights' pipeline; it's why that pipeline tends to corrupt what it absorbs.
Start with what goes wrong when you do force explicit knowledge into weights. Direct fine-tuning, the obvious mechanism, turns out to damage the model's existing knowledge storage in its lower layers. That is why proxy-tuning, a decoding-time method that never touches the base weights and only nudges the output distribution, actually beats fine-tuning on knowledge tasks Can decoding-time tuning preserve knowledge better than weight fine-tuning?. The same lesson shows up as catastrophic forgetting: writing new lessons into weights overwrites old ones. Several notes treat this as a misallocation problem rather than an inherent cost. One routes the fast, task-specific lessons into optimized prompts and lets the slow weights barely move Can splitting adaptation into two channels reduce forgetting?; another freezes the backbone entirely and bolts on a small auxiliary model Can continuous reasoning avoid forgetting in instruction-tuned models?.
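Proxy-tuning, as published, works in logit space. At each decoding step it adds to the large base model's next-token logits the difference between a small tuned model (the "expert") and the same small model before tuning (the "anti-expert"), then takes a softmax. A minimal sketch, assuming all three models share one vocabulary and hand back raw logits as plain lists; the function names are hypothetical. The base model's weights are only read, which is why its stored knowledge is left intact.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_distribution(
    base_logits: List[float],
    expert_logits: List[float],
    anti_expert_logits: List[float],
) -> List[float]:
    """Next-token distribution for a proxy-tuned model (hypothetical helper).

    The large base model is never modified; the small tuned model (expert)
    and its untuned version (anti-expert) contribute only the *difference*
    that tuning introduced, added to the base logits at decoding time.
    """
    if not (len(base_logits) == len(expert_logits) == len(anti_expert_logits)):
        raise ValueError("all models must share the same vocabulary")
    shifted = [
        b + e - a
        for b, e, a in zip(base_logits, expert_logits, anti_expert_logits)
    ]
    return softmax(shifted)


if __name__ == "__main__":
    # Toy 3-token vocabulary. The small pair disagrees only on token 2,
    # so only that token's probability is pushed up in the base model.
    base = [2.0, 1.0, 0.5]
    expert = [0.3, 0.2, 1.5]
    anti_expert = [0.3, 0.2, 0.5]
    print("base:        ", [round(p, 3) for p in softmax(base)])
    print("proxy-tuned: ", [round(p, 3) for p in
                            proxy_tuned_distribution(base, expert, anti_expert)])
```

Tokens on which the expert and anti-expert agree keep their base-model odds relative to each other; only the tokens whose logits the small model's tuning changed get shifted.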
If you do want the transfer to be clean, the corpus hints at where the real mechanism lives: not the whole network, but a sparse, structured subnetwork. Reinforcement learning, it turns out, only updates 5–30% of parameters, and those updates are nearly full-rank and nearly identical across random seeds — meaning the model has structural 'slots' where new behavior gets written, not arbitrary smearing across all weights Does reinforcement learning update only a small fraction of parameters?. Push that further and you can deliberately isolate task-specific parameter regions and freeze them so new learning doesn't trample old Can isolating task-specific parameters prevent multi-task fine-tuning interference?, or train with sparse weights from the start so knowledge lands in interpretable, modular circuits Can sparse weight training make neural networks interpretable by design?. That's the closest thing to an actual mechanism: explicit knowledge consolidates into a small, localized set of weights, not the whole model.
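The sparsity and seed-consistency claims are straightforward to measure on any pair of checkpoints: mark which parameters moved, take the fraction, and compare the masks from two runs. The sketch below does that over flat lists of floats and adds the complementary step from the parameter-isolation work, an update that skips frozen "core" parameters. Flat parameter lists, the tolerance, Jaccard overlap as the similarity measure, and plain SGD are simplifying assumptions, and all names are hypothetical; real checkpoints would be compared tensor by tensor.

```python
from typing import List, Sequence


def update_mask(before: Sequence[float], after: Sequence[float],
                tol: float = 1e-8) -> List[bool]:
    """True wherever a parameter moved by more than `tol` between checkpoints."""
    if len(before) != len(after):
        raise ValueError("checkpoints must have the same number of parameters")
    return [abs(a - b) > tol for b, a in zip(before, after)]


def updated_fraction(mask: Sequence[bool]) -> float:
    """Share of parameters that were updated (the 5-30% figure for RL)."""
    return sum(mask) / len(mask)


def mask_overlap(mask_a: Sequence[bool], mask_b: Sequence[bool]) -> float:
    """Jaccard overlap of two update masks, e.g. from runs with different seeds.

    Values near 1.0 mean both runs wrote into the same parameter 'slots'.
    """
    both = sum(a and b for a, b in zip(mask_a, mask_b))
    either = sum(a or b for a, b in zip(mask_a, mask_b))
    return both / either if either else 1.0


def masked_sgd_step(params: Sequence[float], grads: Sequence[float],
                    frozen: Sequence[bool], lr: float = 0.1) -> List[float]:
    """One plain SGD step that leaves frozen ('core') parameters untouched."""
    return [p if f else p - lr * g for p, g, f in zip(params, grads, frozen)]


if __name__ == "__main__":
    base = [0.0] * 10
    # Two hypothetical fine-tuning runs with different seeds.
    run_a = [0.0, 0.5, 0.0, 0.0, 0.0, 0.0, -0.2, 0.0, 0.0, 0.0]
    run_b = [0.0, 0.4, 0.0, 0.0, 0.0, 0.0, -0.3, 0.0, 0.0, 0.1]
    mask_a, mask_b = update_mask(base, run_a), update_mask(base, run_b)
    print(updated_fraction(mask_a), updated_fraction(mask_b))
    print(mask_overlap(mask_a, mask_b))
    # Freeze what run_a wrote, so a later task cannot overwrite it.
    print(masked_sgd_step(run_a, [1.0] * 10, frozen=mask_a))
```

Freezing exactly the entries an earlier run updated is one simple way to protect a task's "slots"; the isolation work described above additionally clusters overlapping tasks and merges the non-core parameters.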
But the most interesting counter-current is that a whole branch of the collection refuses the transfer altogether. Agents learn continuously by writing experience into an external memory store and never updating a single weight: episodic reflections from trial and error Can agents learn from failure without updating their weights?, a memory-augmented MDP formulation of reinforcement learning that reached 87.88% on GAIA validation with frozen parameters Can agents learn continuously from experience without updating weights?, and causal-structured memory that even transfers to new environments because the memory's shape carries the applicability conditions that weights would lose Can frozen language models continually improve through memory structure alone?. And architectures like Titans build a separate neural-memory module that compresses 'surprising' tokens into long-term storage alongside attention: a learned mechanism for deciding what's worth keeping, sitting next to the weights rather than dissolved into them Can neural memory modules scale language models beyond attention limits?.
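Titans' notion of surprise is concrete: the memory module is itself trained online, at inference time, to map keys to values, and the gradient of its prediction loss ℓ = ||M k − v||² sets how strongly a token gets written. In the paper's update rule, surprise carries momentum, S_t = η S_{t−1} − θ ∇ℓ, and memory decays through a forgetting gate, M_t = (1 − α) M_{t−1} + S_t. A minimal sketch, assuming a linear memory, keys and values supplied directly, and fixed scalar gates where the paper makes them data-dependent; the class and method names are hypothetical. Only the memory matrix changes here; no other model weights are involved.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Matrix-vector product for small pure-Python matrices."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in m]


class LinearNeuralMemory:
    """Toy Titans-style long-term memory: a linear map M updated at test time.

    Simplifying assumptions (not the paper's exact setup): M is a single
    d x d matrix, keys and values are passed in directly rather than
    projected from token embeddings, and the momentum (eta), step size
    (theta) and forgetting (alpha) gates are fixed scalars.
    """

    def __init__(self, dim: int, eta: float = 0.9, theta: float = 0.1,
                 alpha: float = 0.01) -> None:
        self.dim = dim
        self.eta, self.theta, self.alpha = eta, theta, alpha
        self.M = [[0.0] * dim for _ in range(dim)]  # memory weights
        self.S = [[0.0] * dim for _ in range(dim)]  # surprise with momentum

    def _error(self, key: Vector, value: Vector) -> Vector:
        return [p - v for p, v in zip(matvec(self.M, key), value)]

    def surprise(self, key: Vector, value: Vector) -> float:
        """Associative loss ||M k - v||^2: large when the pair is unexpected."""
        return sum(e * e for e in self._error(key, value))

    def write(self, key: Vector, value: Vector) -> float:
        """One online update; returns the surprise measured before writing."""
        err = self._error(key, value)  # computed with M_{t-1}
        for i in range(self.dim):
            for j in range(self.dim):
                grad = 2.0 * err[i] * key[j]  # d||Mk - v||^2 / dM[i][j]
                # S_t = eta * S_{t-1} - theta * grad
                self.S[i][j] = self.eta * self.S[i][j] - self.theta * grad
                # M_t = (1 - alpha) * M_{t-1} + S_t
                self.M[i][j] = (1.0 - self.alpha) * self.M[i][j] + self.S[i][j]
        return sum(e * e for e in err)

    def read(self, query: Vector) -> Vector:
        """Retrieve without writing: y = M q."""
        return matvec(self.M, query)


if __name__ == "__main__":
    mem = LinearNeuralMemory(dim=2)
    k, v = [1.0, 0.0], [1.0, 0.0]
    for step in range(200):
        s = mem.write(k, v)
        if step < 3:
            print(f"write {step}: surprise {s:.3f}")
    print("familiar pair:", round(mem.surprise(k, v), 4))
    print("novel pair:", round(mem.surprise([0.0, 1.0], [0.0, 1.0]), 4))
```

Repeating a familiar pair drives its surprise toward zero, while an unseen pair still registers a large loss; that gap is what lets the memory prioritize surprising tokens over ones it already holds.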
The thing you didn't know you wanted to know: there may be no faithful mechanism for transferring explicit memory into weights, because identical model behavior can hide radically different internal structures, and gains in one property, such as accuracy, reliably degrade others such as faithfulness or calibration What actually happens inside a language model?. The corpus's working answer is that the cleanest 'memory' is often the memory you leave explicit.
Sources: 11 notes
Proxy-tuning closes 88–91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.
SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.
Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, an intrinsic update sparsity that arises without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
Research shows that identifying core parameter regions per task, clustering overlapping tasks, and freezing core parameters while geometrically merging non-core parameters consistently outperforms standard multi-task fine-tuning. Temporal task scheduling alone proves insufficient without explicit structural parameter isolation.
Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.
Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.
AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.
Agents using causal-form memory (preserving applicability conditions) outperform generic reflection by 23 points on repeated trials and gain 4–17 points transferring to new environments, showing memory shape matters more than parameter updates.
Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.
Research shows that LLMs can achieve the same output through different internal mechanisms, and improvements in one dimension like accuracy reliably degrade others like faithfulness and calibration. Internal structure matters even when behavior appears identical.