SYNTHESIS NOTE

Can frozen language models continually improve through memory structure alone?

If agents can't update parameters, what form of textual memory lets them keep learning across trials and transfer to new tasks without retraining?

Synthesis note · 2026-05-03 · sourced from Action Models

CLIN argues that the bottleneck for continual learning in language agents is not parameter updates but the structure of what gets remembered. Reflexion-style agents (see Can agents learn from failure without updating their weights?) maintain "helpful hints" — generic verbal reflections that work for the immediate trial but transfer poorly across tasks and environments. CLIN's wager is that a specific style of memory — causal abstractions of the form "opening doors may be necessary for movement between rooms" — produces durable, transferable knowledge because causal structure is what predicts which action to take next.
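As a rough illustration of the memory form described above, the sketch below stores one causal abstraction as structured fields and renders it into the kind of sentence quoted in this note. The class name, field names, default phrases, and the `build_prompt` helper are hypothetical; the note does not say how CLIN represents its memory or injects it into the model's context.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CausalAbstraction:
    """One memory entry: a hedged causal link between an action and an outcome."""
    action: str                          # e.g. "opening doors"
    outcome: str                         # e.g. "movement between rooms"
    modal: str = "may"                   # hedge word (assumed field)
    relation: str = "be necessary for"   # relation phrase (assumed field)

    def render(self) -> str:
        # Produce the natural-language form that a frozen model can read.
        return f"{self.action} {self.modal} {self.relation} {self.outcome}"


def build_prompt(task: str, memory: list[CausalAbstraction]) -> str:
    """Prepend rendered memory entries to a task description (hypothetical layout)."""
    lines = [f"- {entry.render()}" for entry in memory]
    return "Learned so far:\n" + "\n".join(lines) + f"\n\nTask: {task}"


memory = [CausalAbstraction("opening doors", "movement between rooms")]
print(build_prompt("find the thermometer", memory))
```

Keeping the hedge word and the outcome as explicit fields makes it harder to accidentally collapse an entry into an unconditional hint when it is rewritten.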

Empirically, the wager pays off. On ScienceWorld, CLIN beats state-of-the-art reflective agents such as Reflexion by 23 absolute points on repeated trials. More importantly, it transfers: zero-shot performance improves by 4 points when moving to new environments (13 points when moving to new tasks), and continued memory updates in the new setting add another 17 points (7 for new tasks). The causal-abstraction memory is therefore not just a within-task accelerator but a substrate for cross-environment generalization.

The conceptual move is to position language-model agents as a modern instantiation of action model learning, but with the action model written in natural language and continually edited rather than learned as parameters. Useful causal knowledge persists across trials; unhelpful causal knowledge is dropped. This suggests a new architectural pattern: agents built on frozen models can still improve continually and rapidly if the memory representation is the right shape. The shape that matters is causal, not encyclopedic. That position pairs interestingly with Can agents learn reusable sub-task routines from past experience? (workflow-shaped memory) and Does state-indexed memory outperform high-level workflow memory for web agents? (state-action-shaped memory). The three notes target the same problem (what shape should agent memory take?) and disagree on the answer.
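The "persist what helps, drop what does not" loop can be sketched as an editable text store sitting beside a frozen model. Everything here is an assumption made for illustration: the `CausalMemory` and `MemoryEntry` names, the per-trial `observations` mapping, and especially the counting rule used to prune entries. The note does not describe how CLIN decides what to drop, so the rule below is a stand-in, not CLIN's method.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    text: str
    held: int = 0     # trials in which the statement held up
    failed: int = 0   # trials in which it did not


@dataclass
class CausalMemory:
    """Natural-language action model that is edited between trials."""
    entries: dict[str, MemoryEntry] = field(default_factory=dict)

    def update(self, observations: dict[str, bool], max_failed: int = 2) -> None:
        """Edit memory after one trial.

        observations maps a causal statement to whether it held in that trial.
        """
        for text, held in observations.items():
            entry = self.entries.setdefault(text, MemoryEntry(text))
            if held:
                entry.held += 1
            else:
                entry.failed += 1
        # Assumed pruning rule: drop a statement once it has failed max_failed
        # times, unless it has held more often than it has failed.
        self.entries = {
            text: e for text, e in self.entries.items()
            if e.failed < max_failed or e.held > e.failed
        }

    def as_text(self) -> list[str]:
        """Statements to place in the frozen model's context for the next trial."""
        return sorted(self.entries)


memory = CausalMemory()
door_rule = "opening doors may be necessary for movement between rooms"
lamp_rule = "switching on lamps may be necessary for movement between rooms"
for _ in range(2):
    memory.update({door_rule: True, lamp_rule: False})
print(memory.as_text())
```

No model weights appear anywhere in the loop; the only state that changes between trials is the text returned by `as_text`.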

Why causal form survives where heuristic consolidation fails. Late-2025 evidence reframes CLIN's success. The pattern "opening doors may be necessary for movement between rooms" is not just a useful abstraction; it is an applicability conditional. The phrase "may be necessary for" records the conditions under which the abstraction holds. Compare this to a heuristic summary like "always open doors to make progress," which strips the condition. See Does agent memory degrade when continuously consolidated? for the empirical case that LLM-driven consolidation regresses below no-memory baselines precisely because it strips applicability conditions, and see the tension ops/tensions/strategy-distillation helps when applicability conditions survive — and hurts when they are stripped.md for the resolution hypothesis CLIN exemplifies. CLIN's success and Reflexion's success may both reduce to the same axis: the question is not "raw or abstract" but "does the form preserve the conditions of application." Causal abstractions preserve them by syntactic design; heuristic summaries do not.
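The difference between a condition-preserving rule and a stripped heuristic can be made concrete with a toy model. In this sketch a rule carries an optional `applies_to` goal; a value of `None` stands for a rule whose applicability condition was stripped during consolidation. The `Rule` type, the `suggested_actions` helper, and the exact-string goal match are all hypothetical simplifications. A real agent would leave the matching to the language model reading the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    text: str                  # what the agent reads
    action: str                # action the rule points to
    applies_to: Optional[str]  # goal the rule is conditioned on; None = stripped


def suggested_actions(rules: list[Rule], current_goal: str) -> list[str]:
    """Return actions whose rules apply to the current goal.

    A stripped rule (applies_to is None) fires for every goal.
    """
    return [
        r.action for r in rules
        if r.applies_to is None or r.applies_to == current_goal
    ]


causal = Rule(
    text="opening doors may be necessary for movement between rooms",
    action="open door",
    applies_to="movement between rooms",
)
heuristic = Rule(
    text="always open doors to make progress",
    action="open door",
    applies_to=None,
)

# The causal rule stays silent on an unrelated goal; the heuristic does not.
print(suggested_actions([causal], "measuring temperature"))
print(suggested_actions([heuristic], "measuring temperature"))
```

Under these assumptions the stripped heuristic recommends opening doors even when the goal has nothing to do with moving between rooms, which is the failure mode this paragraph attributes to heuristic consolidation.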

Inquiring lines that read this note

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

What memory architectures best support persistent reasoning across extended interactions?

How can AI agents autonomously learn and transfer skills across tasks?

Can tool adaptation work without freezing the agent in the loop?

How does memorization interact with learning and generalization?

How much does memorization capacity limit a model's ability to learn new information?

Why do multi-turn conversations degrade AI intent and coherence?

How does model weight freezing across users affect virtual instance individuation?

Why do continual learning scenarios trigger catastrophic forgetting and interference?

Why does finetuning cause catastrophic forgetting of model capabilities?

What mechanism transfers explicit memories into parametric model weights?

Why does consolidated memory sometimes degrade agent performance?

What makes naive memory consolidation regress below having no memory at all?

Related concepts in this collection

Can agents learn reusable sub-task routines from past experience? Do web agents fail at long-horizon tasks because they cannot extract and reuse workflows shared across similar problems? This explores whether sub-task abstraction enables skill accumulation rather than task-by-task problem solving.
tension with: CLIN says causal-rule memory transfers; AWM says abstracted workflow-routine memory transfers; both make transferability the criterion but pick different memory shapes.
Does state-indexed memory outperform high-level workflow memory for web agents? Should procedural memory for web agents be organized around specific environment states and actions, or abstracted into higher-level workflows? This matters because web automation demands precise, context-sensitive recall that workflows might lose.
tension with: PRAXIS says local state-action memory beats both abstracted workflows and causal rules for web environments; the three notes form a memory-granularity tension.
Can agents learn from failure without updating their weights? Explores whether language models can improve through trial and error by storing reflections in episodic memory rather than fine-tuning. This matters because it suggests a fundamentally different path to agent adaptation.
extends: Reflexion is the baseline CLIN improves on by 23 points; the contrast is generic-hint memory vs causal-rule memory.
Why do LLM agents ignore condensed experience summaries? LLM agents faithfully learn from raw experience but systematically disregard condensed summaries of the same experience. This study investigates whether the problem lies in how summaries are made, how models process them, or whether models simply don't need them.
open question partially resolved: CLIN's causal form preserves applicability conditions ("may be necessary for X") that heuristic consolidation strips — the faithfulness asymmetry may bite generic summaries but not condition-preserving abstractions.
Does agent memory degrade when continuously consolidated? Can consolidating agent experiences into summaries actually harm long-term performance? Research on ARC-AGI tasks suggests continuous memory updates may reduce capability below the no-memory baseline.
provides the empirical case that consolidation fails when applicability conditions are stripped; CLIN is the positive case for the resolution hypothesis (condition preservation is the load-bearing axis)
Can agents learn better from their failures than successes? Does storing reasoning strategies extracted from both successful and failed experiences improve agent learning compared to tracking only successes or raw trajectories? This matters because failures offer preventative lessons that successes alone cannot teach.
ReasoningBank's strategy-level distillation operates at the same abstraction level as CLIN's causal rules; both preserve applicability conditions through syntactic design
Should successful and failed episodes be processed differently? Explores whether asymmetric treatment of trajectories—preserving successes as full demonstrations while abstracting failures into lessons—could improve both the utility and efficiency of memory in reinforcement learning agents.
SkillRL applies condition-preservation at trajectory-type granularity rather than per-instance: successes stay raw (specifics matter), failures abstracted (specifics don't transfer); CLIN preserves conditions syntactically, SkillRL preserves them by retention choice — same hypothesis, different mechanism
Can agents learn continuously from experience without updating weights? This explores whether LLM agents can adapt to new tasks and failures by retrieving past experiences from memory alone, rather than requiring expensive parameter fine-tuning or rigid hardcoded rules.
complements: both demonstrate that memory-shape choices enable continual adaptation without parameter updates; CLIN uses causal rules, the case-based variant uses retrieval over episodic cases.
Can skill documents be optimized like neural network weights? Can natural-language skill documents be treated as trainable parameters and improved through iterative optimization with validation gating, similar to how model weights are tuned in deep learning?
exemplifies: another frozen-model-plus-editable-text-state design generalizing the trainable-artifact claim
