SYNTHESIS NOTE

Can agents learn continuously from experience without updating weights?

This note explores whether LLM agents can adapt to new tasks and failures by retrieving past experiences from memory alone, rather than relying on expensive parameter fine-tuning or rigid hardcoded rules.

Synthesis note · 2026-02-23 · sourced from Memory

AgentFly addresses a central challenge: LLM agents either follow rigid hardcoded workflows (inflexible) or require parameter fine-tuning (expensive, impractical for continual adaptation). The alternative: learn continuously through memory, not weight updates.

The formalization is a Memory-augmented Markov Decision Process (M-MDP). The agent stores past trajectories as episodic traces — including both successes and failures — and retrieves similar past experiences to guide current decision-making. This aligns with case-based reasoning (CBR), a psychologically grounded learning strategy: humans often solve problems by recalling analogous past situations.
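The note does not give the M-MDP notation, so the following is only a sketch under assumed symbols: a standard MDP tuple extended with a memory space, a policy that mixes over retrieved cases, and a memory that grows by one trace per step. The retrieval distribution \mu and the frozen-LLM distribution p_LLM are names introduced here for illustration.

```latex
\begin{align*}
\text{M-MDP} &= \langle \mathcal{S}, \mathcal{A}, P, R, \gamma, \mathcal{M} \rangle
  && \text{(MDP plus a memory space } \mathcal{M}\text{)} \\
\pi(a \mid s, M) &= \sum_{c \in M} \mu(c \mid s, M)\, p_{\mathrm{LLM}}(a \mid s, c)
  && \text{(retrieve a case } c\text{, then let the frozen LLM act)} \\
M_{t+1} &= M_t \cup \{(s_t, a_t, r_t)\}
  && \text{(store the new episodic trace, success or failure)}
\end{align*}
```

Under this reading, only \mu and M change over time; p_LLM stays fixed, which matches the claim that adaptation needs no weight updates.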

Three memory modules serve distinct functions:

Case Memory — vectorized storage of prior task trajectories (task, plan, success/failure label). Supports retrieval via similarity-based search or an online-updating Q-function. This is the strategic memory: which approaches worked for which kinds of problems.
Subtask Memory — text-based storage of active subtasks and their execution results. Orchestrates the planner-executor interaction within a single task. This is the working memory: what's being done right now.
Tool Memory — text-based logs of tool interactions scoped per subtask. Records what tools were used, what they returned. This is the procedural memory: how specific operations were executed.
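The three modules can be pictured as plain data structures. This is a minimal sketch, not AgentFly's implementation: every class and field name is hypothetical, and the choice to clear Subtask and Tool Memory at the start of each task is an assumption drawn from the "within a single task" and "scoped per subtask" wording above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Case:
    """One Case Memory entry: a past task, the plan used, and its outcome."""
    task: str
    plan: str
    success: bool
    embedding: list  # vectorized task; the encoder is not specified in the note
    q_value: float = 0.0  # only used by the Q-function retrieval variant


@dataclass
class SubtaskRecord:
    """One Subtask Memory entry: an active subtask and its execution result."""
    description: str
    result: Optional[str] = None  # None while the subtask is still running


@dataclass
class ToolCall:
    """One Tool Memory entry: which tool ran and what it returned."""
    tool: str
    arguments: str
    output: str


@dataclass
class AgentMemory:
    cases: list = field(default_factory=list)      # persists across tasks
    subtasks: list = field(default_factory=list)   # assumed to reset per task
    tool_logs: dict = field(default_factory=dict)  # subtask index -> list of ToolCall

    def start_task(self) -> None:
        """Clear per-task state; the case bank survives."""
        self.subtasks.clear()
        self.tool_logs.clear()

    def log_tool(self, subtask_index: int, call: ToolCall) -> None:
        """Append a tool interaction under the subtask it belongs to."""
        self.tool_logs.setdefault(subtask_index, []).append(call)
```

Only `cases` carries learning across tasks in this sketch; the other two stores exist to coordinate planner and executor inside one task.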

The learning mechanism: credit assignment happens via memory rewriting (updating case labels and Q-values based on outcome), and policy improvement happens via memory reading (retrieving relevant cases that shift the planning distribution). No gradient updates to the LLM — the LLM is a fixed reasoning engine, and adaptation happens entirely through what's retrieved into its context.
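A toy version of this write/read loop might look as follows. It is a sketch under stated assumptions, not the paper's algorithm: the tabular per-case Q-value, the neutral initial value of 0.5, the step size `lr`, and the similarity-times-Q ranking are all choices made here for illustration, and `CaseBank` and `build_context` are hypothetical names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class CaseBank:
    def __init__(self, lr=0.5, top_k=2):
        self.cases = []
        self.lr = lr        # hypothetical step size for the Q update
        self.top_k = top_k  # hypothetical number of cases placed in context

    def write(self, embedding, plan, success):
        """Store a finished trajectory, success or failure, with a neutral Q."""
        case = {"embedding": embedding, "plan": plan, "success": success, "q": 0.5}
        self.cases.append(case)
        return case

    def rewrite(self, retrieved, reward):
        """Credit assignment: nudge each retrieved case's Q toward the outcome."""
        for case in retrieved:
            case["q"] += self.lr * (reward - case["q"])

    def read(self, query):
        """Policy improvement: rank cases by similarity weighted by Q."""
        ranked = sorted(
            self.cases,
            key=lambda c: cosine(query, c["embedding"]) * c["q"],
            reverse=True,
        )
        return ranked[: self.top_k]


def build_context(task, retrieved):
    """Format retrieved cases as text for the frozen LLM's prompt."""
    lines = [f"[{'success' if c['success'] else 'failure'}] {c['plan']}" for c in retrieved]
    return "Past cases:\n" + "\n".join(lines) + f"\n\nCurrent task: {task}"
```

The LLM never appears in the update path: `rewrite` changes only stored numbers, and `read` changes only which text reaches the prompt.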

The result: top-1 on GAIA validation (87.88% Pass@3) and 79.40% on the test set, in the deep research setting.

Relative to the note "Can agents learn from failure without updating their weights?", AgentFly provides the formal RL framework for that intuition: the M-MDP formalization shows how credit assignment and policy improvement can operate entirely through memory operations. The Q-function over cases provides a principled retrieval policy that improves with experience, rather than relying on static similarity-based retrieval.

Reweave 2026-05-18: memory versus fine-tuning is not binary; the right architecture is dual-timescale. AgentFly's original framing positioned memory-based adaptation as the alternative to fine-tuning, a choice of one or the other. Late-2025 evidence reframes this as a false dichotomy. The note "Can agents adapt without pausing service to users?" shows that production systems can have both: memory-based adaptation on the fast timescale (zero downtime) and LoRA fine-tuning during user-inactive windows (no service interruption). MetaClaw's OMLS scheduler monitors sleep hours, keyboard inactivity, and calendar occupancy to identify safe windows for weight updates.
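The idle-window check can be sketched as a small predicate over the three signals. This is not MetaClaw's scheduler: how the signals are combined is not stated in this note, so the sketch assumes any one of them suffices, treats an occupied calendar slot as the user being away, and uses hypothetical thresholds (a 23:00 to 07:00 sleep window and 30 minutes of keyboard inactivity).

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class IdleSignals:
    now: datetime
    seconds_since_keypress: float
    calendar_occupied: bool  # assumed to mean the user is in a scheduled event


def in_sleep_hours(now, start=time(23, 0), end=time(7, 0)):
    """True inside a nightly window that wraps past midnight."""
    t = now.time()
    return t >= start or t < end


def safe_window(signals, min_idle_seconds=1800.0):
    """Assumed rule: any single signal of user absence opens the window."""
    return (
        in_sleep_hours(signals.now)
        or signals.seconds_since_keypress >= min_idle_seconds
        or signals.calendar_occupied
    )
```

A stricter deployment could require all three signals to agree; the point of the sketch is only that weight updates are gated on user absence while memory writes are not.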

The implication for AgentFly's design: its case bank addresses the fast-timescale adaptation problem, but the underlying LLM policy weights remain static, so failures that require new capabilities (not just new cases) cannot be resolved by case-based retrieval alone. A dual-timescale architecture would extend AgentFly with idle-window fine-tuning that uses the accumulated case bank as training data. The case bank then serves as both the retrieval store (fast adaptation) and the training dataset (slow weight updates). The note "Does agent memory degrade when continuously consolidated?" points the same way: the right architecture preserves raw cases as first-class evidence but uses them deliberately for both retrieval and training, with explicit gating.
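One way to picture "explicit gating" is a read-only export step that turns the case bank into fine-tuning examples without altering it. The sketch below is illustrative only: the case fields (`task`, `plan`, `success`, `q`), the thresholds `min_q` and `min_cases`, and the prompt/completion layout are all assumptions, not anything specified by AgentFly or MetaClaw.

```python
import json


def select_training_cases(cases, min_q=0.7, min_cases=100):
    """Gate: return candidates only when enough high-value successes exist."""
    chosen = [c for c in cases if c["success"] and c["q"] >= min_q]
    return chosen if len(chosen) >= min_cases else []


def to_jsonl(cases):
    """Serialize cases as prompt/completion pairs, one JSON object per line."""
    return "\n".join(
        json.dumps({"prompt": c["task"], "completion": c["plan"]}) for c in cases
    )
```

Both functions leave `cases` untouched, so retrieval keeps working over the raw evidence whether or not a training run is triggered.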

The corollary: when memory-based RL is presented as "no fine-tuning needed," that framing is correct for the deployment cost story but incomplete for the capability story. Fine-tuning during idle windows causes no service interruption, so it is close to free in user-facing cost, and it addresses what memory-only systems cannot.

Related concepts in this collection 9

How does treating LLMs as multi-step agents change what we can optimize?
Instead of optimizing single prompt-response pairs, what happens when we model LLM agents as temporally-extended decision processes? The question matters because it shifts what becomes trainable.
AgentFly's M-MDP is one concrete instantiation of the broader POMDP paradigm the Agentic RL survey names; memory-as-RL-target generalizes beyond AgentFly's case-based formulation.

Can agents learn better from their failures than successes?
Does storing reasoning strategies extracted from both successful and failed experiences improve agent learning compared to tracking only successes or raw trajectories? This matters because failures offer preventative lessons that successes alone cannot teach.
ReasoningBank generalizes AgentFly's case-based approach: AgentFly stores trajectories as cases, ReasoningBank abstracts trajectories into strategies; both reject parameter updates as the learning mechanism but disagree on what gets stored.

Does agent memory degrade when continuously consolidated?
Can consolidating agent experiences into summaries actually harm long-term performance? Research on ARC-AGI tasks suggests continuous memory updates may reduce capability below the no-memory baseline.
Warning relevant to AgentFly's case rewriting: when the rewriting mechanism is itself an LLM consolidation step, the inverted-U applies; AgentFly's similarity-based retrieval over raw cases may be partially safe because it skips abstraction.

Can agents learn from failure without updating their weights?
Explores whether language models can improve through trial and error by storing reflections in episodic memory rather than fine-tuning. This matters because it suggests a fundamentally different path to agent adaptation.
AgentFly adds M-MDP formalization: credit assignment via memory rewriting, policy improvement via memory reading.

Can agents learn new skills without forgetting old ones?
Explores whether externalized skill libraries (storing learned behaviors as retrievable code rather than parameter updates) can solve the catastrophic forgetting problem that plagues continual learning systems.
VOYAGER composes skills; AgentFly composes cases. Both achieve continual learning without parameter updates.

Can careful selection of 78 demos outperform massive training datasets?
Does strategic curation of high-quality demonstrations unlock agentic capability more efficiently than scaling training data? LIMI achieved 73.5% on AgencyBench with 78 samples versus 10K+ samples for competing models, suggesting data quality may matter more than quantity.
AgentFly's case bank grows from experience; the efficiency principle suggests a small number of high-quality cases may suffice.

How do agentic AI systems decompose into adaptation paradigms?
What are the core dimensions that distinguish different approaches to adapting agents and tools in agentic systems? Understanding this taxonomy could clarify which adaptation strategy fits which problem.
AgentFly is agent-optimized with execution-signaled feedback via memory rewriting.

Can agents adapt without pausing service to users?
Can deployed LLM agents continuously improve their capabilities while serving users without interruption? This explores whether fast behavioral updates and slow policy learning can coexist across different timescales.
MetaClaw extends AgentFly's single-timescale memory-based adaptation with a second timescale (idle-window LoRA fine-tuning), addressing what AgentFly cannot: improving the underlying policy weights, not just the retrievable case bank.

Should agent memory adapt dynamically based on execution feedback?
Can agents improve performance by continuously reshaping memory connections in response to whether tasks succeed or fail, rather than relying on fixed retrieval pipelines? This matters because static memory degrades in changing environments.
Exemplifies: FluxMem's execution-feedback link editing is the topological form of adapting memory from outcomes without parameter updates.
