Can agents learn continuously from experience without updating weights?
This note explores whether LLM agents can adapt to new tasks and failures by retrieving past experiences from memory alone, rather than through expensive parameter fine-tuning or rigid hardcoded rules.
AgentFly addresses a central challenge: LLM agents either follow rigid hardcoded workflows (inflexible) or require parameter fine-tuning (expensive, impractical for continual adaptation). The alternative: learn continuously through memory, not weight updates.
The formalization is a Memory-augmented Markov Decision Process (M-MDP). The agent stores past trajectories as episodic traces — including both successes and failures — and retrieves similar past experiences to guide current decision-making. This aligns with case-based reasoning (CBR), a psychologically grounded learning strategy: humans often solve problems by recalling analogous past situations.
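One way to write the formalization down, as a sketch only: the symbols below are assumed notation for this note and are not quoted from the AgentFly paper. The memory $M_t$ holds past cases, $\mu$ is the retrieval policy that picks a case, and $p_{\mathrm{LLM}}$ is the frozen LLM that acts given the current state and the retrieved case.

```latex
% Assumed notation; a sketch of the M-MDP idea, not a quotation from the paper.
\[
\text{M-MDP} = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma, \mathcal{M} \rangle,
\qquad
M_t = \{\, c_i = (s_i, a_i, r_i) \,\}_{i=1}^{N_t} \in \mathcal{M}
\]
\[
\pi(a \mid s, M_t) \;=\; \sum_{c \in M_t} \mu(c \mid s, M_t)\; p_{\mathrm{LLM}}(a \mid s, c)
\]
```

Under this reading, learning changes only $M_t$ and $\mu$; $p_{\mathrm{LLM}}$ stays fixed, which matches the claim that adaptation happens through what is retrieved into context.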
Three memory modules serve distinct functions:
- Case Memory: vectorized storage of prior task trajectories (task, plan, success/failure label). Supports retrieval via similarity-based search or an online-updating Q-function. This is the strategic memory: which approaches worked for which kinds of problems.
- Subtask Memory: text-based storage of active subtasks and their execution results. Orchestrates the planner-executor interaction within a single task. This is the working memory: what's being done right now.
- Tool Memory: text-based logs of tool interactions scoped per subtask. Records what tools were used and what they returned. This is the procedural memory: how specific operations were executed.
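The three modules can be pictured as plain data structures. The following is a minimal sketch, not AgentFly's actual code: every class, field, and method name is hypothetical, and resetting Subtask and Tool Memory at the start of each task is an assumption drawn from the "within a single task" and "scoped per subtask" descriptions above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Case:
    """One Case Memory entry: a finished task trajectory (hypothetical layout)."""
    task: str
    plan: str
    success: bool
    embedding: List[float]  # vectorized task, used for similarity search
    q_value: float = 0.0    # optional learned retrieval score


@dataclass
class SubtaskRecord:
    """One Subtask Memory entry: an active subtask and its execution result."""
    description: str
    result: Optional[str] = None  # None until the executor reports back


@dataclass
class AgentMemory:
    case_memory: List[Case] = field(default_factory=list)              # kept across tasks
    subtask_memory: List[SubtaskRecord] = field(default_factory=list)  # per task
    tool_memory: Dict[int, List[str]] = field(default_factory=dict)    # subtask id -> tool logs

    def start_task(self) -> None:
        """Clear per-task state (assumed behaviour); Case Memory is kept."""
        self.subtask_memory.clear()
        self.tool_memory.clear()

    def add_subtask(self, description: str) -> int:
        """Register a subtask and return its index within the current task."""
        self.subtask_memory.append(SubtaskRecord(description))
        return len(self.subtask_memory) - 1

    def log_tool_call(self, subtask_id: int, tool: str, output: str) -> None:
        """Append a text log of one tool interaction, scoped to its subtask."""
        self.tool_memory.setdefault(subtask_id, []).append(f"{tool} -> {output}")
```

The split mirrors the lifetimes in the description: only Case Memory outlives a task, so it is the only module that can carry learning from one task to the next.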
The learning mechanism: credit assignment happens via memory rewriting (updating case labels and Q-values based on outcome), and policy improvement happens via memory reading (retrieving relevant cases that shift the planning distribution). No gradient updates to the LLM — the LLM is a fixed reasoning engine, and adaptation happens entirely through what's retrieved into its context.
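The write/read loop described above can be sketched with a toy case bank. This is an illustration under stated assumptions, not the paper's method: the additive score (cosine similarity plus `beta` times the Q-value), the running-average update with step size `lr`, and all names are hypothetical choices made for this note.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class CaseBank:
    def __init__(self, lr=0.5, beta=1.0):
        self.cases = []
        self.lr = lr      # step size for the Q-value update (assumed)
        self.beta = beta  # weight of Q-value relative to similarity (assumed)

    def add(self, embedding, plan):
        """Store a new case with a neutral Q-value and return its index."""
        self.cases.append({"embedding": embedding, "plan": plan, "q": 0.0})
        return len(self.cases) - 1

    def read(self, query, k=1):
        """Memory reading: return indices of the k best cases for this query."""
        def score(i):
            case = self.cases[i]
            return cosine(query, case["embedding"]) + self.beta * case["q"]
        return sorted(range(len(self.cases)), key=score, reverse=True)[:k]

    def rewrite(self, index, reward):
        """Memory rewriting: move the case's Q-value toward the observed reward."""
        case = self.cases[index]
        case["q"] += self.lr * (reward - case["q"])


bank = CaseBank()
a = bank.add([1.0, 0.0], "plan A: search, then summarize")
b = bank.add([1.0, 0.0], "plan B: answer from memory")
bank.rewrite(a, reward=1.0)  # plan A led to success
bank.rewrite(b, reward=0.0)  # plan B led to failure
best = bank.read([1.0, 0.0], k=1)  # plan A now outranks the equally similar plan B
```

Both cases are equally similar to the query, so similarity alone cannot separate them; the outcome-driven Q-value does, and no model weights change at any point.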
The result: top-1 on GAIA validation (87.88% Pass@3) and 79.40% on the test set, in the deep research setting.
Building on the note "Can agents learn from failure without updating their weights?", AgentFly provides the formal RL framework for that intuition: the M-MDP formalization shows how credit assignment and policy improvement can operate entirely through memory operations. The Q-function over cases provides a principled retrieval policy that improves with experience, rather than relying on static similarity-based retrieval.
Reweave 2026-05-18: memory versus fine-tuning is not binary; the right architecture is dual-timescale. AgentFly's original framing positioned memory-based adaptation as the alternative to fine-tuning, a choice of one or the other. Late-2025 evidence reframes this as a false dichotomy. The note "Can agents adapt without pausing service to users?" shows that production systems can have both: memory-based adaptation on the fast timescale (zero downtime) and LoRA fine-tuning during user-inactive windows (no service interruption). MetaClaw's OMLS scheduler monitors sleep hours, keyboard inactivity, and calendar occupancy to identify safe windows for weight updates.
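An idle-window gate of this kind might look roughly like the sketch below. It is not MetaClaw's OMLS scheduler: the note names the three signals but not how they are combined, so the combination rule, the thresholds, and every name here are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta


@dataclass
class IdleSignals:
    """Hypothetical snapshot of the three signals named in the note."""
    now: datetime
    last_keyboard_event: datetime
    calendar_busy: bool


def in_sleep_hours(now, start=time(23, 0), end=time(7, 0)):
    """True inside the sleep window; assumes the window wraps past midnight."""
    t = now.time()
    return t >= start or t < end


def safe_to_fine_tune(sig, min_idle=timedelta(minutes=30)):
    """Assumed rule: never during a calendar event; otherwise sleep hours or a long idle gap."""
    keyboard_idle = (sig.now - sig.last_keyboard_event) >= min_idle
    return (not sig.calendar_busy) and (in_sleep_hours(sig.now) or keyboard_idle)


night = IdleSignals(datetime(2026, 1, 1, 2, 0), datetime(2026, 1, 1, 1, 55), False)
afternoon = IdleSignals(datetime(2026, 1, 1, 14, 0), datetime(2026, 1, 1, 13, 59), False)
ok_at_night = safe_to_fine_tune(night)          # inside sleep hours, no meeting
ok_in_afternoon = safe_to_fine_tune(afternoon)  # user typed one minute ago
```

The fast path (memory reads and writes) never consults this gate; only the slow path (weight updates) does, which is what keeps the two timescales from interfering with service.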
The implication for AgentFly's design: its case bank addresses the fast-timescale adaptation problem, but the underlying LLM policy weights remain static, meaning failures that require new capabilities (not just new cases) cannot be resolved by case-based retrieval alone. A dual-timescale architecture would extend AgentFly with idle-window fine-tuning that uses the accumulated case bank as training data. The case bank becomes both the retrieval store (fast adaptation) and the training dataset (slow weight updates). This is also what the note "Does agent memory degrade when continuously consolidated?" points toward: the right architecture preserves raw cases as first-class evidence but uses them deliberately for both retrieval and training, with explicit gating.
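The "explicit gating" idea can be illustrated with a small export step that turns selected cases into fine-tuning records while leaving the raw case bank untouched. This is a sketch under assumptions: the gate (successful cases with a Q-value at or above `min_q`), the prompt/completion record shape, and the function names are hypothetical, and whether failed cases should also feed training is left open.

```python
import json


def to_training_records(cases, min_q=0.5):
    """Select cases for the slow path; the input list is never modified."""
    records = []
    for case in cases:
        # Explicit gate (assumed): only successful, well-scored cases are trained on.
        if case["success"] and case["q"] >= min_q:
            records.append({"prompt": case["task"], "completion": case["plan"]})
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(r) for r in records)


cases = [
    {"task": "find the release date", "plan": "search, then verify", "success": True, "q": 0.8},
    {"task": "count the citations", "plan": "guess from memory", "success": False, "q": 0.1},
    {"task": "summarize the report", "plan": "read, then outline", "success": True, "q": 0.2},
]
records = to_training_records(cases, min_q=0.5)
```

All three cases remain available for retrieval afterwards; only the first passes the gate into the training set, so the raw evidence is preserved regardless of what the slow path consumes.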
The corollary: when memory-based RL is presented as "no fine-tuning needed," that framing is correct for the deployment cost story but incomplete for the capability story. Fine-tuning during idle windows causes no service interruption, so it is close to free in deployment terms, and it addresses what memory-only systems cannot.
Inquiring lines that use this note as a source (109)
This note is a source for these synthesized inquiries.
- How should preference channels from historical sessions inform unified policy learning?
- How should GUI agents remember patterns across different software environments?
- What deployment feedback loops amplify LLM pretraining popularity in live systems?
- Why does persistent memory alone fail to create genuine position-holding in models?
- Can environmental scaffolding replace internal memory scaling in agent design?
- Do dynamic environments enable different kinds of agent-environment coevolution?
- Why does fine-tuning for continuous space cause catastrophic forgetting?
- Why do weak belief tracking and conservative actions trap agents in low-information states?
- What domain properties determine whether causal rules transfer to new agents?
- How does real tool integration change what agents learn compared to simulated tools?
- Can continuum memory systems prevent catastrophic forgetting in neural networks?
- What makes self-modifying architectures learn their own update rules?
- What memory and planning capabilities do AI companions need for evolving user needs?
- Does narrow reallocation to remaining tasks constitute genuine adaptation?
- Can self-distillation reduce catastrophic forgetting in continual learning?
- How do neural networks extend contextual bandits beyond linear reward assumptions?
- Why do LLM agents fail where game-theoretic bots succeed?
- How do virtual model instances preserve identity through load-balancing and failover?
- What access constraints allow description-based adaptation but block conventional techniques?
- How do agentic systems recover when specialized models operate outside their scope?
- What happens when agents interact with environments and learn from their own mistakes?
- Can online RL and trainable agents maintain persona consistency better than fixed environments?
- Can gradient approximation at equilibrium replace backpropagation through time in practice?
- Can humans learn accurate models of AI through repeated interaction without labels?
- Can combinational creativity alone drive open-ended learning in agents?
- Why do memory and feedback loops matter more than model size for agent reliability?
- What data presentation structures enable LLMs to learn decision-making from examples?
- Can episodic memory alone enable learning without parameter updates?
- Is reward propagation in RL formally dual to cause inference in memory?
- How do cascaded probabilistic models compare to reinforcement learning for per-query system design?
- Can agents improve from deployment signals without explicit human annotation?
- What infrastructure decouples generation from training in asynchronous agent loops?
- How should CASA theory be updated for modern personalized agents?
- Can episodic memory of UI traces improve open-world agent adaptation?
- How can agents learn when silence is better than intervention?
- Can RL-trained meta-agents match or exceed manually designed workflows?
- Why is offline knowledge distillation preferred when in-session signals matter?
- How does component-level self-evolution prevent information loss in multi-agent trajectories?
- Can models internalize retrieved context as static parametric knowledge?
- What happens to model reasoning when policy entropy collapses during RL?
- Why do agents fail to internalize value from informative observations?
- Can expert vectors learned offline transfer across multiple model architectures?
- Why do pretrained model priors reduce the usefulness of retrieved experience?
- What persistent memory architectures best support storing precomputed inferences across sessions?
- Why does fine-tuning models for continuous reasoning cause catastrophic forgetting?
- Can state-indexed memory retrieval breadth predict gains in web agent robustness?
- How does PRAXIS differ architecturally from Agent Workflow Memory and causal rule learning?
- What role does self-learning play in improving agent reasoning without annotation?
- Can historical and batch exploration be implemented with the same algorithmic mechanism?
- What non-parametric methods could replace latent factors for inductive learning?
- Can instance-adaptive reasoning happen without sequential token dependencies?
- Why do completion-mode strengths not transfer to agentic settings?
- Can a model be strong at MMLU but weak at long-horizon tasks?
- Can this approach handle continuously changing product inventories in production?
- How should humans specify deterministic abstractions of RL problems?
- Can curator modules trained on one executor transfer to entirely different agent backbones?
- Can agents compress long trajectories without losing critical decision context?
- How do token, parametric, and latent memory forms coexist in single agents?
- Can individual skills improve through reuse and accumulate experience across tasks?
- What is the right granularity level for agent memory to enable both reuse and composition?
- Do learned workflows transfer between different agents with minimal accuracy loss?
- How can memory shift from a passive datastore to an actively trained component?
- When does memory consolidation help agents instead of hurting performance?
- Can agent-controlled memory management outperform fixed consolidation schedules?
- Does workflow-level memory or state-action memory better capture reusable agent knowledge?
- Why does LLM memory consolidation regress below no-memory baselines?
- Can applicability conditions be preserved automatically when agents reflect on trials?
- Can AI models retain knowledge across changing environments without catastrophic forgetting?
- Can neural modules memorize surprising tokens as adaptive long-term memory?
- Why do continuously consolidated agent memories eventually degrade below no-memory baseline?
- Can LLM-synthesized behavioral heuristics compete with learned policy improvements?
- What training method supports dynamic tool discovery in long-horizon agents?
- How does memory folding enable agents to reconsider strategies mid-task?
- How do planning and memory compress agentic system costs?
- What mechanism transfers explicit memories into parametric model weights?
- Can offline recurrent passes replicate sleep-based memory consolidation in AI?
- How can agents learn user preferences during conversation without pre-calibration?
- Can reinforcement learning close the gap between LLM reasoning and action?
- How does KL regularization prevent both forgetting and adaptation loss?
- Can zero-weight drift through external memory replace parameter plasticity entirely?
- What distinguishes working memory from strategic memory in agent task execution?
- Can memory-based adaptation and gradient fine-tuning operate on complementary timescales?
- Why do current metacognitive training loops fail when agents encounter new domains?
- What makes exploration and reflection rewards verifiable in agentic environments?
- Why does credit assignment through memory rewriting avoid expensive LLM parameter updates?
- How should abstraction preserve applicability conditions when distilling experience?
- What makes a learned consolidation rule lossy and where does contamination enter?
- How does SDPO relate to agents learning from verbal reflection without parameter updates?
- How do fast and slow timescales enable continual agent adaptation?
- Can models recover knowledge with completely unrelated retraining tasks?
- How does in-weights adaptation create spurious forgetting in models?
- Can we design efficient agents by targeting constraints directly?
- Can models consolidate context into weights during idle offline phases?
- Do long-term memory modules outperform consolidation into fast weights?
- How can a forgetting policy preserve rare knowledge while preventing over-generalization?
- What properties of agent systems only become visible across multiple sessions?
- How does durable memory quality shape agent performance over time?
- What makes consensus games work without retraining the base model?
- Why does memory consolidation degrade agent performance below baseline?
- Why does continuous agent inference differ from human user inference?
- What can agents learn from the brain's complementary learning systems?
- Can the same compress-then-act pattern work for agent state memory?
- Can context management policies transfer across agents of similar capability levels?
- Which agent architectures consistently outperform base models on hard prediction questions?
- What separates artifact recall from persistent memory commitment in agents?
- How should agents compress episodic interactions into working memory without accumulation?
- Can agents escape weak belief tracking and conservative action selection traps?
- Why does externalized state beat parameter scaling for agent reliability?
- How does externalizing reasoning into harness artifacts improve agent reliability?
Related concepts in this collection (9)
- How does treating LLMs as multi-step agents change what we can optimize?
  Instead of optimizing single prompt-response pairs, what happens when we model LLM agents as temporally-extended decision processes? The question matters because it shifts what becomes trainable.
  AgentFly's M-MDP is one concrete instantiation of the broader POMDP paradigm the Agentic RL survey names; memory-as-RL-target generalizes beyond AgentFly's case-based formulation
- Can agents learn better from their failures than successes?
  Does storing reasoning strategies extracted from both successful and failed experiences improve agent learning compared to tracking only successes or raw trajectories? This matters because failures offer preventative lessons that successes alone cannot teach.
  ReasoningBank generalizes AgentFly's case-based approach: AgentFly stores trajectories as cases, ReasoningBank abstracts trajectories into strategies; both reject parameter updates as the learning mechanism but disagree on what gets stored
- Does agent memory degrade when continuously consolidated?
  Can consolidating agent experiences into summaries actually harm long-term performance? Research on ARC-AGI tasks suggests continuous memory updates may reduce capability below the no-memory baseline.
  warning relevant to AgentFly's case rewriting: when the rewriting mechanism is itself an LLM consolidation step, the inverted-U applies; AgentFly's similarity-based retrieval over raw cases may be partially safe because it skips abstraction
- Can agents learn from failure without updating their weights?
  Explores whether language models can improve through trial and error by storing reflections in episodic memory rather than fine-tuning. This matters because it suggests a fundamentally different path to agent adaptation.
  AgentFly adds M-MDP formalization: credit assignment via memory rewriting, policy improvement via memory reading
- Can agents learn new skills without forgetting old ones?
  Explores whether externalized skill libraries (storing learned behaviors as retrievable code rather than parameter updates) can solve the catastrophic forgetting problem that plagues continual learning systems.
  VOYAGER composes skills; AgentFly composes cases. Both achieve continual learning without parameter updates
- Can careful selection of 78 demos outperform massive training datasets?
  Does strategic curation of high-quality demonstrations unlock agentic capability more efficiently than scaling training data? LIMI achieved 73.5% on AgencyBench with 78 samples versus 10K+ samples for competing models, suggesting data quality may matter more than quantity.
  AgentFly's case bank grows from experience; the efficiency principle suggests a small number of high-quality cases may suffice
- How do agentic AI systems decompose into adaptation paradigms?
  What are the core dimensions that distinguish different approaches to adapting agents and tools in agentic systems? Understanding this taxonomy could clarify which adaptation strategy fits which problem.
  AgentFly is agent-optimized with execution-signaled feedback via memory rewriting
- Can agents adapt without pausing service to users?
  Can deployed LLM agents continuously improve their capabilities while serving users without interruption? This explores whether fast behavioral updates and slow policy learning can coexist across different timescales.
  MetaClaw extends AgentFly's single-timescale memory-based adaptation with a second timescale (idle-window LoRA fine-tuning); it addresses what AgentFly cannot: improving the underlying policy weights, not just the retrievable case bank
- Should agent memory adapt dynamically based on execution feedback?
  Can agents improve performance by continuously reshaping memory connections in response to whether tasks succeed or fail, rather than relying on fixed retrieval pipelines? This matters because static memory degrades in changing environments.
  exemplifies: FluxMem's execution-feedback link editing is the topological form of adapting memory from outcomes without parameter updates
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Useful Memories Become Faulty When Continuously Updated by LLMs
- AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs
- ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory
- SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning
- Rethinking Memory as Continuously Evolving Connectivity
- The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
- Large Language Model Agents Are Not Always Faithful Self-Evolvers
- SkillClaw: Let Skills Evolve Collectively with Agentic Evolver
Original note title
memory-based online reinforcement learning enables continual agent adaptation without fine-tuning through episodic case-based reasoning