SYNTHESIS NOTE


Can agents learn from failure without updating their weights?

Explores whether language models can improve through trial and error by storing reflections in episodic memory rather than fine-tuning. This matters because it suggests a fundamentally different path to agent adaptation.

Synthesis note · 2026-02-22 · sourced from Reasoning by Reflection

Reflexion demonstrates a specific version of the external-feedback principle at system scale: when an agent has access to unambiguous binary feedback from the environment (success = 1, failure = 0), it can write verbal reflections summarizing what went wrong and how to avoid it. These reflections persist in episodic memory across episodes. The agent improves not through gradient descent but through memory accumulation.

The binary reward design is deliberate and consequential. A richer reward model would allow the agent to rationalize partial performance — finding reasons why a partial failure was acceptable. The binary signal eliminates this: the environment says success or failure, with no room for self-serving gradations. The model must genuinely diagnose what went wrong to write a useful reflection.

Two hallucination types receive precise operational definitions: consecutive identical actions in an environment that responded identically (stuck loop) and trajectories exceeding 30 actions without reaching a successful state (inefficient planning). Both are detectable signatures that trigger termination and reflection, rather than indefinite continuation.
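The two termination signatures can be checked mechanically after each step. The sketch below is illustrative only: `should_terminate` and the `(action, observation)` trajectory format are hypothetical names, and it assumes that a single repeat of the same action with the same observation counts as a stuck loop. The budget of 30 actions comes from the text.

```python
from typing import List, Optional, Tuple

# A step is (action, observation). This representation is an assumption
# made for illustration; it is not taken from the Reflexion codebase.
Step = Tuple[str, str]

MAX_ACTIONS = 30  # action budget named in the text


def should_terminate(trajectory: List[Step],
                     max_actions: int = MAX_ACTIONS) -> Optional[str]:
    """Return a reason to stop and reflect, or None to keep acting."""
    # Stuck loop: the agent repeated its last action and the environment
    # responded with the same observation as before.
    if len(trajectory) >= 2 and trajectory[-1] == trajectory[-2]:
        return "stuck_loop"
    # Inefficient planning: more than max_actions steps without success.
    if len(trajectory) > max_actions:
        return "inefficient_planning"
    return None
```

A caller would run this check after every environment step and, on a non-None result, end the episode with reward 0 and trigger reflection.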

The method requires two components: a heuristic for when to terminate and trigger reflection, and a binary reward signal from the environment. This is a low-data-requirement architecture: no fine-tuning, no labeled training set, just a success/fail signal and the model's ability to generate natural language diagnoses.
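These two components fit into a short trial loop. The following is a minimal sketch, not the Reflexion authors' implementation: `run_trials`, `attempt`, and `reflect` are hypothetical names, and the toy functions stand in for what would be LLM-driven acting and reflecting in a real system.

```python
from typing import Callable, List, Tuple


def run_trials(
    attempt: Callable[[List[str]], Tuple[str, int]],
    reflect: Callable[[str], str],
    max_trials: int = 5,
) -> Tuple[bool, List[str]]:
    """Retry a task, storing one verbal reflection per failed trial."""
    memory: List[str] = []  # episodic memory; no weights are updated
    for _ in range(max_trials):
        # The attempt sees all prior reflections and returns a trajectory
        # plus a binary reward from the environment (1 or 0).
        trajectory, reward = attempt(memory)
        if reward == 1:
            return True, memory
        # On failure, diagnose the trajectory in natural language.
        memory.append(reflect(trajectory))
    return False, memory


# Toy stand-ins (assumptions for illustration only).
def toy_attempt(memory: List[str]) -> Tuple[str, int]:
    if any("key" in r for r in memory):
        return "took key, opened door", 1
    return "pushed door, door is locked", 0


def toy_reflect(trajectory: str) -> str:
    return "The door was locked; find the key first."


solved, reflections = run_trials(toy_attempt, toy_reflect)
```

In the toy run, the first trial fails, one reflection is stored, and the second trial succeeds because that reflection is visible to the attempt.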

The key distinction from internal self-revision: Reflexion's reflection is grounded in actual environmental outcomes, not the model's assessment of its own outputs. This is why it works where internal self-assessment does not. The environment provides an independent ground truth the model cannot rationalize away.

A second reason Reflexion works, visible only in 2025 hindsight: Reflexion writes reflections to episodic memory and retrieves them in subsequent episodes. It does not periodically recompress its reflections into more abstract lessons. Late-2025 evidence makes this design choice load-bearing: "Does agent memory degrade when continuously consolidated?" shows that LLM-driven consolidation regresses below the no-memory baseline on controlled benchmarks, and "Why do LLM agents ignore condensed experience summaries?" shows that agents systematically ignore abstracted memory even when it is the only memory provided. Reflexion sidesteps both failure modes because each reflection stays scoped to its triggering episode rather than being merged into a global summary, and because reflections retain enough textual specificity for the agent to use them as raw episodes rather than as condensed heuristics. The architectural simplicity that initially looked like a limitation (no consolidation step, no abstraction pass) turns out to be the property that makes it work.

AgentFly M-MDP formalization (2508.16153): AgentFly extends episodic memory-based learning into a formal RL framework, the Memory-augmented Markov Decision Process (M-MDP). The agent stores past trajectories (successes and failures) in three specialized memory modules: case memory (vectorized prior trajectories with Q-values for retrieval), subtask memory (active tasks and results), and tool memory (per-subtask tool interaction logs). Credit assignment occurs via memory rewriting (updating case labels and Q-values based on outcomes), and policy improvement occurs via memory reading (retrieving relevant cases shifts the planning distribution). The Q-function over cases provides a principled retrieval policy that improves with experience, moving beyond Reflexion's simpler similarity-based episodic retrieval toward learned case selection. AgentFly achieves top-1 on GAIA validation (87.88% Pass@3) in the deep research setting, demonstrating that memory-based RL can match or exceed fine-tuning-based approaches. See "Can agents learn continuously from experience without updating weights?"
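The read/rewrite cycle over case memory can be sketched in a few lines. This is an illustration under stated assumptions, not AgentFly's actual retrieval rule: the names `Case` and `CaseMemory`, the additive score (cosine similarity plus a weighted Q-value), and the incremental Q update are all hypothetical choices made to show how outcomes can reshape retrieval without touching model weights.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Case:
    embedding: Sequence[float]  # vectorized prior trajectory
    plan: str                   # what the agent did in that case
    q_value: float = 0.0        # running estimate of the case's usefulness


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class CaseMemory:
    def __init__(self, lr: float = 0.5) -> None:
        self.cases: List[Case] = []
        self.lr = lr

    def write(self, case: Case) -> None:
        self.cases.append(case)

    def read(self, query: Sequence[float], k: int = 1,
             q_weight: float = 1.0) -> List[Case]:
        # Memory reading: rank by similarity plus learned value
        # (assumed scoring rule, for illustration only).
        def score(c: Case) -> float:
            return cosine(query, c.embedding) + q_weight * c.q_value
        return sorted(self.cases, key=score, reverse=True)[:k]

    def rewrite(self, case: Case, reward: float) -> None:
        # Memory rewriting: move the case's Q-value toward the outcome.
        case.q_value += self.lr * (reward - case.q_value)


mem = CaseMemory()
case_a = Case([1.0, 0.0], "plan A")
case_b = Case([0.9, 0.1], "plan B")
mem.write(case_a)
mem.write(case_b)
first_choice = mem.read([1.0, 0.0])[0].plan   # similarity alone decides
mem.rewrite(case_b, reward=1.0)               # plan B led to success
second_choice = mem.read([1.0, 0.0])[0].plan  # learned value now matters
```

Before any outcome is recorded, the nearest case wins on similarity alone; after a success is credited to the slightly less similar case, it is retrieved first.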

SDPO as the gradient-based analog (2601.20802): Reflexion converts environment feedback into stored verbal reflections used at the next rollout, a memory-update mechanism. Self-Distillation Policy Optimization (SDPO) converts environment feedback into gradient-distilled improvements to the policy weights, a parameter-update mechanism. Both reject the scalar reward as load-bearing; both treat rich environment signal as already containing the teaching; both leverage the model's in-context retrospection capability (Reflexion: explicit verbal reflection on what went wrong; SDPO: the policy conditioned on feedback as self-teacher). The pair frames a design choice: when environment feedback is rich enough to retrospect on, do you store it as episodic memory (Reflexion) or distill it into weights (SDPO)? Storage avoids parameter changes but accumulates context cost; distillation avoids context cost but commits the update to weights. See "Can environment feedback replace scalar rewards in policy learning?"

Inquiring lines that read this note 140

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

How do training priors constrain what context information can override?

How can AI agents autonomously learn and transfer skills across tasks?

How do multi-agent systems achieve genuine cooperation and reasoning?

Does alignment training create blind spots in detecting genuine safety threats?

How do self-generated feedback mechanisms enable effective model learning?

How should agents balance memory condensation to optimize context efficiency?

Why do semantic similarity and task relevance diverge in vector embeddings?

Why does visual similarity retrieval fail for embodied agents?

What memory architectures best support persistent reasoning across extended interactions?

Does self-reflection enable models to reliably correct their errors?

What articulatory information do speech signals carry that text cannot?

What makes multimodal conditioning effective when features are decomposed to the right granularity?

What properties determine whether reward signals teach genuine reasoning?

Why does natural language feedback break performance plateaus that numerical rewards alone cannot?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

Why do continual learning scenarios trigger catastrophic forgetting and interference?

Can AI-generated outputs constitute genuine knowledge or valid claims?

Can humans learn accurate models of AI through repeated interaction without labels?

How should iterative research systems allocate reasoning per search step?

Can step-level rewards improve training of agentic retrieval systems?

What pretraining choices and baseline capability constrain reinforcement learning gains?

Can multi-turn reinforcement learning improve tool use in language models?

Does externalizing cognitive work and state improve agent reliability?

Do language model representations contain causally steerable task-specific features?

What causes gradient-based steering via natural language descriptions to work?

Can prompting inject entirely new knowledge into language models?

What prevents language models from reliably adopting diverse personas?

How do lightweight adapters modify model behavior for personality traits?

Why does self-revision increase model confidence while degrading accuracy?

Why does single-agent self-revision amplify confidence in wrong answers over time?

How should personalization be implemented to improve AI assistant effectiveness?

Why do reward structures fail to shape long-term agent learning?

What structural advantages do diffusion language models offer over autoregressive methods?

Why is reinforcement learning harder to apply to diffusion language models?

How should inference compute be adaptively allocated based on prompt difficulty?

What deployment tradeoffs emerge between single-pass and multi-pass inference adaptation?

How does latent reasoning compare to verbalized chain-of-thought?

Does verbal step-by-step reflection preserve learning signals that abstraction removes?

What memory abstraction level best enables agent knowledge reuse?

Can state-indexed memory retrieval breadth predict gains in web agent robustness?

Does conversational format create illusions of genuine AI communication?

Why do agents show interaction without influence on semantic content but dramatic action changes?

When should tasks involve human-AI partnership versus full automation?

What role does bidirectional model updating play in human-AI understanding?

Can debate mechanisms prevent silent agreement on wrong answers in multi-agent reasoning?

Why does language ambiguity cause premature convergence in multi-agent systems?

How can process reward models supervise complex reasoning traces?

Why do agents confidently report success despite actually failing tasks?

How should dialogue recommender systems manage conversation history and state?

How does treating conversation as a resource change what models learn to do?

How should memory consolidation strategies shape agent performance over time?

Do language models learn genuine linguistic structure or just surface patterns?

Can language models learn to diversify their discourse-level narrative patterns over time?

Why does consolidated memory sometimes degrade agent performance?

How do training data properties shape reasoning capability development?

Why does semantic similarity retrieval enable skill transfer to novel situations?

Why does finetuning cause catastrophic forgetting of model capabilities?

What determines success in training models on multiple tasks?

How do transformers stitch together learned behaviors when adapting to new tasks?

How do we evaluate AI systems when user perception misleads actual performance?

Can AI systems improve themselves without external feedback?

Can alternative training methods improve on supervised fine-tuning for language models?

How does SDPO relate to agents learning from verbal reflection without parameter updates?

What drives capability and cost efficiency in agent systems?

How does test-time aggregation affect reasoning correctness and reliability?

What makes consensus games work without retraining the base model?

How do knowledge injection methods compare across cost and effectiveness?

When does training a memory model beat RAG or fine-tuning?

Do harness improvements transfer across model scales or memorize shortcuts?

Why do evolved harness edits mostly memorize rather than generalize?

Related concepts in this collection 5

Does agent memory degrade when continuously consolidated? Can consolidating agent experiences into summaries actually harm long-term performance? Research on ARC-AGI tasks suggests continuous memory updates may reduce capability below the no-memory baseline.
late-2025 empirical case for why Reflexion's non-consolidation of reflections is the load-bearing design choice, not the reflection itself
Why do LLM agents ignore condensed experience summaries? LLM agents faithfully learn from raw experience but systematically disregard condensed summaries of the same experience. This study investigates whether the problem lies in how summaries are made, how models process them, or whether models simply don't need them.
convergent finding: Reflexion's raw-episodic reflections survive the faithfulness asymmetry that ignores abstracted lessons
Does revising your own reasoning actually help or hurt? Self-revision in reasoning models often degrades accuracy, while external critique improves it. Understanding what makes revision helpful or harmful could reshape how we design systems that need to correct themselves.
Reflexion is the working prototype of this principle: environment = external critic, binary reward = unambiguous signal
Do models fail worse when their own errors fill the context? As a model's prior mistakes accumulate in context, does subsequent accuracy degrade predictably? And can scaling or architectural changes prevent this self-contamination effect?
Reflexion works against this: episodic memory provides targeted failure analysis rather than accumulating raw error history that amplifies future errors
Does a model improve by arguing with itself? When models revise their own reasoning in response to self-generated criticism, do they converge on better answers or worse ones? And how does that compare to challenge from other models?
Reflexion is an architectural solution to degeneration-of-thought: by grounding reflection in binary environmental outcomes rather than self-assessment, it avoids the pattern where internal self-revision amplifies confidence in wrong answers
