Can agents learn better from their failures than successes?
Does storing reasoning strategies extracted from both successful and failed experiences improve agent learning compared to tracking only successes or raw trajectories? This matters because failures offer preventative lessons that successes alone cannot teach.
ReasoningBank (2509.25140) departs from prior agent-memory work along two axes at once. First, it stores strategy-level reasoning hints rather than reusable workflows, instance-level concepts, or raw trajectories. Second, it draws those strategies from both successful and failed experiences, with outcomes judged by the agent itself rather than by ground-truth labels. The combination matters because each axis on its own underperforms the joint version.
The strategy-level abstraction is what differentiates it from agent-workflow-memory approaches, which store procedural sequences. A reusable workflow says "to find a place's zip code, first search by name, then extract location, then look up zip." A strategy says "when an entity attribute is requested, identify which lookup primitive returns it most directly; chain only when a single primitive cannot suffice." Strategies generalize across tasks; workflows generalize across instances of the same task.
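The workflow-versus-strategy distinction can be sketched as two data structures. This is a minimal illustration, not ReasoningBank's or AWM's actual schema: the class names, the `applies_to` method, the `trigger` field, and the task feature `requests_entity_attribute` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical task description: a plain dict of features.
Task = dict


@dataclass(frozen=True)
class Workflow:
    """Procedural memory: a fixed step sequence bound to one task type."""
    task_type: str
    steps: tuple

    def applies_to(self, task: Task) -> bool:
        # A workflow is reusable only for instances of the same task.
        return task["type"] == self.task_type


@dataclass(frozen=True)
class Strategy:
    """Strategy-level memory: a trigger condition plus a principle, no fixed steps."""
    principle: str
    trigger: Callable[[Task], bool]

    def applies_to(self, task: Task) -> bool:
        # A strategy applies wherever its condition holds, whatever the task type.
        return self.trigger(task)


zip_workflow = Workflow(
    task_type="zip_code_lookup",
    steps=("search by name", "extract location", "look up zip"),
)
attribute_strategy = Strategy(
    principle=("Identify which lookup primitive returns the attribute most "
               "directly; chain only when a single primitive cannot suffice."),
    trigger=lambda task: task.get("requests_entity_attribute", False),
)

zip_task = {"type": "zip_code_lookup", "requests_entity_attribute": True}
phone_task = {"type": "phone_number_lookup", "requests_entity_attribute": True}

print(zip_workflow.applies_to(phone_task))        # workflow does not transfer
print(attribute_strategy.applies_to(phone_task))  # strategy does
```

The workflow matches only the zip-code task, while the strategy also matches the phone-number task because its condition, not its task type, decides applicability.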
Including failures is what differentiates it from systems that store only successful trajectories. Failed experiences contribute preventative lessons: records of strategies that look promising but fail under specific conditions. The agent abstracts both successes and failures into actionable principles. This addresses a known gap: success-only memory teaches what worked but never what to avoid.
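A small sketch of the gap between success-only memory and memory that also keeps failures. Everything here is an assumption for illustration: the `MemoryItem` fields, the `include_failures` switch, and especially the keyword-overlap retrieval, which stands in for whatever retrieval a real system would use.

```python
import re
from dataclasses import dataclass, field


@dataclass
class MemoryItem:
    title: str
    content: str
    outcome: str  # "success" or "failure": the kind of experience it came from


def _tokens(text: str) -> set:
    # Lowercased alphabetic tokens; a deliberately crude stand-in for embeddings.
    return set(re.findall(r"[a-z]+", text.lower()))


@dataclass
class StrategyBank:
    include_failures: bool = True
    items: list = field(default_factory=list)

    def add(self, item: MemoryItem) -> bool:
        # A success-only bank silently drops lessons learned from failures.
        if item.outcome == "failure" and not self.include_failures:
            return False
        self.items.append(item)
        return True

    def retrieve(self, query: str, k: int = 2) -> list:
        q = _tokens(query)
        scored = [(len(q & _tokens(it.title + " " + it.content)), it)
                  for it in self.items]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [it for score, it in ranked if score > 0][:k]


lessons = [
    MemoryItem("Prefer direct lookup",
               "Use the lookup that returns the requested attribute directly.",
               "success"),
    MemoryItem("Avoid stale search results",
               "Do not trust the first search result when the page looks outdated.",
               "failure"),
]

full_bank = StrategyBank(include_failures=True)
success_only_bank = StrategyBank(include_failures=False)
for bank in (full_bank, success_only_bank):
    for lesson in lessons:
        bank.add(lesson)

query = "search result looks outdated"
print([it.title for it in full_bank.retrieve(query)])
print([it.title for it in success_only_bank.retrieve(query)])
```

With this toy data, only the bank that kept the failure-derived item has anything to say when the agent meets the risky situation again.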
The deeper finding is memory-aware test-time scaling (MaTTS), which works as a feedback loop. Scaling test-time compute generates more rollouts per task; more rollouts yield more diverse experiences; diverse experiences provide richer contrastive signals for distilling higher-quality memory; and better memory steers subsequent scaling toward more promising rollouts. Memory and compute compound rather than substitute for each other. This is a different scaling law from the parameter scaling law: accuracy improves with cumulative interaction history, not just with one-time training compute.
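The loop described above can be sketched structurally. This is not the paper's algorithm: `matts_loop` and the three `toy_*` stubs are hypothetical placeholders with arbitrary deterministic behavior, chosen only to show where rollouts, self-judgment, contrast, and memory feed into one another. It makes no claim about accuracy.

```python
def matts_loop(tasks, rollout, judge, distill, k):
    """Run k rollouts per task and fold the distilled lessons back into memory."""
    memory = []
    for task in tasks:
        # Scaling step: k rollouts for one task, each conditioned on current memory.
        trajectories = [rollout(task, list(memory), seed) for seed in range(k)]
        # Self-judged outcomes split the rollouts into a contrast set.
        wins = [t for t in trajectories if judge(task, t)]
        losses = [t for t in trajectories if not judge(task, t)]
        # Distilled strategies are available to every later task.
        memory.extend(distill(task, wins, losses))
    return memory


def toy_rollout(task, memory, seed):
    # Placeholder: a real rollout would be an agent run that reads memory hints.
    return {"task": task, "seed": seed, "hints_seen": len(memory)}


def toy_judge(task, trajectory):
    # Placeholder rule (assumption): even seeds count as successes.
    return trajectory["seed"] % 2 == 0


def toy_distill(task, wins, losses):
    # Contrastive lessons need at least one success and one failure.
    if wins and losses:
        return [f"[contrastive] {task}: keep path {wins[0]['seed']}, "
                f"avoid path {losses[0]['seed']}"]
    if wins:
        return [f"[plain] {task}: path {wins[0]['seed']} worked"]
    return []


tasks = ["task_a", "task_b"]
memory_k1 = matts_loop(tasks, toy_rollout, toy_judge, toy_distill, k=1)
memory_k3 = matts_loop(tasks, toy_rollout, toy_judge, toy_distill, k=3)
print(memory_k1)
print(memory_k3)
```

With a single rollout per task there is nothing to contrast, so the stub can only record what worked. With three rollouts, each task produces both a success and a failure, and the distilled item carries an "avoid" clause as well.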
The implicit theory: agents become more capable not by accumulating data but by accumulating judged distinctions. The self-judgment step is doing the work. ReasoningBank can label its own successes and failures because the agent has access to task-grounded signals (did the search return useful results? did the action achieve the subtask?), so labels emerge from interaction rather than from annotation. This makes the approach scalable in deployment, not just in training.
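A minimal sketch of labeling a trajectory from task-grounded signals alone. The `Step` fields mirror the two example questions above, but the decision rule and its 0.5 threshold are assumptions made for illustration; in practice the judgment would more likely come from the agent's own model than from a fixed rule.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str
    returned_results: bool  # did the search or action return something usable?
    subtask_done: bool      # did the action achieve its subtask?


def self_judge(steps, min_useful_ratio=0.5):
    """Label a trajectory without ground truth (hypothetical rule)."""
    if not steps:
        return "failure"
    finished = steps[-1].subtask_done
    useful_ratio = sum(s.returned_results for s in steps) / len(steps)
    # Assumption: success needs a completed final subtask and mostly useful steps.
    if finished and useful_ratio >= min_useful_ratio:
        return "success"
    return "failure"


clean_run = [
    Step("search for entity", True, False),
    Step("open top result", True, True),
]
lucky_run = [
    Step("search for entity", False, False),
    Step("click unrelated link", False, True),
]
print(self_judge(clean_run))
print(self_judge(lucky_run))
```

No annotator appears anywhere in this sketch: the label is derived entirely from signals the agent observes while acting, which is what lets labeling continue in deployment.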
The result reframes the relationship between memory and inference compute. Prior work treated them as separate dimensions; ReasoningBank shows they are coupled, and their coupling is itself a scaling axis.
Inquiring lines that use this note as a source (21)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- What makes diverse failure modes more informative than single failure examples?
- Why do agents report success when they have actually failed at tasks?
- What happens when agents interact with environments and learn from their own mistakes?
- Do agents prefer raw experience over condensed summaries of past actions?
- How can agents learn when silence is better than intervention?
- Can agents learn to distinguish helpful from misleading interventions?
- What failure modes do imitation and outcome methods each address?
- What specific qualities make some demonstrations more effective for agency training?
- How do agents learn to report success on actions that actually failed?
- Why do successful and failed trajectories need different memory processing?
- How do agents decide when to pause and reflect on their strategy?
- What makes preventative lessons from failures more valuable than success patterns?
- When should agents stop recursing to optimize success versus cost?
- How should safety systems catch confident failures from agents that report success on unsafe actions?
- How does completion bias in agents differ from other epistemic failure modes?
- How do failure examples improve distillation compared to successful trajectories alone?
- How do agents decide when to stop and reflect on failure?
- Why does negative experience transfer better than positive examples alone?
- How do agent teams use shared failures to reduce redundant exploration?
- What hidden signals in agent logs reveal about frontier capability beyond pass-fail outcomes?
- Should we train the evolver or the executor when building self-improving agents?
Related concepts in this collection (4)
- Can agents learn reusable sub-task routines from past experience?
  Do web agents fail at long-horizon tasks because they cannot extract and reuse workflows shared across similar problems? This explores whether sub-task abstraction enables skill accumulation rather than task-by-task problem solving.
  AWM stores procedural workflows; ReasoningBank abstracts higher to strategies that span tasks
- Can frozen language models continually improve through memory structure alone?
  If agents can't update parameters, what form of textual memory lets them keep learning across trials and transfer to new tasks without retraining?
  CLIN stores causal abstractions; ReasoningBank's strategy abstractions are a strategic cousin operating without environment-specific causal structure
- Can agents learn from failure without updating their weights?
  Explores whether language models can improve through trial and error by storing reflections in episodic memory rather than fine-tuning. This matters because it suggests a fundamentally different path to agent adaptation.
  Reflexion uses raw episodic reflection; ReasoningBank distills across episodes into transferable strategies
- Does agent memory degrade when continuously consolidated?
  Can consolidating agent experiences into summaries actually harm long-term performance? Research on ARC-AGI tasks suggests continuous memory updates may reduce capability below the no-memory baseline.
  direct tension: ReasoningBank claims consolidation works when done over strategies-with-conditions; faulty-memory paper shows consolidation regresses below baseline; resolution may be in *what* gets abstracted
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory
- RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents
- AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs
- Useful Memories Become Faulty When Continuously Updated by LLMs
- Can Large Language Models Reason and Optimize Under Constraints?
- SkillClaw: Let Skills Evolve Collectively with Agentic Evolver
- Large Language Model Agents Are Not Always Faithful Self-Evolvers
- rStar2-Agent: Agentic Reasoning Technical Report
Original note title
distilling reasoning strategies from both successes and failures outperforms raw trajectories — and creates synergy with test-time scaling