SYNTHESIS NOTE

Should successful and failed episodes be processed differently?

Explores whether asymmetric treatment of trajectories—preserving successes as full demonstrations while abstracting failures into lessons—could improve both the utility and efficiency of memory in reinforcement learning agents.

Synthesis note · 2026-05-18 · sourced from Reinforcement Learning

Existing memory-based RL methods primarily store raw trajectories. Raw trajectories are token-heavy and noise-saturated; storing them indiscriminately produces context pollution that degrades policy improvement. The alternative — uniform abstraction across all trajectories — destroys the specificity that makes the experience useful.

SkillRL (2602.08234) introduces differential processing as the load-bearing architectural choice. Successful episodes are preserved as full demonstrations — their specific action sequences are exactly what should be reused. Failed episodes are synthesized into concise failure lessons — the specifics of what went wrong don't transfer, but the abstracted lesson does. The asymmetry mirrors how human experts treat experience: remember concrete successes vividly, generalize failures into rules.
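The routing described above can be sketched in a few lines. This is a minimal illustration, not SkillRL's implementation: the names `Trajectory`, `Demonstration`, `FailureLesson`, `process`, and `last_step_summarizer` are hypothetical, and the summarizer is a trivial stand-in for whatever model-written lesson synthesis the real system uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Trajectory:
    task: str
    steps: List[str]   # ordered action/observation strings
    success: bool


@dataclass
class Demonstration:
    task: str
    steps: List[str]   # kept verbatim for reuse


@dataclass
class FailureLesson:
    task: str
    lesson: str        # short abstracted rule, specifics dropped


MemoryEntry = Union[Demonstration, FailureLesson]


def process(traj: Trajectory, summarize: Callable[[Trajectory], str]) -> MemoryEntry:
    """Route by outcome: keep successes raw, abstract failures into a lesson."""
    if traj.success:
        return Demonstration(task=traj.task, steps=list(traj.steps))
    return FailureLesson(task=traj.task, lesson=summarize(traj))


def last_step_summarizer(traj: Trajectory) -> str:
    # Hypothetical placeholder; a real system would likely ask a model
    # to write the lesson from the whole failed trajectory.
    last = traj.steps[-1] if traj.steps else "no action"
    return f"On '{traj.task}', avoid ending with: {last}"
```

The point of the sketch is only the branch on `success`: the two outcomes produce different record types, so downstream retrieval can treat them differently.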

The two trajectory types feed a hierarchical SkillBank, partitioned into general skills (universal strategic guidance) and task-specific skills (task-level heuristics). The skill library co-evolves with the agent's policy through recursive failure analysis — each new RL iteration both refines the policy and updates the skill library based on what worked and what didn't.
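One way to picture the two-level partition is a small container with a shared pool and a per-task pool. The sketch below is an assumption-laden illustration: the class layout, method names, and the `update_from_iteration` hook are hypothetical and say nothing about how SkillRL actually stores, ranks, or retrieves skills.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class SkillBank:
    # Universal strategic guidance, shared by every task.
    general: List[str] = field(default_factory=list)
    # Task-level heuristics, keyed by a task-type label.
    task_specific: Dict[str, List[str]] = field(default_factory=dict)

    def add_general(self, skill: str) -> None:
        if skill not in self.general:          # skip exact duplicates
            self.general.append(skill)

    def add_task_skill(self, task_type: str, skill: str) -> None:
        skills = self.task_specific.setdefault(task_type, [])
        if skill not in skills:
            skills.append(skill)

    def retrieve(self, task_type: str) -> List[str]:
        """General skills first, then skills for this task type only."""
        return self.general + self.task_specific.get(task_type, [])

    def update_from_iteration(
        self,
        task_type: str,
        new_general: Iterable[str],
        new_task_skills: Iterable[str],
    ) -> None:
        """Hypothetical hook, called once per RL iteration with skills
        derived from that iteration's successes and failure analysis."""
        for skill in new_general:
            self.add_general(skill)
        for skill in new_task_skills:
            self.add_task_skill(task_type, skill)
```

Calling `update_from_iteration` inside the training loop is what would make the library evolve alongside the policy in this toy version; the real co-evolution mechanism is not specified in this note.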

The differential-processing claim resolves a tension across the agent-memory literature. The note "Does agent memory degrade when continuously consolidated?" shows that uniform consolidation regresses below baseline because the consolidation step strips applicability conditions. SkillRL's asymmetric treatment is the proposed fix: preserve raw episodes where the specifics matter (successes), abstract where they don't (failures-as-lessons). This is the third positive case for the condition-preservation hypothesis, alongside ReasoningBank (strategy-level distillation with conditions) and CLIN (causal abstractions preserving "may be necessary"). See ops/tensions/strategy-distillation helps when applicability conditions survive — and hurts when they are stripped.md.

The conceptual move is that abstraction is the right operation for some trajectory types and the wrong operation for others. Treating all experience the same — uniformly raw OR uniformly abstracted — is the failure mode. The right architecture differentiates by trajectory type, with the differentiation being driven by what each type actually contributes to future decision-making.

Empirically, SkillRL achieves state-of-the-art results on ALFWorld and WebShop while using substantially less context than raw-trajectory-based memory approaches. The compression comes from abstracting failures into lessons; the performance comes from preserving successes as full demonstrations. Both halves of the asymmetry are doing work.
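To see where the context saving would come from, the toy comparison below counts memory size under raw storage versus asymmetric storage. Everything here is an assumption for illustration: whitespace-split word counts are a crude proxy for tokens, and `count_tokens`, `raw_cost`, `asymmetric_cost`, and the example episodes are hypothetical rather than drawn from the paper.

```python
from typing import Callable, List, Tuple

Episode = Tuple[bool, List[str]]  # (success flag, ordered steps)


def count_tokens(text: str) -> int:
    # Crude proxy: whitespace-separated words, not a real tokenizer.
    return len(text.split())


def raw_cost(episodes: List[Episode]) -> int:
    """Context cost when every trajectory is stored verbatim."""
    return sum(count_tokens(" ".join(steps)) for _, steps in episodes)


def asymmetric_cost(
    episodes: List[Episode],
    lesson_for: Callable[[List[str]], str],
) -> int:
    """Context cost when successes stay raw and failures become lessons."""
    total = 0
    for success, steps in episodes:
        if success:
            total += count_tokens(" ".join(steps))
        else:
            total += count_tokens(lesson_for(steps))
    return total


episodes: List[Episode] = [
    (True, ["open fridge", "take egg"]),
    (False, ["open drawer", "take fork", "put fork", "open drawer"]),
]
# The saving comes entirely from the failed episode; the success is untouched.
saving = raw_cost(episodes) - asymmetric_cost(
    episodes, lambda steps: "check the fridge first"
)
```

In this toy case the successful episode costs the same under both schemes, so any reduction is attributable to the failure-abstraction step, which matches the division of labour described above.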

Update (2026-05-28) — the topological expression of the success-side operation. FluxMem (2605.28773, "Rethinking Memory as Continuously Evolving Connectivity") performs the differential-processing principle's success-side step as graph topology rather than a skill library. Its Long-Term Consolidation stage clusters recurring successful trajectories and crystallizes them into stable procedural circuits — high-utility pathways that mature (monitored by a convergence metric) so that recurring tasks bypass redundant retrieval and directly activate the mature subgraph. This is SkillRL's "preserve successes as reusable demonstrations" claim recast on a heterogeneous memory graph: where SkillRL stores successful episodes as full demonstrations in a SkillBank, FluxMem stores them as crystallized connections between co-activated units. The convergence is informative — two independently developed systems land on the same operation (durably encode recurring successes for direct reuse) through different data structures, which strengthens the case that the success/failure asymmetry is a structural requirement of self-evolving agent memory, not an artifact of one architecture.
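The "crystallize recurring successes, then bypass retrieval" idea can be illustrated with a deliberately simplified store. This is not FluxMem's algorithm: exact-match counting stands in for its trajectory clustering, a fixed `maturity_threshold` stands in for its convergence metric, and the class and method names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


class CircuitStore:
    """Toy stand-in for consolidating recurring successful trajectories."""

    def __init__(self, maturity_threshold: int = 3) -> None:
        # Assumption: a simple repeat count approximates "maturity".
        self.maturity_threshold = maturity_threshold
        self._counts: Counter = Counter()
        self._circuits: Dict[str, Tuple[str, ...]] = {}

    def observe_success(self, task: str, steps: List[str]) -> None:
        """Record one successful trajectory; crystallize it once it recurs enough."""
        key = (task, tuple(steps))
        self._counts[key] += 1
        if self._counts[key] >= self.maturity_threshold:
            self._circuits[task] = tuple(steps)

    def lookup(self, task: str) -> Optional[Tuple[str, ...]]:
        """Return the mature circuit for a task, or None if retrieval is still needed."""
        return self._circuits.get(task)
```

A caller would check `lookup` first and fall back to ordinary memory retrieval only when it returns `None`, which is the bypass behaviour the paragraph describes.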

Inquiring lines that read this note 129

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How do self-generated feedback mechanisms enable effective model learning?

How do aggregate reward models systematically exclude minority user preferences?

How should preference channels from historical sessions inform unified policy learning?

How can AI systems learn from failures without cascading errors?

How should dialogue systems best leverage conversation history for retrieval?

Why do abstract semantic memories outperform specific interaction histories for journey discovery?

What determines success in training models on multiple tasks?

What constrains reinforcement learning's ability to expand model reasoning?

How can process reward models supervise complex reasoning traces?

What pretraining choices and baseline capability constrain reinforcement learning gains?

What memory architectures best support persistent reasoning across extended interactions?

Why do agents confidently report success despite actually failing tasks?

Can alternative training methods improve on supervised fine-tuning for language models?

Can ensemble evaluation methods reduce bias more than single judges?

What makes trajectory more actionable than absolute scores for human moderators?

What properties determine whether reward signals teach genuine reasoning?

Does self-reflection enable models to reliably correct their errors?

How does AI adoption affect human skill development and labor equality?

Does AI-assisted performance transfer to independent task completion?

Why does reinforcement learning suppress output diversity compared to supervised fine-tuning?

Does externalizing cognitive work and state improve agent reliability?

How should iterative research systems allocate reasoning per search step?

Can step-level rewards improve training of agentic retrieval systems?

Can language model RL training avoid reward hacking and misalignment?

How can AI agents autonomously learn and transfer skills across tasks?

Why do continual learning scenarios trigger catastrophic forgetting and interference?

Can prompting inject entirely new knowledge into language models?

How does explicit exploratory prompting compare to fine-tuned reinforcement learning for in-context adaptation?

When does optimizing for quality undermine the value of diversity?

Why does entropy-based frame sampling work better than uniform stride selection?

How do LLMs distinguish causal reasoning from temporal and semantic associations?

Is reward propagation in RL formally dual to cause inference in memory?

How does latent reasoning compare to verbalized chain-of-thought?

Do depth thresholds correspond to transitions between procedural and strategic learning?

How should agents balance memory condensation to optimize context efficiency?

Does alignment training create blind spots in detecting genuine safety threats?

What makes behavioral cloning produce more persuadable but less aligned agents?

Does reinforcement learning teach reasoning or just when to reason?

Can models learn both what and how to study through reinforcement learning?

What memory abstraction level best enables agent knowledge reuse?

How can models identify insufficient information and respond appropriately without guessing?

What makes abstention a learnable behavior instead of a default penalty?

Why do reward structures fail to shape long-term agent learning?

How should memory consolidation strategies shape agent performance over time?

Why does consolidated memory sometimes degrade agent performance?

How do multi-agent systems achieve genuine cooperation and reasoning?

Can AI systems balance emotional competence with factual reliability?

How does curriculum learning prevent instability in social-emotional RL training?

How does sequence length affect sparsity tolerance in models?

Could activation sparsity signal task difficulty and guide routing decisions?

What makes weaker teacher models effective for stronger student training?

Why does information asymmetry between teacher and student enable effective feedback learning?

How do policy learning algorithm choices affect multi-objective optimization stability?

Why does gradient discarding limit standard policy clipping?

How do training priors constrain what context information can override?

How should models express uncertainty rather than forced confident answers?

Can agents escape weak belief tracking and conservative action selection traps?

How should retrieval systems optimize for multi-step reasoning during inference?

How does accumulated context history degrade iteration quality in long-horizon tasks?

Why do multi-turn conversations degrade AI intent and coherence?

How does bounded committed state prevent multi-turn agent failures better than transcript replay?

Why do benchmark improvements fail to reflect actual reasoning quality?

Why do task-completion benchmarks miss the competence of knowing when to abstain?

Can single-axis benchmarks accurately predict agent deployment success?

Related concepts in this collection 8

Does agent memory degrade when continuously consolidated? Can consolidating agent experiences into summaries actually harm long-term performance? Research on ARC-AGI tasks suggests continuous memory updates may reduce capability below the no-memory baseline.
diagnoses the failure mode SkillRL's differential processing addresses
Can agents learn better from their failures than successes? Does storing reasoning strategies extracted from both successful and failed experiences improve agent learning compared to tracking only successes or raw trajectories? This matters because failures offer preventative lessons that successes alone cannot teach.
ReasoningBank also distills from successes AND failures but treats both as strategies; SkillRL treats successes as demonstrations (raw) and failures as lessons (abstracted) — same idea applied with different granularity
Can frozen language models continually improve through memory structure alone? If agents can't update parameters, what form of textual memory lets them keep learning across trials and transfer to new tasks without retraining?
CLIN preserves applicability conditions via causal form; SkillRL preserves them by treating success-trajectories as raw
Can agents learn reusable sub-task routines from past experience? Do web agents fail at long-horizon tasks because they cannot extract and reuse workflows shared across similar problems? This explores whether sub-task abstraction enables skill accumulation rather than task-by-task problem solving.
AWM compounds workflows; SkillRL compounds skills hierarchically — same compositional principle, asymmetric trajectory processing as added axis
Can agents learn new skills without forgetting old ones? Explores whether externalized skill libraries—storing learned behaviors as retrievable code rather than parameter updates—can solve the catastrophic forgetting problem that plagues continual learning systems.
VOYAGER is the predecessor; SkillRL adds the success-failure asymmetry and online RL refinement
Can a separate trained curator improve skill libraries better than frozen agents? Explores whether decoupling skill curation from agent execution enables better long-term learning of what skills to keep, delete, or refine. Matters because manual curation doesn't scale and heuristic approaches lack feedback.
SkillOS is the complementary axis: SkillRL differentiates *what gets stored* (success demos vs failure lessons); SkillOS differentiates *who learns from the storage* (curator vs executor). SkillRL's asymmetric trajectory processing is a candidate ingredient inside SkillOS's curator
Can agents adapt without pausing service to users? Can deployed LLM agents continuously improve their capabilities while serving users without interruption? This explores whether fast behavioral updates and slow policy learning can coexist across different timescales.
MetaClaw decomposes adaptation across timescales using SkillRL-like failure-distillation as its fast-timescale mechanism; MetaClaw's contribution is adding the slow-timescale weight-update channel
Does creating skills inside the agent loop eliminate mismatches? Can coupling skill creation directly to the runtime reasoning loop—rather than authoring skills offline—close the gap between when skills are made and when they're used? This matters for whether agents can ground new capabilities in their actual situated context.
synthesizes: both ground skills in the agent's own situated trajectory rather than out-of-loop authoring, here via in-loop creation, there via differential trajectory processing
