Should successful and failed episodes be processed differently?
Explores whether asymmetric treatment of trajectories—preserving successes as full demonstrations while abstracting failures into lessons—could improve both the utility and efficiency of memory in reinforcement learning agents.
Existing memory-based RL methods primarily store raw trajectories. Raw trajectories are token-heavy and noise-saturated; storing them indiscriminately produces context pollution that degrades policy improvement. The alternative — uniform abstraction across all trajectories — destroys the specificity that makes the experience useful.
SkillRL (2602.08234) introduces differential processing as the load-bearing architectural choice. Successful episodes are preserved as full demonstrations — their specific action sequences are exactly what should be reused. Failed episodes are synthesized into concise failure lessons — the specifics of what went wrong don't transfer, but the abstracted lesson does. The asymmetry mirrors how human experts treat experience: remember concrete successes vividly, generalize failures into rules.
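The routing described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not SkillRL's implementation: the `Episode`, `Demonstration`, and `FailureLesson` types, the `process_episode` function, and the `summarize_failure` stand-in (which a real system would presumably replace with an LLM call) are hypothetical names introduced here, and the example tasks are made up.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Episode:
    task: str
    actions: List[str]
    success: bool


@dataclass
class Demonstration:
    """A successful episode kept verbatim, action by action."""
    task: str
    actions: List[str]


@dataclass
class FailureLesson:
    """A failed episode reduced to one short lesson."""
    task: str
    lesson: str


def summarize_failure(episode: Episode) -> str:
    # Hypothetical stand-in for an LLM-based summarizer: it only names the
    # last action taken before the episode ended in failure.
    last = episode.actions[-1] if episode.actions else "no action"
    return f"On '{episode.task}', avoid ending with: {last}"


def process_episode(
    episode: Episode,
    summarize: Callable[[Episode], str] = summarize_failure,
) -> Union[Demonstration, FailureLesson]:
    """Route an episode by outcome: keep successes raw, abstract failures."""
    if episode.success:
        return Demonstration(task=episode.task, actions=list(episode.actions))
    return FailureLesson(task=episode.task, lesson=summarize(episode))


# Made-up episodes for illustration only.
ok = Episode("heat the mug", ["take mug", "open microwave", "heat mug"], True)
bad = Episode("heat the mug", ["take mug", "go to fridge", "cool mug"], False)

demo = process_episode(ok)
lesson = process_episode(bad)
```

The point of the sketch is the branch in `process_episode`: the success keeps its full action sequence, while the failure keeps only a short string and drops the action sequence entirely.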
The two trajectory types feed a hierarchical SkillBank, partitioned into general skills (universal strategic guidance) and task-specific skills (task-level heuristics). The skill library co-evolves with the agent's policy through recursive failure analysis — each new RL iteration both refines the policy and updates the skill library based on what worked and what didn't.
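One way to picture the two-level partition is a small data structure with a shared pool and per-task pools. The sketch below is an assumption about shape only: the `SkillBank` class here, its method names, the task-type keys, and the example skill strings are all hypothetical and are not taken from the SkillRL paper.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SkillBank:
    """Two-level skill store: general skills plus per-task skills."""
    general: List[str] = field(default_factory=list)
    task_specific: Dict[str, List[str]] = field(
        default_factory=lambda: defaultdict(list)
    )

    def add_general(self, skill: str) -> None:
        # Skip exact duplicates so repeated iterations do not bloat the bank.
        if skill not in self.general:
            self.general.append(skill)

    def add_task_skill(self, task_type: str, skill: str) -> None:
        if skill not in self.task_specific[task_type]:
            self.task_specific[task_type].append(skill)

    def retrieve(self, task_type: str) -> List[str]:
        """Return general skills first, then skills for this task type."""
        return self.general + list(self.task_specific.get(task_type, []))


# Made-up skills for illustration only.
bank = SkillBank()
bank.add_general("Check the goal state before acting.")
bank.add_task_skill("webshop", "Filter by price before opening a product page.")
bank.add_task_skill("alfworld", "Open a container before searching inside it.")

webshop_context = bank.retrieve("webshop")
unknown_context = bank.retrieve("unseen_task")
```

In this sketch an unseen task type still receives the general skills, which is one plausible reading of "universal strategic guidance"; how the real system retrieves and updates skills during RL iterations is not specified in this note.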
The differential-processing claim resolves a tension across the agent-memory literature. The note "Does agent memory degrade when continuously consolidated?" shows that uniform consolidation regresses below the no-memory baseline because the consolidation step strips applicability conditions. SkillRL's asymmetric treatment is the proposed fix: preserve raw episodes where the specifics matter (successes), abstract where they don't (failures-as-lessons). This is the third positive case for the condition-preservation hypothesis, alongside ReasoningBank (strategy-level distillation with conditions) and CLIN (causal abstractions preserving "may be necessary"). See ops/tensions/strategy-distillation helps when applicability conditions survive — and hurts when they are stripped.md.
The conceptual move is that abstraction is the right operation for some trajectory types and the wrong operation for others. Treating all experience the same — uniformly raw OR uniformly abstracted — is the failure mode. The right architecture differentiates by trajectory type, with the differentiation being driven by what each type actually contributes to future decision-making.
Empirically, SkillRL achieves state-of-the-art results on ALFWorld and WebShop while using substantially less context than raw-trajectory-based memory approaches. The compression comes from abstracting failures into lessons; the performance comes from preserving successes as demonstrations. Both halves of the asymmetry are doing work.
Update (2026-05-28) — the topological expression of the success-side operation. FluxMem (2605.28773, "Rethinking Memory as Continuously Evolving Connectivity") performs the differential-processing principle's success-side step as graph topology rather than a skill library. Its Long-Term Consolidation stage clusters recurring successful trajectories and crystallizes them into stable procedural circuits — high-utility pathways that mature (monitored by a convergence metric) so that recurring tasks bypass redundant retrieval and directly activate the mature subgraph. This is SkillRL's "preserve successes as reusable demonstrations" claim recast on a heterogeneous memory graph: where SkillRL stores successful episodes as full demonstrations in a SkillBank, FluxMem stores them as crystallized connections between co-activated units. The convergence is informative — two independently developed systems land on the same operation (durably encode recurring successes for direct reuse) through different data structures, which strengthens the case that the success/failure asymmetry is a structural requirement of self-evolving agent memory, not an artifact of one architecture.
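A toy version of "crystallizing" recurring successes into a directly reusable path might look like the following. It is a sketch under loud assumptions and not FluxMem's algorithm: the `CircuitMemory` class, the use of a task name as the lookup key in place of trajectory clustering, and the count-based maturity rule standing in for the convergence metric are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional


class CircuitMemory:
    """Toy graph memory: edges strengthen when successes reuse them."""

    def __init__(self, maturity_threshold: int = 3) -> None:
        # Assumed maturity rule: a path is mature once every one of its
        # edges has been traversed by at least this many successes.
        self.maturity_threshold = maturity_threshold
        self.edge_counts: Counter = Counter()
        self.paths: Dict[str, List[str]] = {}

    def record_success(self, task: str, units: List[str]) -> None:
        """Strengthen each edge between consecutive units of a success."""
        for edge in zip(units, units[1:]):
            self.edge_counts[edge] += 1
        self.paths[task] = list(units)

    def is_mature(self, task: str) -> bool:
        units = self.paths.get(task)
        if not units or len(units) < 2:
            return False
        return all(
            self.edge_counts[edge] >= self.maturity_threshold
            for edge in zip(units, units[1:])
        )

    def activate(self, task: str) -> Optional[List[str]]:
        """Return the stored path directly if mature, else None."""
        return list(self.paths[task]) if self.is_mature(task) else None


# Made-up trajectory for illustration only.
memory = CircuitMemory(maturity_threshold=3)
path = ["find item", "compare prices", "buy"]
memory.record_success("purchase", path)
memory.record_success("purchase", path)
early = memory.activate("purchase")  # not yet mature after two successes
memory.record_success("purchase", path)
late = memory.activate("purchase")  # mature after the third success
```

The design choice worth noticing is that `activate` returns the stored path without any search once the path is mature, which mirrors the "bypass redundant retrieval" behaviour described above; before maturity the caller would fall back to ordinary retrieval.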
Inquiring lines that use this note as a source (110)
This note is a source for these synthesized inquiries.
- Can unified policies handle negative feedback and critique transformation simultaneously?
- How should preference channels from historical sessions inform unified policy learning?
- What status categories best represent user goal progress without penalizing external failures?
- Why do abstract semantic memories outperform specific interaction histories for journey discovery?
- Does task superposition explain how models learn from multiple in-context trajectories?
- Can checklist-based rewards fix judgment problems in RL training?
- What behavioral changes occur during reward learning training?
- How do outcome and process rewards differ in their treatment of intermediate steps?
- What role does natural language play in breaking reinforcement learning performance plateaus?
- Why does storing past judgments in memory make current evaluations worse?
- How do intrinsic motivation principles explain why generating novel challenges improves learning?
- Why do agents report success when they have actually failed at tasks?
- Can importance sampling reduce variance in off-policy reward estimation?
- What makes trajectory more actionable than absolute scores for human moderators?
- Why does natural language feedback break performance plateaus that numerical rewards alone cannot?
- Does task ordering affect multi-task reinforcement learning outcomes?
- How do developmental curriculums emerge from learning progress signals?
- Can population diversity in self-improvement prevent error avalanching failures?
- What breaks when you apply reinforcement learning after supervised fine-tuning?
- How do implicit world models and self-reflection operationalize consequence-based learning?
- How should learning environments balance error prevention with pedagogical value?
- Do outcome-only reward signals miss step-level errors that compound later?
- Can episodic and semantic memory improve long-horizon task reasoning?
- Does AI-assisted performance transfer to independent task completion?
- How does forced exploration through diversity rewards differ from suppression-based negative reinforcement?
- What makes session-aware multi-turn tracking necessary for asynchronous training?
- What training difficulty and curriculum settings prevent instability in empathetic agent RL?
- Can AI outputs inspire new directions even when they seem like failures?
- Can step-level rewards improve training of agentic retrieval systems?
- How does process-focused feedback compare to outcome-focused feedback in skill training?
- How does modularity in reward and policy design enable goal generalization?
- Can combinational creativity alone drive open-ended learning in agents?
- Why do memory and feedback loops matter more than model size for agent reliability?
- How does dual-rate learning separate episodic and procedural memory in neural networks?
- How does explicit exploratory prompting compare to fine-tuned reinforcement learning for in-context adaptation?
- Why does entropy-based frame sampling work better than uniform stride selection?
- Can episodic memory alone enable learning without parameter updates?
- Is reward propagation in RL formally dual to cause inference in memory?
- Do depth thresholds correspond to transitions between procedural and strategic learning?
- How do retrieved memories differ from decision-context passages for prediction?
- How do loss functions simultaneously shape both learning and decision quality?
- Can episodic memory of UI traces improve open-world agent adaptation?
- Can negative feedback through critiques achieve the same steering flexibility as positive preferences?
- How can agents learn when silence is better than intervention?
- Can negative reinforcement alone match full RL performance on domain tasks?
- What makes behavioral cloning produce more persuadable but less aligned agents?
- Can models learn both what and how to study through reinforcement learning?
- What happens when error accumulation and preference signal collapse occur together?
- What details do high-level trajectory abstractions lose that state-grounded recall preserves?
- What makes abstention a learnable behavior instead of a default penalty?
- Can multi-turn aware rewards improve alignment beyond single-turn helpfulness?
- How should trajectory-aware PRMs weight backtracking and planning sentences?
- What deployment modes work best for trajectory-aware reward signals?
- Does environment stochasticity force models to generalize better across trajectory variations?
- Why do completion-mode strengths not transfer to agentic settings?
- How do delayed effects complicate causal attribution in agent systems?
- What failure modes do imitation and outcome methods each address?
- Can tool-call advantage attribution distinguish between correct and incorrect calls in mixed trajectories?
- Why do sparse outcome rewards fail to credit correct tool calls in failed trajectories?
- How do complete multi-turn trajectories differ from isolated task examples?
- How do agents learn to report success on actions that actually failed?
- What training objectives could reduce completion bias in autonomous agents?
- How do trajectory quality and memory hygiene differ as evaluation metrics?
- Can agents compress long trajectories without losing critical decision context?
- Why do successful and failed trajectories need different memory processing?
- Can memory consolidation fragility be detected and reversed during execution?
- Can individual skills improve through reuse and accumulate experience across tasks?
- Can applicability conditions be preserved automatically when agents reflect on trials?
- Can binary judge feedback replace external reward signals for skill learning?
- Can neural modules memorize surprising tokens as adaptive long-term memory?
- What makes memory consolidation fragile compared to raw trajectory storage?
- Why do standard process reward models struggle with branching reasoning traces?
- What makes preventative lessons from failures more valuable than success patterns?
- How does memory folding enable agents to reconsider strategies mid-task?
- When should agents stop recursing to optimize success versus cost?
- How do prior errors in context history amplify future mistakes in long tasks?
- Can offline recurrent passes replicate sleep-based memory consolidation in AI?
- Can in-context reinforcement learning match human sample efficiency on real problems?
- What behavioral differences emerge from symmetric versus asymmetric peer discussion loops?
- What distinguishes working memory from strategic memory in agent task execution?
- Why do current metacognitive training loops fail when agents encounter new domains?
- How does curriculum learning prevent instability in social-emotional RL training?
- Why do single-turn RL methods fail to generalize to multi-turn tasks?
- How should multi-objective post-training balance competing behavioral goals?
- How do you extract reward signals when all rollouts fail?
- Can graph topology represent successful trajectory clusters more effectively than skill libraries?
- What drives the choice between storing raw episodes versus abstracted rules?
- How do complementary learning systems explain the need for fast and slow consolidation?
- Could activation sparsity signal task difficulty and guide routing decisions?
- How does SDPO relate to agents learning from verbal reflection without parameter updates?
- How does in-context feedback integration differ from learned reward signals?
- How do failure examples improve distillation compared to successful trajectories alone?
- Can early experience replace external rewards as a learning signal?
- How do prior errors in context history amplify future failures over time?
- Why does information asymmetry between teacher and student enable effective feedback learning?
- Does careful reward engineering matter if pretraining determines RLVR effectiveness?
- Can held-out validation gates prevent optimizer hallucinations in skill proposals?
- How can a forgetting policy preserve rare knowledge while preventing over-generalization?
- Why does gradient discarding limit standard policy clipping?
- How do agents decide when to stop and reflect on failure?
- What can agents learn from the brain's complementary learning systems?
- What makes trajectory quality matter more than one-shot task success?
- What makes knowledge seeding equivalent to hippocampal replay in the brain?
- Why does negative experience transfer better than positive examples alone?
- How should agents compress episodic interactions into working memory without accumulation?
- What hidden signals in agent logs reveal about frontier capability beyond pass-fail outcomes?
- Can trajectory structure replace hand-annotated process reward models entirely?
- How does active selection of training content differ from random reinforcement sampling?
- What makes content informative and not-yet-mastered for reinforcement during pretraining?
- Can agents escape weak belief tracking and conservative action selection traps?
Related concepts in this collection (8)
- Does agent memory degrade when continuously consolidated?
  Can consolidating agent experiences into summaries actually harm long-term performance? Research on ARC-AGI tasks suggests continuous memory updates may reduce capability below the no-memory baseline.
  diagnoses the failure mode SkillRL's differential processing addresses
- Can agents learn better from their failures than successes?
  Does storing reasoning strategies extracted from both successful and failed experiences improve agent learning compared to tracking only successes or raw trajectories? This matters because failures offer preventative lessons that successes alone cannot teach.
  ReasoningBank also distills from successes AND failures but treats both as strategies; SkillRL treats successes as demonstrations (raw) and failures as lessons (abstracted): same idea applied with different granularity
- Can frozen language models continually improve through memory structure alone?
  If agents can't update parameters, what form of textual memory lets them keep learning across trials and transfer to new tasks without retraining?
  CLIN preserves applicability conditions via causal form; SkillRL preserves them by treating success-trajectories as raw
- Can agents learn reusable sub-task routines from past experience?
  Do web agents fail at long-horizon tasks because they cannot extract and reuse workflows shared across similar problems? This explores whether sub-task abstraction enables skill accumulation rather than task-by-task problem solving.
  AWM compounds workflows; SkillRL compounds skills hierarchically: same compositional principle, asymmetric trajectory processing as added axis
- Can agents learn new skills without forgetting old ones?
  Explores whether externalized skill libraries (storing learned behaviors as retrievable code rather than parameter updates) can solve the catastrophic forgetting problem that plagues continual learning systems.
  VOYAGER is the predecessor; SkillRL adds the success-failure asymmetry and online RL refinement
- Can a separate trained curator improve skill libraries better than frozen agents?
  Explores whether decoupling skill curation from agent execution enables better long-term learning of what skills to keep, delete, or refine. Matters because manual curation doesn't scale and heuristic approaches lack feedback.
  SkillOS is the complementary axis: SkillRL differentiates *what gets stored* (success demos vs failure lessons); SkillOS differentiates *who learns from the storage* (curator vs executor). SkillRL's asymmetric trajectory processing is a candidate ingredient inside SkillOS's curator
- Can agents adapt without pausing service to users?
  Can deployed LLM agents continuously improve their capabilities while serving users without interruption? This explores whether fast behavioral updates and slow policy learning can coexist across different timescales.
  MetaClaw decomposes adaptation across timescales using SkillRL-like failure-distillation as its fast-timescale mechanism; MetaClaw's contribution is adding the slow-timescale weight-update channel
- Does creating skills inside the agent loop eliminate mismatches?
  Can coupling skill creation directly to the runtime reasoning loop, rather than authoring skills offline, close the gap between when skills are made and when they're used? This matters for whether agents can ground new capabilities in their actual situated context.
  synthesizes: both ground skills in the agent's own situated trajectory rather than out-of-loop authoring, here via in-loop creation, there via differential trajectory processing
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents
- AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs
- ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory
- SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning
- Useful Memories Become Faulty When Continuously Updated by LLMs
- Self-distillation Enables Continual Learning
- Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs
- Artifacts as Memory Beyond the Agent Boundary
Original note title
recursive skill-augmented RL applies differential processing to trajectories — successful episodes preserved as demonstrations while failures distilled into concise lessons