INQUIRING LINE

AI agents learn more by filing wins as templates to copy and losses as extracted lessons — not logging both the same way.

Why do successful and failed trajectories need different memory processing?

This explores why agents that learn from their own experience seem to get more out of treating wins and losses asymmetrically — storing successes in one form, failures in another, rather than logging both the same way. The corpus converges on a surprisingly consistent answer: a success and a failure carry different *kinds* of information, so compressing them identically throws away what each is actually good for.

The clearest statement is SkillRL's, which keeps successful episodes as concrete demonstrations (replay this, it worked) while distilling failures into abstracted lessons rather than verbatim transcripts (see "Should successful and failed episodes be processed differently?"). The logic is that a success is valuable as a *template* you can imitate, whereas a failure is valuable only as a *warning*, and a warning doesn't need the full play-by-play, just the principle that prevents the repeat. ReasoningBank reaches the same place from the memory side: storing strategy-level hints extracted from both successes *and* failures beats success-only memory and beats hoarding raw trajectories, because the useful residue of a failure is a strategy, not a recording (see "Can agents learn better from their failures than successes?").
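The template-versus-warning split can be sketched as a small memory store. This is a minimal illustration, not SkillRL's or ReasoningBank's actual implementation: the `Episode` and `AsymmetricMemory` names, their fields, and the assumption that a one-line `lesson` is already available for each failure are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    task: str
    steps: list[str]
    succeeded: bool
    lesson: str = ""  # hypothetical: a one-line diagnosis written after a failure


@dataclass
class AsymmetricMemory:
    demonstrations: list[dict] = field(default_factory=list)  # wins, kept concrete
    lessons: list[dict] = field(default_factory=list)         # losses, kept abstract

    def consolidate(self, ep: Episode) -> None:
        if ep.succeeded:
            # A success is a template: keep the full step sequence for imitation.
            self.demonstrations.append({"task": ep.task, "steps": list(ep.steps)})
        else:
            # A failure is a warning: keep only the principle, drop the transcript.
            self.lessons.append({"task": ep.task, "lesson": ep.lesson})


memory = AsymmetricMemory()
memory.consolidate(Episode("book flight", ["search", "select", "pay"], True))
memory.consolidate(
    Episode("book flight", ["pay", "search"], False,
            lesson="Confirm the itinerary before paying.")
)
```

The design choice worth noticing is that the failed episode's `steps` never reach storage, so there is nothing demonstration-shaped for a later prompt to imitate.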

There's a second, sharper reason failures can't be stored like successes: failures contaminate. The self-conditioning work shows that when a model's own prior errors sit in its context, performance degrades non-linearly: the model starts imitating its mistakes, and scaling the model doesn't fix it (see "Do models fail worse when their own errors fill the context?"). So a failure left in raw, demonstration-shaped form is actively dangerous; it has to be transformed into something that teaches without modeling the bad behavior. Reflexion's trick is exactly this transformation. It converts a binary success/failure signal into a verbal self-diagnosis stored as episodic memory, and crucially keeps those reflections uncompressed so they stay usable, while the *trajectory* itself isn't what's replayed (see "Can agents learn from failure without updating their weights?").
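One way to picture that transformation is a loop in which a failed attempt contributes a written lesson to the next prompt while its raw steps are discarded. This is a sketch under stated assumptions, not Reflexion's code: `record_attempt`, `build_prompt`, and `toy_diagnose` are hypothetical names, and `toy_diagnose` stands in for what would be a model-written self-diagnosis.

```python
from typing import Callable


def record_attempt(
    trajectory: list[str],
    succeeded: bool,
    diagnose: Callable[[list[str]], str],
    reflections: list[str],
) -> None:
    """Turn a failed attempt into a verbal lesson; the raw trajectory is not stored."""
    if not succeeded:
        reflections.append(diagnose(trajectory))


def build_prompt(task: str, reflections: list[str]) -> str:
    """Assemble the next attempt's context from lessons only, never from failed steps."""
    lines = [f"Task: {task}"]
    if reflections:
        lines.append("Lessons from earlier attempts:")
        lines.extend(f"- {r}" for r in reflections)
    return "\n".join(lines)


def toy_diagnose(trajectory: list[str]) -> str:
    # Stand-in for a model-written self-diagnosis (an assumption of this sketch).
    return f"Failed after {len(trajectory)} steps; check preconditions before acting."


reflections: list[str] = []
record_attempt(["open drawer", "cut table"], False, toy_diagnose, reflections)
prompt = build_prompt("slice the bread", reflections)
```

Each reflection is appended in full rather than summarized, which mirrors the point about keeping reflections uncompressed; the erroneous step `"cut table"` never appears in `prompt`.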

The asymmetry also shows up on the training side, not just memory. GRPO-RoC filters *positive* trajectories hard for quality, keeping only the clean ones, while *preserving diverse failures* as negative signal, which let a 14B model reach frontier math performance (see "Why do correct code trajectories teach models to tolerate errors?"). If you filtered both symmetrically you'd lose the very diversity that makes failures informative. ReasonFlux-PRM makes the complementary point for reasoning traces: a failed step inside a thinking trace is often *informative exploration*, not an error to be penalized, so process-reward models that treat all deviation as wrong degrade badly (see "Why do standard process reward models fail on thinking traces?").
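The filtering asymmetry can be illustrated with a toy rollout filter. This is a simplified sketch of the idea described above, not GRPO-RoC's actual procedure: the `Rollout` type, the use of a tool-error count as the quality proxy, and the `max_tool_errors` threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    answer_correct: bool
    tool_errors: int  # hypothetical quality proxy: failed tool calls in the trace


def asymmetric_filter(rollouts: list[Rollout], max_tool_errors: int = 0) -> list[Rollout]:
    """Keep only clean successes, but keep every failure regardless of how messy it is."""
    positives = [
        r for r in rollouts
        if r.answer_correct and r.tool_errors <= max_tool_errors
    ]
    negatives = [r for r in rollouts if not r.answer_correct]
    return positives + negatives


rollouts = [
    Rollout(answer_correct=True, tool_errors=0),   # clean success: kept
    Rollout(answer_correct=True, tool_errors=3),   # sloppy success: dropped
    Rollout(answer_correct=False, tool_errors=0),  # failure: kept
    Rollout(answer_correct=False, tool_errors=5),  # failure: kept
]
kept = asymmetric_filter(rollouts)
```

Applying the same threshold to both groups would discard the second failure and with it part of the diversity that the negative signal depends on.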

What ties it together, and what you might not have expected, is that this mirrors how human experts actually reason: you remember a few successful moves vividly and concretely, but your failures collapse into general principles ("don't open with that"), not frame-by-frame memories. The machine result is that uniform consolidation isn't just inefficient; it's a category error. The deeper hint from the memory-as-substrate work is that adaptive intelligence may *be* the structured, selective reuse of past inference, where what you choose to keep concrete versus abstract is the whole game (see "Can cognition work by reusing memory instead of recomputing?").

Sources 7 notes

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Can agents learn better from their failures than successes?

ReasoningBank shows that storing strategy-level reasoning hints from both self-judged successes and failures outperforms success-only memory and raw trajectory storage. Coupled with test-time scaling, memory and compute compound rather than substitute, creating a novel scaling law where accuracy improves through cumulative interaction history.

Do models fail worse when their own errors fill the context?

Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.

Why do correct code trajectories teach models to tolerate errors?

GRPO-RoC filters positive trajectories for quality while preserving diverse failures as negative signal, allowing a 14B model to reach frontier math performance in 510 RL steps and to surpass much larger models while producing cleaner reasoning.

Why do standard process reward models fail on thinking traces?

Standard PRMs degrade on trajectory format because thinking traces include branching, backtracking, and weaker coherence than polished responses. ReasonFlux-PRM addresses this by supervising both trajectories and responses, treating failed steps as informative exploration rather than errors.

Can cognition work by reusing memory instead of recomputing?

Memory-Amortized Inference proposes intelligence arises from structured reuse of prior inference paths over topological memory, inverting RL's reward-forward logic into cause-backward reconstruction. This duality explains energy efficiency and suggests memory trajectories form the substrate of adaptive thought.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Useful Memories Become Faulty When Continuously Updated by LLMs · 3.39 match
AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs · 2.57 match
RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents · 2.53 match
ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory · 1.74 match
ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs · 1.72 match
The AI Hippocampus: How Far are We From Human Memory? · 1.69 match
rStar2-Agent: Agentic Reasoning Technical Report · 1.68 match
SkillClaw: Let Skills Evolve Collectively with Agentic Evolver · 1.68 match

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst auditing claims about asymmetric memory processing in agentic RL and reasoning systems. The question remains open: *why* and *when* do agents benefit from storing successes and failures differently?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat these as period snapshots:
• SkillRL (2025): successful episodes stored as concrete demonstrations; failures compressed into abstracted lessons only. Template vs. warning (SkillRL, ~2025).
• ReasoningBank (2025–09): strategy-level hints extracted from both successes *and* failures outperform success-only memory; raw trajectories are wasteful (ReasoningBank, 2025–09).
• Self-conditioning effect: a model's own prior errors placed in context degrade performance non-linearly; raw failures actively contaminate via imitation (cited ~2024–2025).
• Reflexion mechanism: binary success/failure → verbal self-diagnosis stored episodic, uncompressed; trajectory itself not replayed (Reflexion, 2023).
• GRPO-RoC (2025): filter positive trajectories for quality; preserve diverse failures as negative signal. Asymmetric filtering, not symmetric (GRPO-RoC, ~2025).
• ReasonFlux-PRM (2025–06): process-reward models treating all deviation as error degrade; failed steps in thinking traces are often informative exploration, not penalties.
• Memory-amortized inference (2025–08): cognition as navigation over constrained latent structures; what you keep concrete vs. abstract is the substrate of adaptation (Beyond Turing, 2025–08).

Anchor papers (verify; mind their dates):
• arXiv:2509.25140 ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory (2025–09)
• arXiv:2506.18896 ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs (2025–06)
• arXiv:2508.14143 Beyond Turing: Memory-Amortized Inference as a Foundation for Cognitive Computation (2025–08)
• arXiv:2605.28773 Rethinking Memory as Continuously Evolving Connectivity (2026–05)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above — asymmetric storage, self-conditioning contamination, uncompressed reflection, asymmetric trajectory filtering, process-reward deviation handling — judge whether newer models (frontier LLMs, post-2026 agentic systems), improved memory substrates (KV caches, learned routing, dynamic attention), orchestration (multi-agent handoff, hierarchical memory), or evaluation harnesses have since RELAXED or OVERTURNED it. Separate the durable question (likely: *when* asymmetry helps vs. when uniform storage suffices?) from perishable limitations (e.g., "self-conditioning breaks performance" may be soluble via isolation, masking, or architectural change). Cite what resolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Look for papers claiming symmetric processing works equally well, unified memory schemes, or evidence that failure storage as raw trajectory is safe under certain conditions (e.g., large-scale supervised finetuning, post-hoc pruning, or retrieval ranking).
(3) Propose 2 research questions that ASSUME the regime may have moved:
   — Does the asymmetry remain necessary as context windows expand and retrieval becomes cheaper?
   — Can a learnable routing policy (e.g., attention-based selection) replace hand-coded asymmetry, and if so, does it recover the same asymmetric structure?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
