INQUIRING LINE

An AI trained only on successes never learns where it's about to go wrong — failures fill that gap.

How do failure examples improve distillation compared to successful trajectories alone?

This explores why teaching a model from its mistakes — not just its wins — produces a stronger student, and what specifically failures add that clean successful trajectories can't.

This reads the question as being about distillation in the broad sense — transferring reasoning from teacher (or past attempts) to student — and asking what failures contribute that a diet of correct trajectories alone leaves out. The short version the corpus keeps circling: successes teach you the move, failures teach you the boundary, and a student trained only inside the boundary doesn't know where it is.

The most direct evidence is that failures and successes carry *different kinds* of information and should be processed differently. ReasoningBank shows that storing strategy-level hints from both self-judged wins and losses beats success-only memory and beats dumping raw trajectories (see "Can agents learn better from their failures than successes?"). SkillRL makes the asymmetry explicit: keep successes as concrete demonstrations, but abstract failures into lessons, because uniformly consolidating both actually degrades learning (see "Should successful and failed episodes be processed differently?"). And GRPO-RoC filters positive trajectories hard for quality while deliberately *preserving* diverse failures as negative signal. That asymmetry is what let a 14B model reach frontier math performance, because positives accepted on final-answer correctness alone quietly teach the model to tolerate the errors hiding inside otherwise-correct code traces (see "Why do correct code trajectories teach models to tolerate errors?").
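As a rough illustration of that asymmetric treatment, here is a minimal sketch of a selection step that filters positives strictly and samples negatives uniformly. It is not GRPO-RoC's actual procedure: the `Rollout` fields, the use of `tool_errors` as a quality proxy, and the choice to preserve the group's positive/negative ratio are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Rollout:
    text: str
    correct: bool     # verdict of an outcome verifier (assumed to exist)
    tool_errors: int  # hypothetical quality proxy: failed tool calls in the trace


def select_asymmetric(rollouts, keep, seed=0):
    """Keep up to `keep` rollouts: cleanest positives, uniformly sampled negatives."""
    rng = random.Random(seed)
    positives = [r for r in rollouts if r.correct]
    negatives = [r for r in rollouts if not r.correct]

    # Assumption: keep roughly the same positive/negative ratio as the full group.
    n_pos = round(keep * len(positives) / len(rollouts)) if rollouts else 0
    n_pos = min(n_pos, len(positives))
    n_neg = min(keep - n_pos, len(negatives))

    # Positives: strict quality filter, fewest tool errors first.
    kept_pos = sorted(positives, key=lambda r: r.tool_errors)[:n_pos]
    # Negatives: no quality filter, uniform sample so failure modes stay diverse.
    kept_neg = rng.sample(negatives, n_neg)
    return kept_pos, kept_neg


group = [
    Rollout("p0", True, 0), Rollout("p1", True, 3),
    Rollout("p2", True, 1), Rollout("p3", True, 5),
    Rollout("n0", False, 2), Rollout("n1", False, 0),
    Rollout("n2", False, 4), Rollout("n3", False, 1),
]
kept_pos, kept_neg = select_asymmetric(group, keep=4)
```

The point of the sketch is only the asymmetry: the correct-but-messy traces `p1` and `p3` are dropped, while failures are kept regardless of how messy they are.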

Why do clean trajectories alone fall short? Because the messy parts (the wrong turns, the backtracking, the hesitation) are themselves the skill being transferred. Stream of Search pretraining on full search processes, mistakes included, scored 25% higher accuracy than training on optimal trajectories only; the model learns to explore and recover rather than to recite a fixed path (see "Does training on messy search processes improve reasoning?"). The flip side is a warning: self-distillation that polishes traces into confident brevity strips out the "Wait" and "Hmm" tokens that flag a flawed path, and removing those uncertainty markers wrecks robustness on out-of-distribution problems (see "Does self-distillation harm mathematical reasoning performance?"). A richer, answer-conditioned teacher produces exactly these overconfident short traces: strong in-domain, brittle outside it (see "Does richer teacher context hurt student generalization?"). Failures are where the model learns epistemic caution; sand them away and you get a fluent student that can't tell when it's wrong.
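To make the contrast between a full search trace and a solution-only trace concrete, here is a toy sketch that serializes a depth-first search, dead ends and backtracking included, into a single string. The graph, the log wording, and the `search_trace` helper are hypothetical; Stream of Search's actual task and string format are not reproduced here.

```python
def search_trace(graph, start, goal):
    """Depth-first search that logs every step, including dead ends."""
    log, path, seen = [], [], set()

    def dfs(node):
        path.append(node)
        seen.add(node)
        log.append(f"visit {node}")
        if node == goal:
            log.append("goal reached")
            return True
        for nxt in graph.get(node, []):
            if nxt not in seen and dfs(nxt):
                return True
        # Nothing below this node worked: record the retreat explicitly.
        log.append(f"dead end at {node}, backtrack")
        path.pop()
        return False

    found = dfs(start)
    return found, list(path), " | ".join(log)


# Toy graph (an assumption for illustration): the first branch is a dead end.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}
found, solution_path, full_trace = search_trace(graph, "A", "E")

# Two candidate training targets for the same problem:
solution_only_target = " -> ".join(solution_path)  # just the final path
messy_target = full_trace                          # the whole search, mistakes included
```

Only `messy_target` contains the failed excursion through `B` and `D` and the decision to back out of it, which is the exploration-and-recovery behavior the paragraph above describes.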

The interesting catch, the thing you might not have known to ask, is that not all failures are equal. Failures only help when the model could plausibly have succeeded. Training on near-impossible problems backfires: group-relative normalization treats rare accidental wins as high-advantage, and the model learns degenerate shortcuts that then contaminate skills it already had (see "Do overly hard RLVR samples actually harm model capabilities?"). So the value of a failure example is conditional on its being *recoverable* and *legible*. The systems that win don't just include failures; they route each one through a decision: what does this teach, and is it teachable? AutoResearchClaw's pivot-or-refine loop turns every failure into a structured next-attempt signal rather than a dead end (see "Can experiment failures drive progress instead of stopping it?").
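The group-relative normalization effect can be checked with a little arithmetic. The sketch below assumes binary rewards, a group of 16 rollouts per prompt, and the common formulation in which each reward is centered on the group mean and divided by the group's population standard deviation; real implementations differ in details such as epsilon terms or sample versus population variance.

```python
import math


def group_relative_advantages(rewards):
    """Center each reward on the group mean and scale by the group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    if std == 0.0:
        # All rollouts scored the same: no learning signal from this group.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


# Near-impossible problem: 1 accidental success out of 16 rollouts.
hard_adv = group_relative_advantages([1.0] + [0.0] * 15)
# Moderately hard problem: 8 successes out of 16.
moderate_adv = group_relative_advantages([1.0] * 8 + [0.0] * 8)

lucky_win = hard_adv[0]        # advantage of the lone success
typical_win = moderate_adv[0]  # advantage of a success on the balanced group
```

With a success rate p, the advantage of a success works out to sqrt((1 - p) / p), so the lone win at p = 1/16 receives an advantage of about 3.87, against 1.0 at p = 1/2. Under these assumptions a single lucky trajectory on a near-impossible problem is reinforced almost four times as strongly as an ordinary success, whatever shortcut produced it.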

So failures improve distillation along three axes successes can't cover: they mark the boundaries of competence, they preserve the exploration-and-recovery behavior that is the actual reasoning skill, and they keep the uncertainty signals that let a student self-correct off-distribution. The discipline is asymmetry — distill successes as demonstrations, failures as abstracted lessons, and discard the failures that were never winnable in the first place.
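One way to picture that discipline is as a triage step in front of the distillation dataset. The following is a minimal sketch, not any cited system's pipeline: `Episode`, `route`, the per-task `pass_rate` table, the `min_pass_rate` threshold, and the `summarize` callback (standing in for whatever turns a failed trace into a lesson) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Episode:
    task_id: str
    trace: str
    success: bool


def route(
    episodes: List[Episode],
    pass_rate: Dict[str, float],
    summarize: Callable[[str], str],
    min_pass_rate: float = 0.05,
) -> Tuple[List[str], List[str], List[str]]:
    """Split episodes into demonstrations, abstracted lessons, and discards."""
    demos, lessons, dropped = [], [], []
    for ep in episodes:
        if ep.success:
            # Successes are kept verbatim as concrete demonstrations.
            demos.append(ep.trace)
        elif pass_rate.get(ep.task_id, 0.0) >= min_pass_rate:
            # Recoverable failure: keep only the abstracted lesson.
            lessons.append(summarize(ep.trace))
        else:
            # The task was (almost) never solved: treat the failure as unwinnable.
            dropped.append(ep.task_id)
    return demos, lessons, dropped


episodes = [
    Episode("t1", "solved by checking the edge case first", True),
    Episode("t1", "skipped the edge case and got the wrong answer", False),
    Episode("t2", "guessed an answer with no derivation", False),
]
demos, lessons, dropped = route(
    episodes,
    pass_rate={"t1": 0.5, "t2": 0.0},
    summarize=lambda trace: "lesson: " + trace,
)
```

Using the observed pass rate as the test for "winnable" is only one possible proxy; the threshold value here is arbitrary.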

Sources 8 notes

Can agents learn better from their failures than successes?

ReasoningBank shows that storing strategy-level reasoning hints from both self-judged successes and failures outperforms success-only memory and raw trajectory storage. Coupled with test-time scaling, memory and compute compound rather than substitute, creating a novel scaling law where accuracy improves through cumulative interaction history.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Why do correct code trajectories teach models to tolerate errors?

GRPO-RoC filters positive trajectories for quality while preserving diverse failures as negative signal, allowing a 14B model to reach frontier math performance in 510 RL steps and to surpass much larger models while producing cleaner reasoning.

Does training on messy search processes improve reasoning?

Stream of Search pretraining, which represents exploration and backtracking as serialized strings, achieves 25% higher accuracy than optimal-trajectory-only training. Models learn internal world models for search and adaptive strategies rather than fixed external methods.

Does self-distillation harm mathematical reasoning performance?

Self-distillation reduces performance in mathematical reasoning by eliminating epistemic markers like "Wait" and "Hmm" tokens that flag flawed reasoning paths. These tokens enable self-correction on out-of-distribution problems, so removing them sacrifices robustness for confident brevity.

Does richer teacher context hurt student generalization?

Teachers conditioned on correct answers and verifier output produce confident, concise traces that students inherit. This style suppresses uncertainty expression, optimizing in-domain performance while degrading generalization to out-of-distribution problems that require epistemic caution.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Can experiment failures drive progress instead of stopping it?

AutoResearchClaw's pivot-or-refine loop routes every failure through a decision process, making failure inform the next attempt rather than stop execution. Component ablation shows this mechanism drives completion and is distinct from reasoning or verification.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents (3.37 match)
Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs (2.49 match)
Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning (2.43 match)
ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory (1.74 match)
Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs? (1.74 match)
AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs (1.69 match)
rStar2-Agent: Agentic Reasoning Technical Report (1.68 match)
Useful Memories Become Faulty When Continuously Updated by LLMs (1.68 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning systems analyst. The durable question: **what information do failure examples carry that success-only training leaves out, and how should that asymmetry shape distillation design?**

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat each as a snapshot, not current ground truth.

• Failures and successes encode *different* information and require asymmetric processing: ReasoningBank (2025-09) shows strategy-level hints from both wins and losses outperform success-only memory; SkillRL (2025-09) explicitly abstracts failures into lessons while keeping successes as concrete demos, because uniform consolidation degrades learning.

• Exploration-and-recovery behavior embedded in messy trajectories is the actual reasoning skill. Stream of Search (2024-04) trained on full search including mistakes scored 25% higher than optimal-only training; self-distillation that polishes traces into confident brevity strips uncertainty markers ("Wait", "Hmm"), wrecking robustness on OOD problems (2026-03).

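• Not all failures help equally. Work on sample difficulty in RLVR ("Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs") shows overly hard samples induce degenerate shortcuts; failures only improve learning when recoverable and legible, meaning the model could plausibly have succeeded.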

• Structured routing matters. AutoResearchClaw (2026-05) treats experiment failures as pivot-or-refine signals rather than dead ends, turning each failure into a teachable next-attempt.

Anchor papers (verify; mind their dates):
• arXiv:2404.03683 Stream of Search (2024-04)
• arXiv:2509.25140 ReasoningBank (2025-09)
• arXiv:2603.24472 Why Does Self-Distillation Degrade Reasoning? (2026-03)
• arXiv:2605.20025 AutoResearchClaw (2026-05)

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For the claim that failures teach boundaries while successes teach moves: has emergence of newer model scales, in-context learning, or oracular verifiers (e.g., process reward models, outcome verifiers) since mid-2026 *relaxed* the need for explicit failure distillation? Do frontier models now extract boundary information from success traces alone, or do they still require negative signal? Test the 25% gain from exploration-inclusive pretraining — does it hold on current benchmarks, or has curriculum learning or better sampling made raw noise a wash? Separately: does the uncertainty-stripping finding still hold, or have recent tokenization/training methods (e.g., reasoning tokens, latent spaces) preserved epistemic signals without explicit "Hmm" tokens?

(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months.** Look for papers claiming success-only distillation now suffices (via better filtering, bigger teachers, or architectural changes), or arguing that failure distillation introduces more noise than signal under modern RL setups. Flag any evidence that the asymmetry story is overstated.

(3) **Propose 2 research questions that ASSUME the regime may have moved:**
   - Does adversarial or hard-negative mining (e.g., in-distribution near-misses) now obsolete the need to preserve "messy" search traces, since curated negatives are more efficient?
   - In agentic or multi-step settings, does failure-driven refinement (pivot-or-refine) still outweigh end-to-end RL on success-only trajectories, or have better credit assignment and long-horizon RL eroded the gap?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
