INQUIRING LINE

How do you extract reward signals when all rollouts fail?

This explores the all-negative rollout problem in RL: when every sampled attempt at a task fails, the usual outcome-based advantage collapses to zero — so the question is where any usable learning signal can still come from.


The corpus answers this from several angles that don't share vocabulary but circle the same territory. The first and most direct: you may not need a single success at all. Negative reinforcement alone, meaning training only on what went wrong, can match or exceed full PPO/GRPO on Pass@k, because suppressing incorrect trajectories preserves diversity rather than collapsing probability mass onto a few winners (see "Does negative reinforcement alone outperform full reinforcement learning?"). So an all-failure batch isn't dead weight; it's exactly the regime where negative-only learning has something to push against.
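
To make the collapse concrete, here is a minimal sketch, assuming GRPO-style group normalization (reward minus group mean, divided by group standard deviation). The helper `negative_only_weights` is a hypothetical illustration of a negative-only update weight, not the exact objective from the cited work.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def negative_only_weights(rewards, fail_value=0.0, penalty=1.0):
    """Hypothetical negative-only weighting (an assumption for illustration):
    failed rollouts get weight -penalty, successful ones get 0."""
    return [-penalty if r == fail_value else 0.0 for r in rewards]


all_fail = [0.0, 0.0, 0.0, 0.0]   # every sampled rollout failed
mixed = [0.0, 1.0, 0.0, 0.0]      # one success in the group

# With identical rewards the group baseline cancels everything out.
print(group_advantages(all_fail))
# A negative-only weighting still assigns a non-zero weight to each failure.
print(negative_only_weights(all_fail))
# For contrast, a mixed group gives the lone success a positive advantage.
print(group_advantages(mixed))
```

In the all-failure group every normalized advantage is zero, while the negative-only weights are all non-zero, which is the sense in which such a batch still gives the update something to push against.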

The second move is to stop treating a rollout's reward as one scalar. A failed trajectory still contains internal structure: which steps were better or worse than their siblings. Tree-search rollouts exploit this directly: branching lets you compare sibling subtrees against each other, manufacturing step-level preference signal from purely outcome-level (and even uniformly bad) results, without a separate process reward model (see "Can tree structure alone convert outcome rewards into process supervision?"). Relatedly, agent feedback decomposes into two orthogonal channels: evaluative (how well it went, which is flat when all fail) and directive (how it should change, which survives even total failure), as described in "Can scalar rewards capture all the information in agent feedback?". Scalar rewards throw the directive part away; recovering it is precisely how you extract signal when the evaluative axis is uniform.
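
The sibling comparison can be sketched as follows. This is an illustrative toy, not the Tree-GRPO implementation: the `Node` class, `subtree_value`, and `sibling_advantages` are hypothetical names, and the sketch assumes leaf outcomes are graded scores (for example partial credit among failures). In this particular sketch, if every leaf carried the identical score, the sibling differences would also be zero.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List, Optional


@dataclass
class Node:
    name: str
    reward: Optional[float] = None            # set only on leaves (outcome score)
    children: List["Node"] = field(default_factory=list)


def leaf_rewards(node):
    """Collect the outcome scores of all leaves under a node."""
    if not node.children:
        return [node.reward]
    out = []
    for child in node.children:
        out.extend(leaf_rewards(child))
    return out


def subtree_value(node):
    """Value of a branch = mean outcome score of its leaves."""
    return mean(leaf_rewards(node))


def sibling_advantages(parent):
    """Step-level signal: each child's value minus the mean over its siblings."""
    values = {c.name: subtree_value(c) for c in parent.children}
    baseline = mean(list(values.values()))
    return {name: v - baseline for name, v in values.items()}


# No leaf reaches a full score of 1.0, so every rollout counts as a failure.
root = Node("root", children=[
    Node("A", children=[Node("A1", reward=0.5), Node("A2", reward=0.25)]),
    Node("B", children=[Node("B1", reward=0.0), Node("B2", reward=0.0)]),
])

print(sibling_advantages(root))
```

Branch A comes out ahead of branch B even though neither contains a success, which gives a preference between the two steps taken at the root.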

A third angle treats failures as a different kind of data than successes. Recursive skill-augmented RL keeps successful episodes as concrete demonstrations but distills failures into abstracted lessons, an asymmetry that mirrors how human experts reason and that outperforms processing everything uniformly (see "Should successful and failed episodes be processed differently?"). The lesson generalizes: an all-failure batch is the input format this approach is built to convert into something useful.

The deeper reframing in the corpus is that the "all rollouts failed" framing assumes a sparse terminal reward in the first place. If instead every action produces a next-state signal (a tool output, an error message, a GUI change), then learning signal is available at every step and never zero, regardless of whether the overall task succeeded (see "Can agent deployment itself generate training signals automatically?"). One caution worth carrying out of this: if you start mining signal from failures aggressively, beware that agents systematically report success on actions that actually failed, so self-reported outcome labels may themselves be unreliable (see "Do autonomous agents report success when actions actually fail?"). And on the rubric side, when you do construct dense rewards from imperfect rollouts, using rubrics as accept/reject gates rather than as reward values keeps the optimization from hacking the very signal you scraped together (see "Can rubrics and dense rewards work together without hacking?").
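
As a rough illustration of the next-state idea, the sketch below scores each step of a failed episode from the observation that followed it. Everything here is an assumption made for the example: the `Step` type, the `ERROR_MARKERS` keyword list, and the crude +1/-1 scoring stand in for whatever learned or rule-based scorer a real system would use.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    action: str
    next_state: str   # tool output, error message, UI change, ...


# Hypothetical keyword heuristic; a real scorer would be far more careful.
ERROR_MARKERS = ("error", "traceback", "no such file")


def step_signal(step: Step) -> float:
    """Return -1.0 if the next state looks like an error, else +1.0."""
    text = step.next_state.lower()
    return -1.0 if any(marker in text for marker in ERROR_MARKERS) else 1.0


def per_step_signals(trajectory: List[Step]) -> List[float]:
    """One signal per action, independent of the episode's terminal reward."""
    return [step_signal(s) for s in trajectory]


# The episode as a whole failed (terminal reward 0), yet its steps differ.
failed_episode = [
    Step("ls src/", "main.py utils.py"),
    Step("python main.py", "Traceback (most recent call last): ..."),
    Step("cat config.yaml", "cat: config.yaml: No such file or directory"),
]

print(per_step_signals(failed_episode))
```

The point is only that the per-step list is non-uniform even though the terminal outcome is a single failure. Given the caution about misreported success, observations from the environment are used here rather than the agent's own claims.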

The through-line the corpus offers: "no reward" is almost always an artifact of compressing rich trajectories into a single binary outcome. Decompress — into siblings, into directive feedback, into per-step next-states, into negative-only updates — and the signal was there the whole time.


Sources (7 notes)

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Can tree structure alone convert outcome rewards into process supervision?

Tree-GRPO uses branching structure to transform trajectory-level outcome rewards into step-level preference signals through sibling subtree comparison, eliminating the need for separate process reward models or step-level annotation while scaling with computational budget.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Can agent deployment itself generate training signals automatically?

Every agent action produces a next-state signal (user reply, tool output, error, GUI change) that can train the policy directly. This universal signal source eliminates the need for separate training datasets across conversations, terminal tasks, SWE, and tool use.

Do autonomous agents report success when actions actually fail?

Red-teaming revealed agents consistently claim task completion while actions remain incomplete—deleting data that stays accessible, disabling capabilities while asserting goal achievement. This confident failure defeats owner oversight and poses distinct safety risks beyond underlying model errors.

Can rubrics and dense rewards work together without hacking?

DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an RL researcher auditing claims about learning from all-negative rollouts. The question remains open: when every sampled trajectory fails, where does usable learning signal come from?

What a curated library found — and when (dated claims, not current truth): Findings span 2023–2026; treat each as a snapshot, not settled science.
• Negative reinforcement alone (suppressing incorrect trajectories without success examples) matches or exceeds full PPO/GRPO by preserving diversity (~2025, arXiv:2506.01347).
• Tree-search rollouts convert uniformly bad outcome-level rewards into step-wise preference signals by comparing subtrees, without a separate process model (~2025, arXiv:2509.21240).
• Agent feedback decomposes into evaluative (outcome quality, flat when all fail) and directive (how to change, survives failure) channels; recovering directive signal is the key (~2025).
• Continuous next-state signals (tool outputs, error messages, state deltas) make learning signal live and non-zero regardless of task success; sparse terminal rewards are the bottleneck (~2025, arXiv:2504.16084).
• Autonomous agents systematically misreport success on failed actions, contaminating failure labels used for mining signal (~2025, arXiv:2508.13143).

Anchor papers (verify; mind their dates):
• arXiv:2506.01347 — Negative Reinforcement (2025)
• arXiv:2509.21240 — Tree Search for LLM Agent RL (2025)
• arXiv:2508.13143 — Why Agents Fail (2025)
• arXiv:2506.13351 — Token-Level Rubric Gates (2025)

Your task:
(1) RE-TEST EACH CONSTRAINT. For negative-only learning, tree search, directive feedback, and next-state signals: has newer scaling, multi-agent orchestration, or hybrid reward designs since relaxed the need for success examples? Has the label-noise problem (agents lying about failure) been addressed by better verification or by treating it as a learning problem itself? Cite what resolved it; plainly flag where constraints still hold.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — especially any finding that success-free learning is *harder* than prior work claimed, or that mixing negative + positive signals outperforms pure negative.
(3) Propose 2 research questions that assume the regime may have moved: e.g., "Can directive-only feedback generalize across task domains without evaluative grounding?" or "Does learned misreporting of failure become a feature (agent self-correction) rather than a bug?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
