INQUIRING LINE


Training an AI to answer in one shot makes it surprisingly bad at tasks that unfold across many exchanges — but why?

Why do single-turn RL methods fail to generalize to multi-turn tasks?

This explores why RL tuned for one-shot answers struggles when a task unfolds over many turns — and the corpus suggests the failure isn't about turn count itself, but about credit assignment, context budgeting, and diversity collapse.

The first surprise in the corpus is that the premise has a big exception: RL *can* scale to long, stateful, multi-step work. A modified DAPO recipe roughly doubled software-engineering performance (20% → 39% on SWE-bench Verified) in environments with delayed rewards and complex feedback, exactly the conditions a single-turn MDP assumption ignores (see "Can reinforcement learning scale beyond single-turn language tasks?"). So the better question is: what specifically breaks when you naively port single-turn methods over, and what had to be fixed to make them work?

The clearest culprit is the reward signal. Single-turn RL leans hard on clean, verifiable, binary rewards, and that's exactly where it shines: performance can jump from under 1% to over 70% when the answer is checkable, while fuzzy judgment-based signals barely move the needle (see "Why does RL succeed more on some tasks than others?"). Multi-turn tasks rarely hand you a crisp end-of-turn verdict; reward is delayed and diffuse across a trajectory. Worse, binary correctness rewards quietly teach overconfidence, since they never punish a confident wrong answer any more than a hesitant one. That pathology compounds across turns, and the fix the corpus points to is adding a calibration term such as the Brier score (see "Does binary reward training hurt model calibration?").

The second culprit is context economics, which simply doesn't exist in single-turn framings. Letting a model reason freely is fine for one answer, but in an iterative loop that reasoning eats the context budget needed for later retrieval rounds, eroding the agent's ability to absorb new evidence. Capping reasoning *per turn*, not just overall, is what preserves multi-turn search quality (see "Does limiting reasoning per turn improve multi-turn search quality?"). Relatedly, not all turns are the same kind of work: RL training tends to move through a two-phase dynamic where execution correctness is the early bottleneck and strategic planning becomes the later one, so a method that only optimizes 'get this step right' never learns the planning that multi-turn success actually hinges on (see "Does RL training follow a predictable two-phase learning sequence?").

The third culprit is diversity collapse. RL pulls policies toward a narrow band of reward-maximizing behavior through entropy collapse, and this isn't unique to reasoning: search agents show the very same squeeze, with SFT on diverse demonstrations needed to keep exploration breadth alive (see "Does reinforcement learning squeeze exploration diversity in search agents?"). In a single turn, narrowing to the best move is mostly fine. Across many turns, exploration is the whole game, so the same mechanism that helps short tasks actively sabotages long ones. Training order can even be tuned to manage this, since structured and open-ended domains push entropy in opposite directions (see "Does training order reshape how models handle different task types?").

What ties this together is that the fixes for multi-turn are all about treating the trajectory as the unit, not the turn. SkillRL, for instance, processes successes and failures asymmetrically, keeping wins as concrete demonstrations and distilling losses into abstract lessons, which is precisely the kind of credit assignment a single-turn objective can't express (see "Should successful and failed episodes be processed differently?"). The thing you didn't expect to learn: single-turn RL doesn't fail because tasks got longer, but because its assumptions about *where reward lives, how context is spent, and how much exploration to keep* all quietly stop holding the moment a task has a second turn.

Sources (8 notes)

Can reinforcement learning scale beyond single-turn language tasks?

Modified DAPO training doubled SWE-bench Verified performance from 20% to 39% on Qwen2.5-72B, matching larger models. This demonstrates RL works in stateful multi-step environments with delayed rewards and complex feedback, beyond theoretical single-turn MDPs.
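To make "delayed rewards" concrete, here is a minimal sketch of how a single end-of-episode reward can be propagated back to earlier turns as a discounted return-to-go. This is a standard RL construction, not the modified DAPO recipe from the note; the function name `returns_to_go` and the discount `gamma` are illustrative assumptions.

```python
def returns_to_go(rewards, gamma=1.0):
    """Discounted return-to-go for each turn of one trajectory.

    rewards: per-turn rewards; in a delayed-reward task most are 0.0.
    gamma:   discount factor (an assumed hyperparameter).
    """
    out = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each turn sees the reward earned after it.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


# Four turns, reward only arrives at the end of the episode.
episode_rewards = [0.0, 0.0, 0.0, 1.0]
credit = returns_to_go(episode_rewards, gamma=0.5)
# credit == [0.125, 0.25, 0.5, 1.0]
```

A single-turn objective only ever sees the last entry; every earlier turn gets credit only once the trajectory is treated as a whole.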

Why does RL succeed more on some tasks than others?

Binary verifiable rewards enable dramatic RL gains (0.15% to 73.98%), while judgment-based evaluation yields modest improvements (55% reduction). Clear reward signals unlock suppressed capabilities; fuzzy signals barely move the needle.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
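A minimal sketch of what such a reward could look like, assuming the model reports a confidence in [0, 1] alongside its answer and that the Brier term is simply subtracted from the correctness reward with equal weight. The note only says the Brier score is added as a second term, so the exact combination and the name `calibrated_reward` are assumptions.

```python
def calibrated_reward(correct: bool, confidence: float) -> float:
    """Binary correctness reward minus a Brier penalty on stated confidence.

    The equal weighting of the two terms is an assumption for illustration.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    y = 1.0 if correct else 0.0
    brier_penalty = (confidence - y) ** 2  # squared error of the confidence
    return y - brier_penalty


# Under a purely binary reward both wrong answers would score 0.
confident_wrong = calibrated_reward(False, 0.9)
hesitant_wrong = calibrated_reward(False, 0.1)
confident_right = calibrated_reward(True, 0.9)
```

With the penalty in place, a confident wrong answer scores well below a hesitant wrong one, so high-confidence guessing is no longer free.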

Does limiting reasoning per turn improve multi-turn search quality?

Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.
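The budgeting effect can be illustrated with a toy accounting model. This sketch assumes a fixed context window, that each turn writes its reasoning before its retrieved evidence, and that nothing is ever evicted; the function `evidence_absorbed` and all the numbers are hypothetical.

```python
def evidence_absorbed(turns, context_limit, per_turn_budget=None):
    """Return how many evidence tokens fit into context on each turn.

    turns: list of (reasoning_tokens, evidence_tokens) counts per turn.
    per_turn_budget: optional cap on reasoning tokens for every turn.
    """
    used = 0
    absorbed = []
    for reasoning, evidence in turns:
        if per_turn_budget is not None:
            reasoning = min(reasoning, per_turn_budget)
        # Reasoning is written into the context before retrieved evidence.
        used += min(reasoning, context_limit - used)
        # Only the evidence that still fits can be absorbed.
        kept = min(evidence, context_limit - used)
        used += kept
        absorbed.append(kept)
    return absorbed


turns = [(400, 200)] * 3  # three turns: 400 reasoning, 200 evidence tokens
uncapped = evidence_absorbed(turns, context_limit=1000)
capped = evidence_absorbed(turns, context_limit=1000, per_turn_budget=100)
# uncapped == [200, 0, 0]; capped == [200, 200, 200]
```

In this toy setting the uncapped agent fills the window by the second turn and absorbs no further evidence, while a per-turn cap leaves room for every retrieval round.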

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.


Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
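One rough way to see this squeeze is to measure the Shannon entropy of the actions an agent actually takes. The sketch below uses empirical action counts as a stand-in for policy entropy, which is an assumption made for illustration; the action names are hypothetical.

```python
import math
from collections import Counter


def action_entropy(actions):
    """Empirical Shannon entropy (in bits) of a list of sampled actions."""
    if not actions:
        raise ValueError("need at least one action")
    total = len(actions)
    counts = Counter(actions)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Before RL: four strategies used equally often.
diverse = ["search", "refine", "browse", "verify"] * 5
# After RL: the policy has converged on one dominant strategy.
collapsed = ["search"] * 19 + ["verify"]

entropy_before = action_entropy(diverse)
entropy_after = action_entropy(collapsed)
```

Here `entropy_before` is 2 bits (uniform over four actions) and `entropy_after` is far lower, which is the kind of drop a diversity-preservation technique would aim to limit.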

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.
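The asymmetry can be sketched as a small memory structure. This is not SkillRL's implementation: the class names, fields, and the `failure_note` string (standing in for whatever abstraction step distills a failed episode into a lesson) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    task: str
    steps: list
    success: bool
    failure_note: str = ""  # assumed to come from a separate distillation step


@dataclass
class SkillMemory:
    demonstrations: list = field(default_factory=list)  # full successful traces
    lessons: list = field(default_factory=list)         # short abstract lessons

    def add(self, episode: Episode) -> None:
        if episode.success:
            # Wins are kept concretely, step by step.
            self.demonstrations.append(
                {"task": episode.task, "steps": list(episode.steps)}
            )
        else:
            # Losses are kept only as a compact lesson; the steps are dropped.
            self.lessons.append(f"{episode.task}: {episode.failure_note}")


memory = SkillMemory()
memory.add(Episode("book flight", ["search", "select", "pay"], success=True))
memory.add(Episode("book flight", ["pay", "search"], success=False,
                   failure_note="confirm dates before paying"))
```

Storing failures as one-line lessons rather than full traces is also where the context savings mentioned above would plausibly come from.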

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR (3.34 match)
Reinforcement Learning with Rubric Anchors (2.48 match)
RAGEN-2: Reasoning Collapse in Agentic RL (2.48 match)
RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents (1.69 match)
The Art of Scaling Reinforcement Learning Compute for LLMs (1.68 match)
Teaching Large Language Models to Reason with Reinforcement Learning (1.67 match)
A Survey on Post-training of Large Language Models (1.67 match)
Intrinsic Credit Assignment for Long Horizon Interaction (1.67 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an RL researcher evaluating whether single-turn RL's failure to generalize to multi-turn tasks is a solved problem or a persistent constraint. The question: *what specifically breaks when you port single-turn RL methods to multi-turn environments, and has that actually been fixed?*

What a curated library found — and when (findings span 2024–2026; these are dated claims, not current truth):

• Reward signal brittleness: binary rewards work brilliantly on checkable single-turn tasks (70%+ gain vs. <1% baseline) but degrade calibration across trajectories unless augmented with proper scoring rules like Brier score (~2024–2025).
• Context economics: free reasoning per turn exhausts context budget needed for multi-turn retrieval and evidence absorption; capping reasoning *per turn*, not just globally, preserves search quality (~2025).
• Exploration collapse under RL: policies narrow to reward-maximizing modes, squeezing diversity; single-turn tasks tolerate narrowing, multi-turn tasks require exploration breadth—SFT on diverse demonstrations and entropy-aware training order can restore it (~2025–2026).
• Two-phase RL dynamics: procedural correctness (execution) is the early bottleneck; strategic planning emerges later; naive RL that only optimizes 'get this step right' never learns the planning multi-turn success requires (~2025).
• Trajectory-level credit assignment works: asymmetric processing (wins → concrete demos, losses → abstract lessons) handles multi-turn credit assignment that single-turn objectives cannot express (~2024–2025).

Anchor papers (verify; mind their dates):
- arXiv:2508.03501 (2025-08) Training Long-Context, Multi-Turn Software Engineering Agents with RL
- arXiv:2505.24864 (2025-05) ProRL: Prolonged RL Expands Reasoning Boundaries
- arXiv:2606.01347 (2025-08) RLVMR: Verifiable Meta-Reasoning Rewards for Long-Horizon
- arXiv:2605.22817 (2026-05) Vector Policy Optimization: Training for Diversity

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above—reward brittleness, context economics, exploration collapse, two-phase dynamics, credit assignment—judge whether newest models (2026–present), training methods (PPO variants, DPO-style objectives), inference harnesses (multi-agent orchestration, persistent memory), or evals have since relaxed or overturned it. Separate the durable question (e.g., *how do you credit a multi-step trajectory?*) from the perishable limitation (e.g., *binary rewards break calibration*—is this still true with modern rubric-anchored or outcome-supervised approaches?). Cite what resolved each constraint, and say plainly where it still seems to hold.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—any paper claiming single-turn methods *do* generalize, or that the fixes above are themselves insufficient or misconceived.
(3) Propose 2 new research questions that ASSUME the regime may have shifted: e.g., *If trajectory-level credit assignment is solved, what's the next bottleneck?* or *Do multi-turn RL gains persist when tasks are adversarially reordered?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
