Can reinforcement learning scale beyond single-turn language tasks?
Most RL for LLMs targets simple single-turn problems. This research asks whether RL can handle multi-turn interactive environments with sparse rewards and rich environmental feedback, like real software engineering tasks.
Most RL applications for LLMs have been limited to single-turn tasks such as math reasoning and single-shot code generation. These are degenerate MDPs: the episode ends after one action, so there is no intermediate environmental feedback. Software engineering is categorically different: agents must manage stateful, multi-turn interactions across dozens of steps with context windows spanning hundreds of thousands of tokens, interpreting rich feedback (compiler traces, test logs) at each step.
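The contrast can be made concrete with a toy episode loop. The sketch below is only an illustration: `ToyRepoEnv`, its action names, and both policies are hypothetical stand-ins, not the environment or agent used in the research. It shows the two properties the paragraph names: intermediate feedback at every turn, and a reward that arrives only at the end.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRepoEnv:
    """Hypothetical stand-in for a SWE sandbox: the bug counts as fixed
    once the agent has issued both required actions, in any order."""
    required: frozenset = frozenset({"edit_parser", "edit_tests"})
    max_turns: int = 6
    done_actions: set = field(default_factory=set)
    turn: int = 0

    def step(self, action: str):
        """Apply one action and return (observation, reward, done)."""
        self.turn += 1
        if action in self.required:
            self.done_actions.add(action)
        if action == "submit" or self.turn >= self.max_turns:
            solved = self.done_actions == set(self.required)
            # Sparse, delayed reward: only the final test run is scored.
            obs = "tests passed" if solved else "tests failed"
            return obs, float(solved), True
        # Intermediate feedback carries information but no reward.
        remaining = sorted(set(self.required) - self.done_actions)
        obs = f"still failing: {remaining}" if remaining else "edits applied; tests not run yet"
        return obs, 0.0, False


def rollout(env, policy):
    """Run one multi-turn episode; the policy sees the growing history."""
    history, reward, done = [], 0.0, False
    while not done:
        action = policy(history)
        obs, reward, done = env.step(action)
        history.append((action, obs))
    return history, reward


def scripted_policy(history):
    """Hypothetical multi-turn policy: follows a fixed plan, one action per turn."""
    plan = ["edit_parser", "edit_tests", "submit"]
    return plan[len(history)]


def one_shot_policy(history):
    """Single-turn analogue: answers immediately and never reads feedback."""
    return "submit"


multi_history, multi_reward = rollout(ToyRepoEnv(), scripted_policy)
single_history, single_reward = rollout(ToyRepoEnv(), one_shot_policy)
```

In this toy setup the scripted policy needs three turns to earn the terminal reward, while the one-shot policy ends the episode after a single action and earns nothing. A real SWE episode differs in scale (dozens of turns, long observations), not in shape.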
Training Qwen2.5-72B-Instruct with a modified DAPO algorithm roughly doubles SWE-bench Verified success, from a 20% rejection-finetuned baseline to 39%, matching or surpassing larger models like DeepSeek-V3 and Qwen3-235B. The key challenges addressed are long-horizon credit assignment with sparse, delayed rewards; interpretation of complex, informative feedback; and expensive, noisy evaluation.
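This note does not say which parts of DAPO were modified, so the sketch below covers only ingredients described in the original DAPO paper: group-normalized advantages, dropping groups whose rollouts all received the same reward, and asymmetric clipping bounds. The function names are hypothetical, and the default clip values are an assumption taken from the DAPO paper's reported settings, not from the SWE training run.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Normalize terminal rewards within one group of rollouts for the same task.

    Returns None when every rollout got the same reward: such a group has
    zero advantage everywhere and contributes no gradient, so it is skipped
    (the idea behind DAPO's dynamic sampling).
    """
    if max(rewards) == min(rewards):
        return None
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_token_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style surrogate for a single token with decoupled clip bounds.

    ratio is pi_new(token) / pi_old(token). A larger upper bound leaves
    more room to raise the probability of initially unlikely tokens.
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


# One task, four rollouts, one of which passed the tests.
advantages = group_advantages([1.0, 0.0, 0.0, 0.0])
# A group where every rollout failed yields no training signal.
skipped = group_advantages([0.0, 0.0, 0.0, 0.0])
```

With binary test outcomes, the single successful rollout gets a positive advantage and the three failures share a smaller negative one. This is one reason expensive, noisy evaluation matters here: a single mislabeled rollout shifts the advantage of every rollout in its group.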
This matters because it validates that RL's benefits extend beyond the "token-level MDP" framing where most current work operates. The note "Can full episode rewards per step enable better credit assignment?" formalizes sequential credit assignment with MS-GRPO; RL for SWE confirms that multi-step credit assignment is not just theoretically sound but practically achievable at scale. The note "Does limiting reasoning per turn improve multi-turn search quality?" imposes per-turn discipline at inference time; the SWE result suggests that RL training can learn that step-level discipline instead.
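The credit-assignment idea referenced here, attributing the full episode reward to every step rather than discounting by distance from the end, can be compared with ordinary discounting in a few lines. This is a minimal sketch based only on the one-sentence description in the related note; the function names and the discount factor of 0.9 are assumptions chosen for illustration.

```python
def broadcast_credit(num_steps, episode_reward):
    """Every step in the trajectory receives the full episode reward."""
    return [episode_reward] * num_steps


def discounted_credit(num_steps, episode_reward, gamma=0.9):
    """Credit shrinks exponentially with distance from the terminal reward."""
    return [episode_reward * gamma ** (num_steps - 1 - t) for t in range(num_steps)]


# A 40-step episode that ends with all tests passing (reward 1.0).
uniform = broadcast_credit(40, 1.0)
discounted = discounted_credit(40, 1.0)
```

Under discounting with these assumed values, the first step of the 40-step episode receives less than 2% of the credit given to the last step, even though an early action such as locating the right file may have decided the outcome. Broadcasting keeps early steps in play at the cost of also crediting steps that did not matter.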
The interaction structure of SWE — actions producing observable transitions and verifiable outcomes — may be what makes RL feasible here, whereas domains without such structure may remain harder to train.
Inquiring lines that use this note as a source (20)
This note is a source for these synthesized inquiries.
- How do LLM user simulators track and maintain consistent goal states across multi-turn interactions?
- What makes some tasks bounded enough for reliable RL?
- Can multi-turn rewards fix models that lose track midway?
- Why does multi-turn RL generate orders of magnitude more tokens than single-turn?
- What makes session-aware multi-turn tracking necessary for asynchronous training?
- Can multi-turn reinforcement learning improve tool use in language models?
- How does single-turn training undermine multi-turn strategic dialogue?
- What specific metrics distinguish single-turn versus multi-turn collaboration success?
- Why do next-turn reward objectives fail to encourage multi-turn goal progress?
- Does RL refine existing knowledge or discover entirely new capabilities?
- What makes software engineering environments better suited for RL than other interactive domains?
- Why does RL improve sampling efficiency but not expand capability boundaries?
- What distinguishes RL that creates new capabilities from RL that merely teaches timing?
- How does single-turn optimization undermine multi-turn collaborative dynamics?
- Why do LLMs fail at directly solving stochastic control problems?
- How do self-evolving curricula help RL break beyond base model capability boundaries?
- How do complete multi-turn trajectories differ from isolated task examples?
- Why do single-turn RL methods fail to generalize to multi-turn tasks?
- What training duration is actually needed for RL to expand capabilities?
- Can RL directly optimize attention distributions instead of text generation?
Related concepts in this collection (5)
- Can full episode rewards per step enable better credit assignment?
  Can attributing cumulative episode reward to every step in a trajectory, rather than discounting by step distance, actually solve credit assignment in sequential LLM decision-making? This challenges intuitive RL assumptions about how credit should flow backward through time.
  complements: MS-GRPO formalizes sequential credit assignment, SWE validates it at scale
- Does limiting reasoning per turn improve multi-turn search quality?
  When language models engage in iterative search cycles, does capping reasoning at each turn (rather than just total compute) help preserve context for subsequent retrievals and improve overall search effectiveness?
  connects: SWE RL learns per-turn discipline through training rather than inference-time limiting
- Can RL training run while generation continues without waiting?
  Synchronous RL systems waste compute time waiting for slow generation steps. Can training and generation truly decouple while maintaining performance on reasoning tasks?
  enables: AReaL's infrastructure makes this scale of multi-turn RL training practical
- Why do correct code trajectories teach models to tolerate errors?
  Explores why standard outcome-based RL fails for code tool use: when models receive reward for correct final answers despite intermediate code errors, they learn that mistakes are acceptable, producing poor reasoning quality.
  complementary agentic RL challenge: SWE-RL addresses long-horizon credit assignment with sparse rewards, while rStar2-Agent addresses trajectory quality in code-tool environments. Both tackle the noise that tool-using RL introduces (SWE-RL through modified DAPO, rStar2 through GRPO-RoC asymmetric filtering)
- Can AI systems improve themselves through trial and error?
  Explores whether replacing formal proof requirements with empirical benchmark testing enables AI systems to successfully modify and improve their own code iteratively, and what mechanisms prevent compounding failures.
  alternative path to SWE capability: DGM achieves 50% SWE-bench via evolutionary self-modification without RL, while SWE-RL achieves 39% via RL training; DGM's evolutionary archive enables open-ended capability discovery that RL's reward optimization may not explore
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning
- DAPO: An Open-Source LLM Reinforcement Learning System at Scale
- TreeRL: LLM Reinforcement Learning with On-Policy Tree Search
- Teaching Large Language Models to Reason with Reinforcement Learning
- A Survey on Post-training of Large Language Models
- Intrinsic Credit Assignment for Long Horizon Interaction
- From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR
- The Art of Scaling Reinforcement Learning Compute for LLMs
Original note title
rl successfully scales to long-horizon multi-turn software engineering tasks doubling baseline performance