Can RL training run while generation continues without waiting?
Synchronous RL systems waste compute time waiting for slow generation steps. Can training and generation truly decouple while maintaining performance on reasoning tasks?
Synchronous RL systems for large reasoning models (LRMs) alternate strictly between generation and training, so the model always trains on its latest outputs. But this design is inefficient: each generation step must wait for the longest output in its batch, and LRM output lengths vary widely, from tens of thousands of thinking tokens for some prompts to a few hundred for others.
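To make the cost of waiting concrete, here is a rough back-of-the-envelope sketch. It assumes a deliberately simple model that is not from the note: every sequence in a batch holds one decode slot until the longest sequence finishes, with no continuous batching. The function name and the example lengths are hypothetical.

```python
def sync_batch_utilization(output_lengths):
    """Fraction of decode slots that produce a real token when the whole
    batch must wait for its longest output (simplified model)."""
    if not output_lengths:
        raise ValueError("need at least one output length")
    longest = max(output_lengths)
    useful_tokens = sum(output_lengths)
    # Every sequence holds its slot until the longest one finishes.
    total_slots = longest * len(output_lengths)
    return useful_tokens / total_slots


# Hypothetical batch: one long reasoning trace and three short answers.
lengths = [20_000, 500, 300, 200]
utilization = sync_batch_utilization(lengths)
print(f"useful decode fraction: {utilization:.4f}")
```

Under these assumptions, a single long trace leaves most of the batch's decode slots idle, which is the inefficiency described above.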
AReaL fundamentally resolves this by making RL training fully asynchronous. Each rollout worker continuously generates outputs without waiting (streaming generation). Trainer workers run parallel model updates whenever a training batch is available. After each update, model weights are synchronized to rollout workers. The critical consequence: each training batch may contain samples generated by different model versions.
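The data flow described above can be illustrated with a small discrete-time simulation. This is a toy sketch, not AReaL's implementation: the function name, the worker count, the random output lengths, and the rule that a worker picks up new weights only when it starts its next sample are all assumptions made for illustration.

```python
import random
from collections import deque


def simulate_async(num_workers=4, batch_size=8, steps=200, seed=0):
    """Toy simulation: rollout workers never wait for the trainer, and the
    trainer updates as soon as a batch of finished samples is available."""
    rng = random.Random(seed)
    trainer_version = 0
    # Each worker holds [ticks left for its current sample, weight version used].
    workers = [[rng.randint(1, 20), 0] for _ in range(num_workers)]
    finished = deque()   # versions of completed samples, in completion order
    batches = []         # (trainer version at update time, sample versions)
    for _ in range(steps):
        for w in workers:
            w[0] -= 1
            if w[0] == 0:
                finished.append(w[1])
                # Start the next sample immediately with the latest weights.
                w[0] = rng.randint(1, 20)
                w[1] = trainer_version
        if len(finished) >= batch_size:
            batch = [finished.popleft() for _ in range(batch_size)]
            batches.append((trainer_version, batch))
            trainer_version += 1  # one model update, then weights are "synced"
    return batches


batches = simulate_async()
for trained_at, versions in batches[:3]:
    staleness = max(trained_at - v for v in versions)
    print(f"update at v{trained_at}: sample versions {versions}, max staleness {staleness}")
```

Even in this toy setting, samples that were still in flight when the weights changed land in a later batch, so batches mix model versions.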
To make this work, AReaL incorporates a modified PPO objective that can leverage samples from much older model versions without performance loss. This is a significant departure from the conventional wisdom that on-policy data (from the latest model) is essential for RL training quality. Prior semi-asynchronous systems limited version staleness to one or two steps and still used batched generation from a single version.
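The note does not spell out the modified objective. As one illustration of how a PPO-style loss can tolerate stale samples, the sketch below uses a decoupled surrogate: clipping is done against a recent "proximal" policy, and a separate importance weight corrects for the older policy that actually generated the sample. Treat the exact form, the function name, and the parameter names as assumptions rather than AReaL's confirmed formulation.

```python
import math


def decoupled_ppo_term(logp_new, logp_prox, logp_behav, advantage, clip_eps=0.2):
    """Per-token surrogate for a decoupled PPO-style objective (assumed form).

    logp_new   : log-prob under the policy being optimized
    logp_prox  : log-prob under a recent proximal policy (the clipping anchor)
    logp_behav : log-prob under the possibly stale policy that generated the token
    """
    ratio = math.exp(logp_new - logp_prox)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    surrogate = min(ratio * advantage, clipped * advantage)
    # Importance weight correcting for the stale behaviour policy.
    behav_weight = math.exp(logp_prox - logp_behav)
    return behav_weight * surrogate


# Fresh sample: proximal and behaviour policies coincide, so this is plain PPO.
fresh = decoupled_ppo_term(math.log(1.5), 0.0, 0.0, advantage=1.0)
# Stale sample: the behaviour policy gave this token half the proximal probability.
stale = decoupled_ppo_term(math.log(1.5), 0.0, math.log(0.5), advantage=1.0)
print(fresh, stale)
```

When the proximal and behaviour policies are the same, the weight is 1 and the term reduces to the standard clipped PPO surrogate, so the staleness correction only matters for samples generated by older versions.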
This is an infrastructure insight with capability implications. Multi-turn RL (see Can reinforcement learning scale beyond single-turn language tasks?) generates orders of magnitude more tokens than single-turn RL, so the efficiency gains from asynchronous training are not merely convenient but potentially necessary for scaling RL to interactive environments.
The broader principle: when the generation-training bottleneck is resolved, the practical frontier of what RL can train on expands considerably — from single-turn math to multi-turn interactive tasks that require long context and many steps.
Extension (OpenClaw-RL, 2026): The OpenClaw-RL framework pushes async decoupling further — from the 2-loop generation/training split to a 4-loop architecture where policy serving, rollout collection, PRM judging, and policy training run as four independent loops with zero blocking dependencies. This is built on slime and adds session-aware multi-turn tracking, graceful weight updates, flexible PRM support, and large-scale environment parallelization. The crucial extension is conceptual, not just architectural: AReaL assumes batch data collection even while async; OpenClaw-RL makes the serving loop itself the data source. The same infrastructure that responds to users in production simultaneously generates the training signal. Personal agents improve simply by being used. The async decoupling pattern has now generalized from "compute-efficient training" to "continuous learning from live deployment" — where the serving/training boundary dissolves entirely. See Can agent deployment itself generate training signals automatically? for the signal-recovery framing that makes this practical.
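The four-loop idea can be sketched as four coroutines connected by unbounded queues, so that no loop waits on another's progress. This is a toy illustration only: the function names, the dictionary fields, the placeholder reward, and the version bump standing in for a weight update are hypothetical and do not reflect OpenClaw-RL's or slime's actual APIs.

```python
import asyncio


async def serve(turns_q, policy, n_turns):
    """Serving loop: answers turns with whatever policy version is current."""
    for i in range(n_turns):
        await turns_q.put({"turn": i, "version": policy["version"]})
        await asyncio.sleep(0)  # yield so the other loops can run
    await turns_q.put(None)     # sentinel: no more traffic


async def collect(turns_q, rollouts_q):
    """Rollout collection loop: forwards served turns as training candidates."""
    while (turn := await turns_q.get()) is not None:
        await rollouts_q.put(turn)
    await rollouts_q.put(None)


async def judge(rollouts_q, scored_q):
    """PRM judging loop: attaches a reward (placeholder rule, not a real PRM)."""
    while (rollout := await rollouts_q.get()) is not None:
        rollout["reward"] = 1.0 if rollout["turn"] % 2 == 0 else 0.0
        await scored_q.put(rollout)
    await scored_q.put(None)


async def train(scored_q, policy, batch_size, version_log):
    """Training loop: updates the policy whenever a full batch is scored."""
    batch = []
    while (sample := await scored_q.get()) is not None:
        batch.append(sample)
        if len(batch) == batch_size:
            version_log.append([s["version"] for s in batch])
            policy["version"] += 1  # stands in for a weight update
            batch = []


async def main(n_turns=12, batch_size=4):
    policy, version_log = {"version": 0}, []
    turns_q, rollouts_q, scored_q = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    await asyncio.gather(
        serve(turns_q, policy, n_turns),
        collect(turns_q, rollouts_q),
        judge(rollouts_q, scored_q),
        train(scored_q, policy, batch_size, version_log),
    )
    return policy["version"], version_log


final_version, version_log = asyncio.run(main())
print(final_version, version_log)
```

The design point the sketch tries to capture is that the serving loop is itself the data source: the same turns that answer users flow onward to collection, judging, and training.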
Inquiring lines that use this note as a source (6)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- What makes session-aware multi-turn tracking necessary for asynchronous training?
- What interference occurs when planning and synthesis happen in the same component?
- What infrastructure decouples generation from training in asynchronous agent loops?
- How does the rate of generation outpace archival of outputs?
- What training duration is actually needed for RL to expand capabilities?
- How does prolonged RL training differ from standard RLVR approaches?
Related concepts in this collection (4)
- Can reinforcement learning scale beyond single-turn language tasks?
  Most RL for LLMs targets simple single-turn problems. This research asks whether RL can handle multi-turn interactive environments with sparse rewards and rich environmental feedback, like real software engineering tasks.
  enables: asynchronous training makes the compute requirements of multi-turn RL practical
- Can two simple techniques match complex RL algorithms?
  Does vanilla PPO with minimal modifications rival more sophisticated reasoning algorithms like GRPO and DAPO? This explores whether algorithmic complexity is necessary for effective LLM reasoning training.
  connects: both simplify PPO for reasoning; AReaL modifies PPO for staleness tolerance
- Can agent deployment itself generate training signals automatically?
  Can we extract learning signals from the natural next-states that agents encounter during real deployment (user replies, tool outputs, test verdicts) rather than relying on separate annotation pipelines? This reframes how agents improve continuously.
  extends: OpenClaw-RL 4-loop architecture dissolves the serving/training boundary that AReaL's 2-loop async still preserved
- Can scalar rewards capture all the information in agent feedback?
  Exploring whether numerical rewards alone can preserve both the evaluative judgment and directional guidance embedded in natural feedback, or if something crucial gets lost in the conversion.
  extends: the signal decomposition that makes the 4-loop architecture's PRM judging layer richer than scalar reward alone
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
- Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL
- Bridging Offline and Online Reinforcement Learning for LLMs
- OpenClaw-RL: Train Any Agent Simply by Talking
- Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
- A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
- The Art of Scaling Reinforcement Learning Compute for LLMs
- Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning
Original note title
fully asynchronous rl training decouples generation from training without performance loss