INQUIRING LINE

What infrastructure decouples generation from training in asynchronous agent loops?

This explores the systems-level question of how a learning agent can keep acting and generating data while its model is being trained at the same time — rather than freezing one to do the other.


The corpus has a direct answer and a set of adjacent ideas that reframe what 'decoupling' even buys you. The clearest piece of infrastructure is AReaL's fully asynchronous RL design (see "Can RL training run while generation continues without waiting?"): generation workers keep producing rollouts continuously while a separate trainer updates weights, and a modified PPO absorbs the fact that samples now arrive 'stale', that is, generated by a slightly older model version than the one currently training. The payoff is high GPU utilization and, importantly, practical multi-turn RL, where a single long episode would otherwise stall the whole pipeline waiting on the slowest trajectory.
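To make the 'stale sample' idea concrete, here is a minimal sketch of a version-tagged rollout buffer with a staleness bound. It is not AReaL's implementation: the names `Rollout`, `StalenessBoundedBuffer`, `max_staleness`, and `simulate` are hypothetical, the generator and trainer are interleaved in a single thread rather than run as separate workers, and the PPO modification itself is left out. The sketch only shows the bookkeeping that lets a trainer consume samples produced by older weights.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class Rollout:
    trajectory: str      # placeholder for the generated tokens/actions
    policy_version: int  # version of the weights that produced this rollout


class StalenessBoundedBuffer:
    """Queue of rollouts that releases only those within a staleness bound."""

    def __init__(self, max_staleness: int) -> None:
        self.max_staleness = max_staleness
        self._items: Deque[Rollout] = deque()

    def put(self, rollout: Rollout) -> None:
        # Generators never block on the trainer; they just append.
        self._items.append(rollout)

    def take_batch(self, trainer_version: int, batch_size: int) -> List[Rollout]:
        batch: List[Rollout] = []
        while self._items and len(batch) < batch_size:
            rollout = self._items.popleft()
            lag = trainer_version - rollout.policy_version
            if lag <= self.max_staleness:
                batch.append(rollout)
            # Rollouts older than the bound are discarded here.
        return batch


def simulate(steps: int = 4, rollouts_per_step: int = 3) -> List[List[int]]:
    """Interleave generation and training; return the versions each batch used."""
    buffer = StalenessBoundedBuffer(max_staleness=1)
    trainer_version = 0
    generator_version = 0  # generators pick up new weights only every other step
    used_versions: List[List[int]] = []
    for step in range(steps):
        for i in range(rollouts_per_step):
            buffer.put(Rollout(f"traj-{step}-{i}", generator_version))
        batch = buffer.take_batch(trainer_version, batch_size=2)
        used_versions.append([r.policy_version for r in batch])
        trainer_version += 1  # stand-in for one optimizer step
        if step % 2 == 1:
            generator_version = trainer_version  # weight sync to the generators
    return used_versions


if __name__ == "__main__":
    print(simulate())
```

With the defaults, the trainer's second batch is built entirely from version-0 rollouts even though the trainer is already at version 1, and the third step silently drops two leftover version-0 rollouts that have fallen outside the bound. In a real system the two loops would run as separate processes, and the loss would need a correction for the version mismatch, which is the role the modified PPO plays.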

Why this matters shows up in the papers on *why* on-policy interaction is worth the engineering trouble. Agents trained only on static expert demonstrations are capped by the imagination of whoever built the dataset: they never see their own failures or anything outside the demonstrated scenarios (see "Can agents learn beyond what their training data shows?"). Asynchronous loops exist precisely so an agent can learn from its own live experience without the throughput penalty. Pushing that logic further, one line of work argues the training signal doesn't even need to be curated: every action an agent takes produces a next-state signal (a user reply, a tool output, an error, a changed screen) that can feed the policy directly, unifying all agent training under one continuous loop (see "Can agent deployment itself generate training signals automatically?").

The most surprising adjacent move is to decouple generation from training by removing weight updates from the critical path entirely. AgentFly reframes learning as memory operations over a Memory-augmented MDP: case, subtask, and tool memories carry credit assignment and policy improvement while the model's parameters stay frozen (see "Can agents learn continuously from experience without updating weights?"). Here 'generation' and 'learning' aren't two synchronized processes to interleave; learning lives in an external store the agent reads and writes during normal operation. SkillOS shows a related split on the skill-library side: a trainable curator evolves the repository while the executor stays frozen, so the thing that improves and the thing that runs are different components on different update clocks (see "Can a separate trained curator improve skill libraries better than frozen agents?").
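A toy sketch of learning that lives in an external store rather than in weights might look like the following. It is an illustration only, not AgentFly's or SkillOS's design: `Case`, `CaseMemory`, and `frozen_policy` are hypothetical names, retrieval is plain word overlap, and the 'policy' is a stub standing in for a frozen LLM call. The point is that behavior changes between two calls even though the policy function itself never does.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Case:
    task: str      # description of a past task
    plan: str      # what the agent did
    reward: float  # how well it went


@dataclass
class CaseMemory:
    cases: List[Case] = field(default_factory=list)

    def write(self, case: Case) -> None:
        # "Learning" is an append to the store, not a gradient step.
        self.cases.append(case)

    def read(self, task: str) -> Optional[Case]:
        """Return the case with the most shared words, ties broken by reward."""
        query = set(task.lower().split())
        best: Optional[Case] = None
        best_key = (0, float("-inf"))
        for case in self.cases:
            overlap = len(query & set(case.task.lower().split()))
            key = (overlap, case.reward)
            if overlap > 0 and key > best_key:
                best, best_key = case, key
        return best


def frozen_policy(task: str, hint: Optional[Case]) -> str:
    """Stand-in for a frozen model: its output depends only on task and hint."""
    if hint is not None and hint.reward > 0:
        return hint.plan
    return "default plan"


if __name__ == "__main__":
    memory = CaseMemory()
    task = "parse csv file"
    print(frozen_policy(task, memory.read(task)))  # no experience yet
    memory.write(Case("parse csv report", "use csv module", reward=1.0))
    print(frozen_policy(task, memory.read(task)))  # same policy, new behavior
```

Because writes to the store happen during normal operation, there is no separate training phase to schedule against generation, which is the sense in which this family of approaches sidesteps the synchronization problem rather than solving it.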

One more enabler is worth knowing about, because asynchronous RL is bottlenecked by reward latency, not just rollout latency. If every reward requires actually executing code, the trainer waits. Execution-free verification (structured reasoning that reaches ~93% accuracy judging whether two code patches are equivalent) crosses the reliability bar to serve as an RL reward signal without running anything (see "Can structured reasoning replace code execution for RL rewards?"). That's infrastructure too: it decouples the reward from the runtime, the same way async training decouples the update from the rollout.

The thread running through all of these: 'decoupling generation from training' isn't one trick but a design axis. You can stagger the two in time (async RL with stale-sample-tolerant PPO), move learning out of the weights into memory or a curated library, or cut the reward's dependence on execution — and the right choice depends on which synchronization point is actually stalling your loop.


Sources (6 notes)

Can RL training run while generation continues without waiting?

AReaL enables continuous generation across workers while training runs on mixed model versions using modified PPO. The system achieves high GPU utilization and handles stale samples effectively, making multi-turn RL practical.

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

Can agent deployment itself generate training signals automatically?

Every agent action produces a next-state signal (user reply, tool output, error, GUI change) that can train the policy directly. This universal signal source eliminates the need for separate training datasets across conversations, terminal tasks, SWE, and tool use.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Can a separate trained curator improve skill libraries better than frozen agents?

SkillOS shows that separating a trainable curator from a frozen executor, grouped by task streams, causes skill repositories to shift from generic verbose additions toward actionable execution logic and cross-task meta-strategies. The trained curator generalizes across different executor backbones and domains.

Can structured reasoning replace code execution for RL rewards?

Semi-formal reasoning templates enable execution-free patch equivalence verification at 93% accuracy on real agent code, crossing the reliability threshold needed for RL reward signals. This makes execution-free verification viable for certain task classes like fault localization and code reasoning.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a systems architect auditing asynchronous agent loop infrastructure. The question remains: what decouples generation from training, and which decoupling strategy best fits a given bottleneck?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat each as a snapshot of capability at its publication date.
- AReaL's fully-asynchronous RL (2025-05) tolerates stale samples in PPO, achieving high GPU utilization and enabling practical multi-turn RL without freezing rollouts.
- Memory-augmented MDPs (AgentFly, ~2025) decouple learning entirely from weight updates: credit assignment and policy improvement live in external memory while parameters stay frozen.
- Skill-library decoupling (SkillOS, 2026-05) separates a trainable curator from a frozen executor, so improvement and execution run on different clocks.
- Execution-free code verification (~2026) reaches ~93% accuracy judging patch equivalence, decoupling reward signals from runtime cost and unlocking async reward generation.
- On-policy live interaction (2025–2026 consensus) prevents agents being capped by static demonstration imagination; asynchronous loops exist to let agents learn from their own failures without throughput penalty.

Anchor papers (verify; mind their dates):
- arXiv:2505.24298 (AReaL, 2025-05)
- arXiv:2605.06614 (SkillOS, 2026-05)
- arXiv:2603.01896 (Agentic Code Reasoning, 2026-03)
- arXiv:2402.16823 (Language Agents as Optimizable Graphs, 2024-02)

Your task:
(1) RE-TEST EACH DECOUPLING STRATEGY. For stale-sample tolerance, memory-augmented learning, curator–executor splits, and execution-free rewards, determine whether newer models, inference optimization (KV caching, quantization, batching), stronger verifiers, or orchestration (Ray, vLLM multi-instance) have since relaxed, hardened, or obsoleted each constraint. Plainly state: does the bottleneck each solves still exist, or has it migrated?
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. If any paper argues a single-model, on-demand approach now outperforms asynchronous decoupling at scale, or shows that end-to-end training without async infrastructure matches or beats it, cite that tension explicitly.
(3) Propose 2 research questions that ASSUME the regime may have moved: one on whether per-step inference cost (not training bandwidth) is now the actual bottleneck, one on whether memory-based learning and weight-based learning have converged into a unified architecture.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
