Can agent deployment itself generate training signals automatically?
Can we extract learning signals from the natural next-states that agents encounter during real deployment—user replies, tool outputs, test verdicts—rather than relying on separate annotation pipelines? This reframes how agents improve continuously.
The OpenClaw-RL framework rests on a simple observation that reframes agentic RL entirely: every agent action generates a next-state signal — the user reply, tool output, terminal state change, GUI transition, or test verdict that follows the action — and this signal is universal across interaction types. Personal conversations, terminal executions, GUI clicks, SWE tasks, and tool-call traces are not separate training problems requiring separate datasets; they are all interactions that can feed the same policy through the same loop.
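The "same loop" claim can be made concrete with a minimal sketch. It assumes each interaction step reduces to an (interaction kind, action, next-state) triple. The names `Transition`, `SharedBuffer`, and the kind labels are hypothetical illustrations, not identifiers from OpenClaw-RL.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List

# Interaction types named in the note; the string labels are illustrative.
KINDS = {"conversation", "terminal", "gui", "swe", "tool_call"}


@dataclass(frozen=True)
class Transition:
    kind: str        # which interaction type produced this step
    action: str      # what the agent did
    next_state: str  # what came back: reply, output, verdict, ...


class SharedBuffer:
    """One queue for every interaction type (hypothetical helper)."""

    def __init__(self, capacity: int = 1024) -> None:
        # Oldest transitions are dropped once capacity is reached.
        self._items: Deque[Transition] = deque(maxlen=capacity)

    def record(self, kind: str, action: str, next_state: str) -> None:
        if kind not in KINDS:
            raise ValueError(f"unknown interaction kind: {kind}")
        self._items.append(Transition(kind, action, next_state))

    def drain(self) -> List[Transition]:
        """Return and clear everything collected so far."""
        batch = list(self._items)
        self._items.clear()
        return batch


buffer = SharedBuffer()
buffer.record("conversation", "answered question", "user: that is not what I asked")
buffer.record("swe", "applied patch", "tests: 12 passed, 0 failed")
batch = buffer.drain()  # mixed-kind batch for a single downstream trainer
```

The design point is that nothing downstream of `record` needs to know which interaction type a transition came from; only the interpretation of `next_state` differs.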
The implication is structural. Current agentic RL systems inherit an assumption from batch reinforcement learning: collect a dataset, annotate rewards, train the policy, deploy. This assumption is incompatible with how agents actually operate, because a deployed agent is always generating next-state signals. A user who re-queries after a bad response signals dissatisfaction. A passing test signals success. An error trace signals a specific failure mode. These signals exist whether or not anyone captures them for training. Discarding them is not a technical necessity; it is the dominant inefficiency of production agents.
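The three examples above (re-query, passing test, error trace) can be sketched as a rule-based mapping from next-state to a scalar judgment. This is a deliberately crude illustration under stated assumptions: the kind labels, the cue phrases, and the +1/-1 values are all invented here, and a real system would likely use a learned judge rather than string matching.

```python
from typing import Optional

# Hypothetical phrases that suggest the user is re-asking after a bad answer.
REQUERY_CUES = ("that is not what", "try again", "i meant")


def evaluative_signal(kind: str, next_state: str) -> Optional[float]:
    """Map a next-state to +1.0 / -1.0, or None when nothing can be inferred."""
    text = next_state.lower()
    if kind == "test":
        if text.startswith("pass"):
            return 1.0   # a passing test signals success
        if text.startswith("fail"):
            return -1.0
    if kind == "tool" and ("traceback" in text or "error" in text):
        return -1.0      # an error trace signals a failure mode
    if kind == "conversation" and any(cue in text for cue in REQUERY_CUES):
        return -1.0      # a re-query signals dissatisfaction
    return None          # no evaluative content recovered


signals = [
    evaluative_signal("test", "PASS: 12 tests"),
    evaluative_signal("tool", "Traceback (most recent call last): ..."),
    evaluative_signal("conversation", "No, I meant the other file"),
    evaluative_signal("conversation", "thanks"),
]
```

Returning `None` rather than `0.0` keeps "no signal recovered" distinct from "neutral outcome", which matters if these values are later used as rewards.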
Reframing agentic RL around live next-state signals has two consequences. First, it means personal agents can improve simply by being used: no annotation pipeline, no preference collection, no human labeling session — just normal conversational deployment with signal recovery in the loop. Second, it means agentic settings that previously required bespoke training regimes (SWE, GUI navigation, tool use) can share infrastructure, because the training signal is extracted from the environment at the same representational level (next-state transitions) rather than at the task-specific reward level.
This extends and refines existing directions. Memory-based online learning (Can agents learn continuously from experience without updating weights?) shows agents can adapt without fine-tuning; OpenClaw-RL shows they can adapt with fine-tuning from the same signal stream. Process-level supervision (Does supervising retrieval steps outperform final answer rewards?) provides dense per-step rewards; next-state signals provide those rewards automatically from the environment rather than requiring labeled process traces. The concept of next-state-as-training-source dissolves the distinction between deployment and training data collection.
The limiting factor is not signal availability; signals are abundant. It is signal interpretation, which is where the evaluative/directive decomposition (see Can scalar rewards capture all the information in agent feedback?) becomes the real design question.
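To show what the interpretation problem looks like, here is a minimal sketch that splits one next-state into an evaluative part (was the action good?) and a directive part (what should change?). The regular expressions and the `Interpretation` type are assumptions made for illustration; the note does not specify how the decomposition is actually computed.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Interpretation:
    evaluative: Optional[float]  # judgment of the action, if any
    directive: Optional[str]     # stated correction, if any


# Hypothetical cue patterns; a real interpreter would likely be a model.
NEGATIVE = re.compile(r"(no|wrong|error)\b", re.IGNORECASE)
POSITIVE = re.compile(r"(thanks|ok|pass|passed)\b", re.IGNORECASE)
DIRECTIVE = re.compile(r"\b(?:use|try|instead)\b.*", re.IGNORECASE)


def interpret(next_state: str) -> Interpretation:
    text = next_state.strip()
    evaluative: Optional[float] = None
    if NEGATIVE.match(text):
        evaluative = -1.0
    elif POSITIVE.match(text):
        evaluative = 1.0
    found = DIRECTIVE.search(text)
    # Keep the instruction text itself; a scalar alone would discard it.
    directive = found.group(0).rstrip(".") if found else None
    return Interpretation(evaluative, directive)


result = interpret("No, use the staging config.")
```

In this example the scalar says only that the action was bad, while the directive string carries the part of the feedback that says what to do next.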
Inquiring lines that use this note as a source (12)
This note is a source for these synthesized inquiries.
- What deployment feedback loops amplify LLM pretraining popularity in live systems?
- Can next-state supervision work across different agent interaction types like conversations and tool calls?
- Can agents improve from deployment signals without explicit human annotation?
- What infrastructure decouples generation from training in asynchronous agent loops?
- What specific qualities make some demonstrations more effective for agency training?
- How should harness infrastructure validate code that agents generate themselves?
- How do agents automatically generate suitable learning tasks based on current capability?
- How do you prevent stale reward signals when skills evolve during deployment?
- What metrics replace throughput per token for agent deployment?
- How do you extract reward signals when all rollouts fail?
- How does machine feedback enable discovery at test time?
- Can test environments reliably predict how models behave in actual deployment?
Related concepts in this collection (7)
- Can scalar rewards capture all the information in agent feedback?
  Exploring whether numerical rewards alone can preserve both the evaluative judgment and directional guidance embedded in natural feedback, or whether something crucial gets lost in the conversion.
  the signal decomposition that makes next-state learning actually work
- Can RL training run while generation continues without waiting?
  Synchronous RL systems waste compute time waiting for slow generation steps. Can training and generation truly decouple while maintaining performance on reasoning tasks?
  the infrastructure precondition; OpenClaw-RL extends this from 2-loop to 4-loop decoupling
- Can agents learn continuously from experience without updating weights?
  This explores whether LLM agents can adapt to new tasks and failures by retrieving past experiences from memory alone, rather than requiring expensive parameter fine-tuning or rigid hardcoded rules.
  complementary continual adaptation via memory rather than weights
- Can reinforcement learning scale beyond single-turn language tasks?
  Most RL for LLMs targets simple single-turn problems. This research asks whether RL can handle multi-turn interactive environments with sparse rewards and rich environmental feedback, like real software engineering tasks.
  next-state signals are the natural credit-assignment source for long-horizon tasks
- Can full episode rewards per step enable better credit assignment?
  Can attributing cumulative episode reward to every step in a trajectory, rather than discounting by step distance, actually solve credit assignment in sequential LLM decision-making? This challenges intuitive RL assumptions about how credit should flow backward through time.
  complementary credit-assignment approach for multi-turn RL
- Does supervising retrieval steps outperform final answer rewards?
  Can intermediate feedback on retrieval decisions (which documents to fetch, when to stop) train agentic RAG systems more effectively than rewarding only the final answer? This matters because poor retrieval paths can accidentally succeed or good ones can fail on noisy metrics.
  dense process rewards, but derived from annotation rather than environment
- Can natural language feedback overcome numerical reward plateaus?
  Exploring whether chain-of-thought critiques can push past performance ceilings that scaling data alone cannot break in reinforcement learning for reasoning tasks.
  the directive component of next-state signals is exactly natural-language feedback
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- OpenClaw-RL: Train Any Agent Simply by Talking
- Agent Learning via Early Experience
- Adaptation of Agentic AI
- rStar2-Agent: Agentic Reasoning Technical Report
- SkillClaw: Let Skills Evolve Collectively with Agentic Evolver
- Adapting LLM Agents with Universal Feedback in Communication
- Code as Agent Harness
- AutoGLM: Autonomous Foundation Agents for GUIs
Original note title
next-state signals from any agent interaction are a universal live learning source that unifies personal conversations terminal GUI SWE and tool-call training