Can agents learn from their own actions without external rewards?
Explores whether future states produced by an agent's own decisions can serve as supervision signals, bridging the gap between passive imitation learning and reward-dependent reinforcement learning.
Most language agents are trained in one of two ways: supervised fine-tuning on expert demonstrations, which scales poorly and confines the agent to the situations its dataset happens to cover, or reinforcement learning, which fails when environments lack verifiable rewards or require long-horizon credit assignment. The early experience paradigm sits between these: the agent proposes its own actions in the environment, and the future states resulting from those actions become supervision signals, with no reward signal required.
The key move is reframing what counts as "supervision." In SFT, supervision means a human-labeled expert action. In RL, supervision means a scalar reward. In early experience, supervision means the consequence — the next state — that follows the agent's own action. This consequence is always available regardless of whether an environment exposes ground truth, because the environment always responds to actions even when it does not score them. A web form may not tell you whether you filled it out correctly, but it always tells you what happens next.
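The web-form example can be made concrete with a small sketch. Everything here is illustrative rather than taken from the source: `Transition`, `form_env`, and `collect` are hypothetical names, and the toy environment is only an assumption about what "responds but does not score" might look like. The point is that the recorded data has no reward field at all.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Transition:
    """One piece of early experience: note that there is no reward field."""
    state: str
    action: str
    next_state: str  # the consequence of the agent's own action


def form_env(state: str, action: str) -> str:
    """Hypothetical toy web form: it responds to every action but scores none."""
    if action == "submit" and "email=" not in state:
        return state + " | error: email required"
    if action.startswith("type_email:"):
        return state + " | email=" + action.split(":", 1)[1]
    if action == "submit":
        return state + " | confirmation page"
    return state + " | nothing happened"


def collect(
    state: str, actions: List[str], step: Callable[[str, str], str]
) -> List[Transition]:
    """Try each proposed action from the same state and record what follows."""
    return [Transition(state, a, step(state, a)) for a in actions]


transitions = collect(
    "form: signup",
    ["submit", "type_email:a@b.com", "click_logo"],
    form_env,
)
for t in transitions:
    print(t.action, "->", t.next_state)
```

Each `Transition` is usable as supervision even though the environment never said which action was correct.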
Two strategies operationalize this principle. Implicit world modeling trains the policy to predict the next states it has collected, which grounds it in environment dynamics. Self-reflection compares the agent's behavior to expert demonstrations to extract lessons from suboptimal decisions. Both rest on the same idea: the consequences of actions constitute experience, even without rewards.
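A minimal sketch of how the two strategies might turn rollouts into training examples follows. The prompt formats, the function names, and the `template_lesson` stand-in are all assumptions made for illustration; the source does not specify how examples are formatted or how the reflection text is produced, so a caller-supplied `write_lesson` function is used as a placeholder for that step.

```python
from typing import Callable, Dict, List, Tuple

Rollout = Tuple[str, str, str]  # (state, action, next_state)


def world_model_examples(rollouts: List[Rollout]) -> List[Dict[str, str]]:
    """Implicit world modeling: the training target is the observed next state."""
    return [
        {"prompt": f"State: {s}\nAction: {a}\nNext state:", "target": s_next}
        for s, a, s_next in rollouts
    ]


def reflection_examples(
    expert: Rollout,
    own: List[Rollout],
    write_lesson: Callable[[Rollout, Rollout], str],
) -> List[Dict[str, str]]:
    """Self-reflection: contrast each of the agent's own actions with the expert's."""
    state, expert_action, _ = expert
    examples = []
    for attempt in own:
        if attempt[1] == expert_action:
            continue  # identical to the expert action, so there is no contrast
        lesson = write_lesson(expert, attempt)
        examples.append(
            {
                "prompt": f"State: {state}\nAction:",
                "target": f"{lesson}\nAction: {expert_action}",
            }
        )
    return examples


def template_lesson(expert: Rollout, attempt: Rollout) -> str:
    """Hypothetical stand-in for a model-written reflection."""
    return (
        f"Trying '{attempt[1]}' led to '{attempt[2]}', "
        f"while '{expert[1]}' led to '{expert[2]}'."
    )


expert = ("signup form, email empty", "type_email", "email field filled")
own = [
    ("signup form, email empty", "submit", "error: email required"),
    ("signup form, email empty", "type_email", "email field filled"),
]
wm = world_model_examples(own)
sr = reflection_examples(expert, own, template_lesson)
```

In both cases the only inputs are states, actions, and next states; neither function ever reads a reward.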
Across eight diverse environments, both strategies consistently outperform pure imitation baselines, achieve comparable performance with half the expert data or less, and serve as superior warm-starts for subsequent RL. The paradigm is therefore not a substitute for RL but a practical bridge — early experience trains the agent to understand its environment before any reward signal arrives, which means RL fine-tuning starts from a much stronger initialization.
Inquiring lines that use this note as a source (36)
This note is a source for these synthesized inquiries.
- Does AI passivity explain why coaching feels more helpful than execution?
- Can explicit goal state scaffolding at inference time transfer to autonomous tracking through training?
- Why does online RL succeed where supervised training fails for self-correction?
- Can AI systems execute strategies without conscious intention behind them?
- What capabilities can emerge from self-modification that the original agent lacked?
- What information do next-state signals contain beyond what scalar rewards capture?
- How do implicit world models and self-reflection operationalize consequence-based learning?
- Can next-state supervision work across different agent interaction types like conversations and tool calls?
- What happens when agents interact with environments and learn from their own mistakes?
- What makes process-level supervision better than outcome-only reward signals?
- Can subjective tasks be delegated without human feedback loops?
- Can cooperative AI systems make meaningful decisions without a stable self?
- How do process-level rewards compare to environment-extracted next-state signals?
- How does sliding the start state backward create informative learning signals?
- How does next-turn reward optimization contribute to agent passivity?
- How does temporal anchoring maintain the learning signal in self-rewarding loops?
- How do outcome-based and process-based reward models differ in supervision cost?
- What role does self-learning play in improving agent reasoning without annotation?
- Can AI learn intrinsic motivation to assess its own relevance?
- Can small numbers of curated demonstrations produce emergent agentic behavior?
- Why does imitation learning alone plateau without outcome-based refinement?
- Can influence estimation identify the most valuable trajectories in agentic training?
- Can an agent's internal probabilities serve as value signals across domains?
- How does adversarial collapse threaten unsupervised self-play skill construction?
- Can binary judge feedback replace external reward signals for skill learning?
- Does self-play feedback improve skills created from the agent's own experience?
- How does post-training shift models from passive prediction to on-policy action?
- Can models detect when their own trajectory is on-policy versus off-policy?
- How does early branch divergence differ from late branch divergence in supervision signals?
- Can AI systems improve themselves without external feedback?
- How does in-context feedback integration differ from learned reward signals?
- Can early experience replace external rewards as a learning signal?
- How does action-level decomposition differ from token-level imitation in supervision?
- What makes reward models fundamentally different from policy discriminators?
- Can predictive self-supervision work on unlabeled sequential visual data?
- Can agents escape weak belief tracking and conservative action selection traps?
Related concepts in this collection (6)
- Can agents learn beyond what their training data shows?
  Explores whether supervised fine-tuning on expert demonstrations creates a hard ceiling on agent competence, or whether agents can generalize to scenarios their curators never captured.
  extends: companion piece; the passivity trap is the diagnosis, early experience is the treatment
- Can agent deployment itself generate training signals automatically?
  Can we extract learning signals from the natural next-states that agents encounter during real deployment (user replies, tool outputs, test verdicts) rather than relying on separate annotation pipelines? This reframes how agents improve continuously.
  exemplifies: same principle in production; OpenClaw-RL treats next-state as universal supervision, and this note generalizes the paradigm
- Can scalar rewards capture all the information in agent feedback?
  Exploring whether numerical rewards alone can preserve both the evaluative judgment and directional guidance embedded in natural feedback, or whether something crucial gets lost in the conversion.
  extends: refines what supervision the next state actually contains, beyond binary reward
- Can transformers learn to solve new problems within episodes?
  Explores whether transformer models can develop meta-learning abilities through RL training, enabling them to adapt to unseen environments by learning from within-episode experience alone, without updating weights.
  complements: ICRL treats in-context experience as supervision at deployment; early experience does it during training
- Can agents learn from failure without updating their weights?
  Explores whether language models can improve through trial and error by storing reflections in episodic memory rather than fine-tuning. This matters because it suggests a fundamentally different path to agent adaptation.
  exemplifies: same self-reflection strategy in a parameter-free form; early experience is the parametric version
- Can careful selection of 78 demos outperform massive training datasets?
  Does strategic curation of high-quality demonstrations unlock agentic capability more efficiently than scaling training data? LIMI achieved 73.5% on AgencyBench with 78 samples versus 10K+ samples for competing models, suggesting data quality may matter more than quantity.
  tension: LIMI argues curated demonstrations create agency; early experience argues lived consequences do. These are two stories about where agency comes from.
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Agent Learning via Early Experience
- Self-distillation Enables Continual Learning
- OpenClaw-RL: Train Any Agent Simply by Talking
- Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
- Reinforcement Learning be Enough for Thinking?
- Training Language Models to Self-Correct via Reinforcement Learning
- From Simulation to Enaction: Post-trained Language Models Recognize and React to their own Generations
- Intrinsic Credit Assignment for Long Horizon Interaction
Original note title
early experience is a third paradigm between imitation learning and reinforcement learning — agents convert their own action consequences into supervision without external rewards