Can agents learn from their own actions without external rewards?

Explores whether future states produced by an agent's own decisions can serve as supervision signals, bridging the gap between passive imitation learning and reward-dependent reinforcement learning.

Synthesis note · 2026-05-03 · sourced from Data

Most language agents are trained in one of two ways: supervised fine-tuning on expert demonstrations, which scales poorly and confines the agent to the scenarios its dataset anticipated, or reinforcement learning, which fails when environments lack verifiable rewards or require long-horizon credit assignment. The early experience paradigm sits between these: the agent proposes its own actions in the environment, and the future states resulting from those actions become supervision signals, with no reward signal required.

The key move is reframing what counts as "supervision." In SFT, supervision means a human-labeled expert action. In RL, supervision means a scalar reward. In early experience, supervision means the consequence of the agent's own action, that is, the next state the environment returns. This consequence is always available, whether or not the environment exposes ground truth, because the environment always responds to actions even when it does not score them. A web form may not tell you whether you filled it out correctly, but it always tells you what happens next.
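A minimal sketch of this reframing, assuming a toy reward-free environment. The names `Transition`, `ToyFormEnv`, `collect_early_experience`, and `random_policy` are hypothetical illustrations, not part of any library or of the work this note summarizes. The point it tries to show is that `step` never returns a score, yet every call still yields a next state that can be stored as a training target.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Transition:
    state: str       # observation before acting
    action: str      # action the agent itself proposed
    next_state: str  # what the environment showed afterwards (the supervision)


class ToyFormEnv:
    """Hypothetical reward-free environment: a one-field web form."""

    def step(self, state: str, action: str) -> str:
        # The environment never scores the action; it only responds.
        if action == "type_email":
            return "email field filled"
        if action == "click_submit":
            if state == "empty form":
                return "error: email required"
            return "confirmation page"
        return state  # unrecognized actions leave the page unchanged


def random_policy(state: str, k: int, rng: random.Random) -> List[str]:
    # Stand-in for the agent's own policy (assumption: a fixed action set).
    return rng.sample(["type_email", "click_submit", "scroll"], k)


def collect_early_experience(
    env: ToyFormEnv,
    states: List[str],
    propose_actions: Callable[[str, int, random.Random], List[str]],
    k: int = 2,
    seed: int = 0,
) -> List[Transition]:
    """Try k self-proposed actions per state and record what happens next."""
    rng = random.Random(seed)
    rollouts = []
    for state in states:
        for action in propose_actions(state, k, rng):
            rollouts.append(Transition(state, action, env.step(state, action)))
    return rollouts


rollouts = collect_early_experience(
    ToyFormEnv(), ["empty form", "email field filled"], random_policy
)
for t in rollouts:
    print(f"{t.state!r} + {t.action!r} -> {t.next_state!r}")
```

No reward appears anywhere in the loop; the recorded `next_state` strings are the only signal collected.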

Two strategies operationalize this principle: implicit world modeling (using collected future states to ground the policy in environment dynamics by predicting next states) and self-reflection (comparing the agent's behavior to expert demonstrations to extract lessons from suboptimal decisions). Both rest on the same idea: the consequences of actions constitute experience, even without rewards.
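One way the two strategies might turn the same collected transitions into training text is sketched below. This is an illustration under assumptions: the prompt templates, the `Rollout` tuple, and both function names are hypothetical, and the actual formats used in any given system may differ. The world-modeling examples ask for the next state given a state and action; the reflection examples set the agent's own action and its observed outcome beside the expert's action and leave the lesson for a model to write.

```python
from typing import Dict, List, NamedTuple


class Rollout(NamedTuple):
    state: str
    action: str
    next_state: str


def world_model_examples(rollouts: List[Rollout]) -> List[Dict[str, str]]:
    """Implicit world modeling: predict the next state from (state, action)."""
    return [
        {
            "prompt": f"State: {r.state}\nAction: {r.action}\nNext state:",
            "target": f" {r.next_state}",
        }
        for r in rollouts
    ]


def reflection_examples(
    rollouts: List[Rollout], expert_actions: Dict[str, str]
) -> List[Dict[str, str]]:
    """Self-reflection: contrast the agent's action with the expert's."""
    examples = []
    for r in rollouts:
        expert_action = expert_actions.get(r.state)
        # Skip states with no demonstration or where the agent matched it.
        if expert_action is None or expert_action == r.action:
            continue
        examples.append(
            {
                "prompt": (
                    f"State: {r.state}\n"
                    f"Tried: {r.action}\n"
                    f"Observed: {r.next_state}\n"
                    f"Expert action: {expert_action}\n"
                    "Lesson:"
                ),
                "expert_action": expert_action,
            }
        )
    return examples


rollouts = [
    Rollout("empty form", "click_submit", "error: email required"),
    Rollout("empty form", "type_email", "email field filled"),
]
expert_actions = {"empty form": "type_email"}

wm = world_model_examples(rollouts)
rf = reflection_examples(rollouts, expert_actions)
print(len(wm), "world-model examples;", len(rf), "reflection prompts")
```

Neither function reads a reward. The only inputs are the observed next states and, for reflection, the expert action at the same state.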

Across eight diverse environments, both strategies consistently outperform pure imitation baselines, achieve comparable performance with half the expert data or less, and serve as superior warm-starts for subsequent RL. The paradigm is therefore not a substitute for RL but a practical bridge to it: early experience trains the agent to understand its environment before any reward signal arrives, so RL fine-tuning starts from a much stronger initialization.

Inquiring lines that read this note 42

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

How does AI assistance affect human cognitive development and reasoning autonomy?

Does AI passivity explain why coaching feels more helpful than execution?

Can self-supervised signals enable process supervision without human annotation?

What pretraining choices and baseline capability constrain reinforcement learning gains?

How do interface design choices shape consciousness attribution?

Can AI systems execute strategies without conscious intention behind them?

How can AI agents autonomously learn and transfer skills across tasks?

Why do reward structures fail to shape long-term agent learning?

Does self-reflection enable models to reliably correct their errors?

How do implicit world models and self-reflection operationalize consequence-based learning?

How can process reward models supervise complex reasoning traces?

How do we evaluate AI systems when user perception misleads actual performance?

How do multi-agent systems achieve genuine cooperation and reasoning?

How can AI systems learn from failures without cascading errors?

How does sliding the start state backward create informative learning signals?

How do self-generated feedback mechanisms enable effective model learning?

What properties determine whether reward signals teach genuine reasoning?

Is model self-awareness based on genuine introspection or pattern matching?

Can models detect when their own trajectory is on-policy versus off-policy?

How do aggregate reward models systematically exclude minority user preferences?

What makes reward models fundamentally different from policy discriminators?

How should models express uncertainty rather than forced confident answers?

Can agents escape weak belief tracking and conservative action selection traps?

Do language models develop causal world models or rely on statistical patterns?

Does next-state prediction alone build mechanistic world models or just sophisticated interpolation?

Related concepts in this collection 6


Can agents learn beyond what their training data shows? Explores whether supervised fine-tuning on expert demonstrations creates a hard ceiling on agent competence, or whether agents can generalize to scenarios their curators never captured.
extends: companion piece; passivity trap is the diagnosis, early experience is the treatment

Can agent deployment itself generate training signals automatically? Can we extract learning signals from the natural next-states that agents encounter during real deployment (user replies, tool outputs, test verdicts) rather than relying on separate annotation pipelines? This reframes how agents improve continuously.
exemplifies: same principle in production; OpenClaw-RL treats next-state as universal supervision, and this note generalizes the paradigm

Can scalar rewards capture all the information in agent feedback? Exploring whether numerical rewards alone can preserve both the evaluative judgment and directional guidance embedded in natural feedback, or if something crucial gets lost in the conversion.
extends: refines what supervision the next state actually contains, beyond binary reward

Can transformers learn to solve new problems within episodes? Explores whether transformer models can develop meta-learning abilities through RL training, enabling them to adapt to unseen environments by learning from within-episode experience alone, without updating weights.
complements: ICRL treats in-context experience as supervision at deployment; early experience does it during training

Can agents learn from failure without updating their weights? Explores whether language models can improve through trial and error by storing reflections in episodic memory rather than fine-tuning. This matters because it suggests a fundamentally different path to agent adaptation.
exemplifies: same self-reflection strategy in a parameter-free form; early experience is the parametric version

Can careful selection of 78 demos outperform massive training datasets? Does strategic curation of high-quality demonstrations unlock agentic capability more efficiently than scaling training data? LIMI achieved 73.5% on AgencyBench with 78 samples versus 10K+ samples for competing models, suggesting data quality may matter more than quantity.
tension: LIMI argues curated demonstrations create agency; early experience argues lived consequences do. Two stories about where agency comes from.
