SYNTHESIS NOTE
Training, RL, and Test-Time Scaling · Model Architecture and Internals · Agentic Systems and Tool Use

Can agents learn from their own actions without external rewards?

Explores whether future states produced by an agent's own decisions can serve as supervision signals, bridging the gap between passive imitation learning and reward-dependent reinforcement learning.

Synthesis note · 2026-05-03 · sourced from Data

Most language agents are trained in one of two ways: supervised fine-tuning on expert demonstrations, which scales poorly and confines the agent to the situations its dataset happens to cover, or reinforcement learning, which breaks down when environments lack verifiable rewards or demand long-horizon credit assignment. The early experience paradigm sits between these: the agent proposes its own actions in the environment, and the future states that result from those actions become supervision signals, with no reward signal required.

The key move is reframing what counts as "supervision." In SFT, supervision means a human-labeled expert action. In RL, supervision means a scalar reward. In early experience, supervision means the consequence — the next state — that follows the agent's own action. This consequence is always available regardless of whether an environment exposes ground truth, because the environment always responds to actions even when it does not score them. A web form may not tell you whether you filled it out correctly, but it always tells you what happens next.
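The reward-free data collection described above can be illustrated with a minimal sketch. Everything named here (`ToyFormEnv`, `Transition`, `collect_early_experience`, `random_policy`, the action strings) is a hypothetical stand-in invented for illustration, not the source's implementation; a random sampler stands in for the agent's policy.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    state: str
    action: str
    next_state: str  # the only supervision signal; there is no reward field


class ToyFormEnv:
    """Hypothetical web-form environment: it responds to actions but never scores them."""

    def step(self, state: str, action: str) -> str:
        if state == "form_empty" and action == "click_submit":
            return "error: required field missing"
        if state == "form_empty" and action == "type_email":
            return "form_filled"
        if state == "form_filled" and action == "click_submit":
            return "confirmation_page"
        return state  # the action had no visible effect


ACTIONS = ["type_email", "click_submit", "scroll_down"]


def random_policy(state: str, k: int, rng: random.Random) -> list:
    """Stand-in for the agent's own policy: propose k distinct candidate actions."""
    return rng.sample(ACTIONS, k)


def collect_early_experience(env, states, propose_actions, k=2, seed=0):
    """Roll out the agent's own actions and record what the environment does next."""
    rng = random.Random(seed)
    data = []
    for state in states:
        for action in propose_actions(state, k, rng):
            data.append(Transition(state, action, env.step(state, action)))
    return data


transitions = collect_early_experience(
    ToyFormEnv(), ["form_empty", "form_filled"], random_policy
)
for t in transitions:
    print(t)
```

Each recorded tuple carries a next state even though the environment never says whether an action was correct, which is the property the paragraph above relies on.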

Two strategies operationalize this principle: implicit world modeling (using collected future states to ground the policy in environment dynamics by predicting next states) and self-reflection (comparing the agent's behavior to expert demonstrations to extract lessons from suboptimal decisions). Both rest on the same principle: the consequences of an agent's actions constitute experience, even without rewards.
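As a rough sketch of how the two strategies might turn collected experience into training examples, consider the Python below. The prompt formats, the function names, and the `template_reflection` helper are assumptions made for illustration; in practice the reflection text would likely come from a language model, and the note does not specify the exact target format.

```python
def world_modeling_example(state: str, action: str, next_state: str) -> dict:
    """Implicit world modeling: train the policy to predict the observed next state."""
    return {
        "prompt": f"State: {state}\nAction: {action}\nPredict the next state.",
        "target": next_state,
    }


def template_reflection(state, expert_action, expert_next, own_action, own_next) -> str:
    """Hypothetical stand-in for a model-generated reflection."""
    return (
        f"Taking '{own_action}' led to '{own_next}', "
        f"whereas the expert's '{expert_action}' led to '{expert_next}'."
    )


def self_reflection_example(
    state, expert_action, expert_next, own_action, own_next, reflect=template_reflection
) -> dict:
    """Self-reflection: contrast the agent's outcome with the expert's, then
    train on the resulting lesson followed by the expert action (assumed format)."""
    lesson = reflect(state, expert_action, expert_next, own_action, own_next)
    return {
        "prompt": f"State: {state}\nChoose the next action.",
        "target": f"{lesson}\nAction: {expert_action}",
    }


wm = world_modeling_example(
    "form_empty", "click_submit", "error: required field missing"
)
sr = self_reflection_example(
    state="form_empty",
    expert_action="type_email",
    expert_next="form_filled",
    own_action="click_submit",
    own_next="error: required field missing",
)
print(wm["target"])
print(sr["target"])
```

Neither example needs a reward: the world-modeling target is the observed next state, and the reflection is built from the difference between two observed outcomes.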

Across eight diverse environments, both strategies consistently outperform pure imitation baselines, reach comparable performance with half the expert data or less, and provide better warm starts for subsequent RL. The paradigm is therefore not a substitute for RL but a practical bridge to it: early experience trains the agent to understand its environment before any reward signal arrives, so RL fine-tuning starts from a stronger initialization.

Original note title

early experience is a third paradigm between imitation learning and reinforcement learning — agents convert their own action consequences into supervision without external rewards