SYNTHESIS NOTE
Agentic Systems and Tool Use

Can agents learn beyond what their training data shows?

Explores whether supervised fine-tuning on expert demonstrations creates a hard ceiling on agent competence, or whether agents can generalize to scenarios their curators never captured.

Synthesis note · 2026-05-03 · sourced from Data

The dominant paradigm for training language agents is supervised fine-tuning on expert-curated demonstrations. This bypasses the need for reward signals by letting agents map states to actions using static datasets. But the convenience hides a structural limitation: the agent never interacts with the environment during training, never observes the outcomes of its own actions, and therefore cannot learn from failure, refine its decision-making, or generalize to unseen situations.

The deeper problem is that the agent's competence is bounded by what the demonstration curators imagined. Every state-action pair in the dataset reflects a scenario someone thought to capture. Scenarios outside that imagination — edge cases, recovery from errors, paths the expert would never take — do not exist in the training signal at all. This means the agent learns the expert's idealized trajectory, not the structure of the environment. When the deployed environment presents anything unfamiliar, the agent has no internal model that can extrapolate, because its training never exposed it to consequences.

This is a passivity trap. Scaling high-quality human demonstrations is expensive and difficult to sustain, but even unlimited expert data would not solve the underlying problem — the agent is bound by the coverage of the demonstrations rather than by its own capacity to grow from experience. The demonstration paradigm assumes the world stops where the dataset stops.

The implication for agentic AI design is significant: data quantity and even data quality are insufficient. What agents need is the capacity to convert their own actions into learning signals — which is exactly what Can agents learn from their own actions without external rewards? proposes — requiring the agent to be in the environment, not merely trained on a snapshot of it.

Inquiring lines that use this note as a source 98

This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.

Related concepts in this collection 6

This note in its neighbourhood — explore the map, then jump to a related concept in the list below.

Concept map
17 direct connections · 159 in 2-hop network ·dense cluster Open in graph ↗

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

Original note title

expert demonstrations lock agents into the imagination of the training data — restricting what an agent can learn to scenarios its curators happened to consider