SYNTHESIS NOTE

Can agents learn beyond what their training data shows?

Explores whether supervised fine-tuning on expert demonstrations creates a hard ceiling on agent competence, or whether agents can generalize to scenarios their curators never captured.

Synthesis note · 2026-05-03 · sourced from Data

The dominant paradigm for training language agents is supervised fine-tuning on expert-curated demonstrations. This bypasses the need for reward signals by letting agents map states to actions using static datasets. But the convenience hides a structural limitation: the agent never interacts with the environment during training, never observes the outcomes of its own actions, and therefore cannot learn from failure, refine its decision-making, or generalize to unseen situations.

The deeper problem is that the agent's competence is bounded by what the demonstration curators imagined. Every state-action pair in the dataset reflects a scenario someone thought to capture. Scenarios outside that imagination (edge cases, recovery from errors, paths the expert would never take) do not exist in the training signal at all. The agent therefore learns the expert's idealized trajectory, not the structure of the environment. When the deployed environment presents anything unfamiliar, the agent has no internal model from which to extrapolate, because its training never exposed it to the consequences of its actions.
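The coverage bound can be made concrete with a toy example. What follows is a minimal sketch, not taken from the note's sources: it assumes a hypothetical one-dimensional corridor and a tabular form of behavioral cloning, and the names `demonstrations`, `fit_behavioral_cloning`, and `act` are illustrative only.

```python
from collections import Counter, defaultdict

# Hypothetical expert demonstrations on a 1-D corridor: the expert always
# starts at position 0 and walks right until it reaches the goal at position 3.
demonstrations = [
    [(0, "right"), (1, "right"), (2, "right")],
    [(0, "right"), (1, "right"), (2, "right")],
]


def fit_behavioral_cloning(demos):
    """Tabular behavioral cloning: keep the most frequent expert action per state."""
    counts = defaultdict(Counter)
    for trajectory in demos:
        for state, action in trajectory:
            counts[state][action] += 1
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}


def act(policy, state):
    """Return the cloned action, or None when the state was never demonstrated."""
    return policy.get(state)


policy = fit_behavioral_cloning(demonstrations)

covered = act(policy, 1)    # a state the expert visited
uncovered = act(policy, 4)  # an overshoot state the expert never entered

print("covered state 1 ->", covered)
print("uncovered state 4 ->", uncovered)
```

The table contains entries only for states the expert visited, so the overshoot state has no learned response. A neural policy would not return `None` in that situation; it would still emit some action, but one that no demonstration constrains, which is the same gap in a less visible form.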

This is a passivity trap. Scaling high-quality human demonstrations is expensive and difficult to sustain, but even unlimited expert data would not solve the underlying problem: the agent is bound by the coverage of the demonstrations rather than by its own capacity to grow from experience. The demonstration paradigm assumes the world stops where the dataset stops.

The implication for agentic AI design is significant: neither data quantity nor data quality is sufficient on its own. What agents need is the capacity to convert their own actions into learning signals, which is what the companion note "Can agents learn from their own actions without external rewards?" proposes. That requires the agent to be in the environment, not merely trained on a snapshot of it.
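As a rough illustration of what "converting actions into learning signals" could look like, the sketch below lets an agent try actions in a toy environment and records the resulting next states, with no reward involved. It is an assumption-laden toy and not the method of the companion note: the corridor dynamics in `step`, the `explore` routine, and the ability to resume exploration from any previously reached state are all hypothetical.

```python
ACTIONS = ("left", "right")


def step(state, action):
    """Hypothetical corridor dynamics: positions 0..4, walls clamp movement."""
    delta = 1 if action == "right" else -1
    return max(0, min(4, state + delta))


def explore(start_state):
    """Try every action in every reachable state and record what happens.

    No reward is used: the observed next state is the only supervision signal.
    Assumes the agent can resume from any state it has already reached.
    """
    dynamics = {}  # (state, action) -> observed next state
    frontier = [start_state]
    seen = {start_state}
    while frontier:
        state = frontier.pop()
        for action in ACTIONS:
            next_state = step(state, action)
            dynamics[(state, action)] = next_state
            if next_state not in seen:
                seen.add(next_state)
                frontier.append(next_state)
    return dynamics


dynamics = explore(start_state=0)

# Consequences an expert walking straight to a goal would never demonstrate:
print("left at the wall:", dynamics[(0, "left")])
print("right at the far wall:", dynamics[(4, "right")])
```

Because the agent generates the transitions itself, the recorded table includes outcomes such as bumping into a wall, which an expert trajectory that always moves toward the goal would not contain. Whether such signals are enough to improve a real language agent is the question the companion note takes up.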

Inquiring lines that read this note · 112

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How do multi-agent systems achieve genuine cooperation and reasoning?

What drives capability and cost efficiency in agent systems?

How can LLM user simulators model realistic goal-driven conversation?

How can AI agents autonomously learn and transfer skills across tasks?

Why does reinforcement learning suppress output diversity compared to supervised fine-tuning?

How does memorization interact with learning and generalization?

Can curated demonstrations compensate for smaller or simpler training environments?

How should conversational agents balance goal-driven initiative with user control?

Can agents learn user intent from unlabeled video without text labels?

How does objective evolution guide discovery better than fixed planning?

Why do agents confidently report success despite actually failing tasks?

How do knowledge graphs enable efficient multi-hop reasoning over alternatives?

Can knowledge graphs generate scalable training data for deep search agents?

Does externalizing cognitive work and state improve agent reliability?

What pretraining choices and baseline capability constrain reinforcement learning gains?

Can AI-generated outputs constitute genuine knowledge or valid claims?

How does test-time aggregation affect reasoning correctness and reliability?

When do aggregated imperfect demonstrations fail to outperform the best expert?

Do base models contain latent reasoning that training can unlock?

Do emergent abilities result from genuine new capabilities or implicit in-context learning?

Why does verification consistently lag behind AI generation?

How do we evaluate AI systems when user perception misleads actual performance?

Which AI imaginaries dominate training data and shape system behavior most strongly?

How should agents balance memory condensation to optimize context efficiency?

What constrains reinforcement learning's ability to expand model reasoning?

How does the pretrained prior set a capability ceiling for reward model exploration?

Why do reward structures fail to shape long-term agent learning?

How do self-generated feedback mechanisms enable effective model learning?

Does alignment training create blind spots in detecting genuine safety threats?

What makes behavioral cloning produce more persuadable but less aligned agents?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

How does pretrained knowledge constrain what adaptation strategies can achieve?

How do training priors constrain what context information can override?

Why do pretrained model priors reduce the usefulness of retrieved experience?

Can debate mechanisms prevent silent agreement on wrong answers in multi-agent reasoning?

Does training on self-play disagreement data improve multi-agent reasoning outcomes?

Does conversational format create illusions of genuine AI communication?

Why do agents show interaction without influence on semantic content but dramatic action changes?

Can AI systems balance emotional competence with factual reliability?

How does the pretrained prior constrain the ceiling for empathy RL improvements?

Why does supervised fine-tuning improve accuracy while degrading reasoning quality?

Can single-axis benchmarks accurately predict agent deployment success?

Can single benchmarks predict whether an agent will work in the real world?

Is model self-awareness based on genuine introspection or pattern matching?

How should personalization be implemented to improve AI assistant effectiveness?

Can personalized AI learning systems actually widen rather than narrow educational gaps?

How can models identify insufficient information and respond appropriately without guessing?

How can agents detect missing information before attempting to solve problems?

Can alternative training methods improve on supervised fine-tuning for language models?

How does SDPO relate to agents learning from verbal reflection without parameter updates?

What makes weaker teacher models effective for stronger student training?

Can teachers trained under uncertainty constraints distill better generalizing students?

How does AI assistance affect human cognitive development and reasoning autonomy?

Why does continuous agent inference differ from human user inference?

Do harness improvements transfer across model scales or memorize shortcuts?

What components of agent scaffolding most impact domain-specific output quality?

How does AI adoption affect human skill development and labor equality?

Does codifying expertise into AI agents drive faster labor substitution?

Does AI fluency substitute for verifiable accuracy in human judgment?

How much does domain expertise actually improve human forecasting under uncertainty?

How should models express uncertainty rather than forced confident answers?

Can agents escape weak belief tracking and conservative action selection traps?

Related concepts in this collection · 6

Can agents learn from their own actions without external rewards? Explores whether future states produced by an agent's own decisions can serve as supervision signals, bridging the gap between passive imitation learning and reward-dependent reinforcement learning.
extends: companion piece — diagnosis vs treatment of the passivity trap
Can non-reasoning models catch up with more compute? Explores whether inference-time compute budget can close the performance gap between standard models and those trained for reasoning, and what training mechanisms might enable this.
exemplifies: SFT/imitation ceiling argument generalizes — bounded by training demonstration quality
Can models trained on many imperfect experts outperform each one? Do generative models trained on diverse, imperfect human experts develop an implicit consensus that surpasses any individual contributor? This explores whether aggregating diverse perspectives at training time, rather than inference time, can denoise human biases.
tension: counter-claim — diverse expert demonstrations can exceed any individual expert; the bound here is curatorial breadth, not aggregation
Can careful selection of 78 demos outperform massive training datasets? Does strategic curation of high-quality demonstrations unlock agentic capability more efficiently than scaling training data? LIMI achieved 73.5% on AgencyBench with 78 samples versus 10K+ samples for competing models, suggesting data quality may matter more than quantity.
tension: LIMI argues curation produces agency from minimal data; this note argues curation alone is the ceiling — both can be right depending on whether environment interaction is downstream
Why do LLM agents ignore condensed experience summaries? LLM agents faithfully learn from raw experience but systematically disregard condensed summaries of the same experience. This study investigates whether the problem lies in how summaries are made, how models process them, or whether models simply don't need them.
complements: even after escaping demonstration imagination, agents privilege raw over condensed experience — the imagination problem recurs at the experience-summarization level
Why do AI agents fail at workplace social interaction? Explores why current AI agents struggle most with communicating and coordinating with colleagues in realistic workplace settings, despite strong reasoning capabilities in other domains.
exemplifies: the deployment gap that demonstration training cannot close — real task variability exceeds demonstration coverage

Related papers in this collection · 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

Agent Learning via Early Experience · 0.83 match · arXiv
Behavioral Exploration: Learning to Explore via In-Context Adaptation · 0.81 match · arXiv
SkillClaw: Let Skills Evolve Collectively with Agentic Evolver · 0.81 match · arXiv
Artifacts as Memory Beyond the Agent Boundary · 0.81 match · arXiv
SkillOS: Learning Skill Curation for Self-Evolving Agents · 0.80 match · arXiv
LIMI: Less is More for Agency · 0.80 match · arXiv
RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents · 0.80 match · arXiv
Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning · 0.80 match · arXiv

Original note title

expert demonstrations lock agents into the imagination of the training data — restricting what an agent can learn to scenarios its curators happened to consider