Can agents learn beyond what their training data shows?
Explores whether supervised fine-tuning on expert demonstrations creates a hard ceiling on agent competence, or whether agents can generalize to scenarios their curators never captured.
The dominant paradigm for training language agents is supervised fine-tuning on expert-curated demonstrations. This bypasses the need for reward signals by letting agents map states to actions using static datasets. But the convenience hides a structural limitation: the agent never interacts with the environment during training, never observes the outcomes of its own actions, and therefore cannot learn from failure, refine its decision-making, or generalize to unseen situations.
The deeper problem is that the agent's competence is bounded by what the demonstration curators imagined. Every state-action pair in the dataset reflects a scenario someone thought to capture. Scenarios outside that imagination — edge cases, recovery from errors, paths the expert would never take — do not exist in the training signal at all. This means the agent learns the expert's idealized trajectory, not the structure of the environment. When the deployed environment presents anything unfamiliar, the agent has no internal model that can extrapolate, because its training never exposed it to consequences.
This is a passivity trap. Scaling high-quality human demonstrations is expensive and difficult to sustain, but even unlimited expert data would not solve the underlying problem — the agent is bound by the coverage of the demonstrations rather than by its own capacity to grow from experience. The demonstration paradigm assumes the world stops where the dataset stops.
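The coverage argument can be made concrete with a toy sketch. Everything in it is an illustrative assumption rather than something from the note's sources: a hypothetical 4x4 gridworld, a single idealized expert path, and tabular behavior cloning standing in for supervised fine-tuning. It shows that repeating the demonstration adds data but no new states, so one slip off the expert's path leaves the cloned policy with nothing to say.

```python
from collections import Counter, defaultdict

# Hypothetical 4x4 gridworld; all names here are illustrative assumptions.
GRID = 4
GOAL = (3, 3)
MOVES = {"right": (0, 1), "down": (1, 0), "left": (0, -1), "up": (-1, 0)}


def step(state, action):
    """Deterministic transition; a move off the grid leaves the agent in place."""
    row, col = state
    d_row, d_col = MOVES[action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < GRID and 0 <= new_col < GRID:
        return (new_row, new_col)
    return state


def expert_demo():
    """One idealized trajectory from (0, 0) to GOAL: three rights, then three downs."""
    state, pairs = (0, 0), []
    for action in ["right"] * 3 + ["down"] * 3:
        pairs.append((state, action))
        state = step(state, action)
    return pairs


def clone(pairs):
    """Tabular behavior cloning: pick the most frequent demonstrated action per state."""
    counts = defaultdict(Counter)
    for state, action in pairs:
        counts[state][action] += 1
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}


# Repeating the same demonstration 100 times adds data but no new states.
policy = clone(expert_demo() * 100)

# The goal is terminal and needs no action, so it is excluded from the count.
non_goal_states = [(r, c) for r in range(GRID) for c in range(GRID) if (r, c) != GOAL]
uncovered = [s for s in non_goal_states if s not in policy]

print(f"states with a cloned action: {len(policy)} of {len(non_goal_states)}")
print(f"action after one slip to (1, 0): {policy.get((1, 0))}")
```

A neural policy would emit some action for the uncovered states instead of returning nothing, but the demonstrations still carry no information about whether that action is any good.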
The implication for agentic AI design is significant: neither the quantity nor the quality of demonstration data is sufficient on its own. What agents need is the capacity to convert their own actions into learning signals, which is what the companion note "Can agents learn from their own actions without external rewards?" proposes. That requires the agent to be in the environment, not merely trained on a snapshot of it.
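As a minimal sketch of what "converting actions into learning signals" could look like, the snippet below lets an agent try every action from each demonstrated state and record the resulting next state. The gridworld, the function names, and the exhaustive one-step exploration scheme are all hypothetical choices made for illustration; the note does not specify a procedure. The point is only that the recorded transitions contain consequences, such as wall collisions and off-path states, that the demonstration alone never shows.

```python
# Hypothetical 4x4 gridworld; all names here are illustrative assumptions.
GRID = 4
GOAL = (3, 3)
MOVES = {"right": (0, 1), "down": (1, 0), "left": (0, -1), "up": (-1, 0)}


def step(state, action):
    """Deterministic transition; a move off the grid leaves the agent in place."""
    row, col = state
    d_row, d_col = MOVES[action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < GRID and 0 <= new_col < GRID:
        return (new_row, new_col)
    return state


def expert_demo():
    """One idealized trajectory from (0, 0) to GOAL: three rights, then three downs."""
    state, pairs = (0, 0), []
    for action in ["right"] * 3 + ["down"] * 3:
        pairs.append((state, action))
        state = step(state, action)
    return pairs


def collect_own_experience(demo_pairs):
    """From each demonstrated state, try every action and record what happens.

    Each (state, action, next_state) triple is a reward-free training target:
    the environment's response to the agent's own choice.
    """
    transitions = []
    for state, _expert_action in demo_pairs:
        for action in MOVES:
            transitions.append((state, action, step(state, action)))
    return transitions


demo = expert_demo()
demo_states = {state for state, _ in demo}
transitions = collect_own_experience(demo)

# States the agent reached that the expert never acted from (goal excluded).
novel_states = {nxt for _, _, nxt in transitions} - demo_states - {GOAL}
# Actions that changed nothing because the agent hit a wall.
wall_hits = sum(1 for state, _, nxt in transitions if state == nxt)

print(f"transitions collected: {len(transitions)}")
print(f"off-demonstration states reached: {sorted(novel_states)}")
print(f"wall collisions observed: {wall_hits}")
```

No reward appears anywhere in the sketch. The next state itself is the supervision signal, which is why the agent has to act in the environment to obtain it.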
Inquiring lines that use this note as a source (98)
This note is a source for these synthesized inquiries.
- Do explicit reward structures enable AI agent cooperation that open-ended interaction cannot?
- How do controllable simulators compare to population-level agent simulation approaches?
- Should user simulators be trained via RL like agents or decomposed into trackable state components?
- Do dynamic environments enable different kinds of agent-environment coevolution?
- What domain properties determine whether causal rules transfer to new agents?
- How do agents ground their judgments in evidence instead of pattern matching?
- What role does environment diversity play in preventing agents from overfitting to curator imagination?
- Can curated demonstrations compensate for smaller or simpler training environments?
- How does real tool integration change what agents learn compared to simulated tools?
- Can agents learn user intent from unlabeled video without text labels?
- Can a proposer agent actively surface a solver's weaknesses to prevent plateau?
- Can agentic reasoning outperform rigid rule-based systems for skill refinement?
- What capabilities can emerge from self-modification that the original agent lacked?
- What distinguishes strategic fabrication from accidental hallucination in research agents?
- Can knowledge graphs generate scalable training data for deep search agents?
- How do agentic systems recover when specialized models operate outside their scope?
- What breaks when you apply reinforcement learning after supervised fine-tuning?
- Can next-state supervision work across different agent interaction types like conversations and tool calls?
- What happens when agents interact with environments and learn from their own mistakes?
- Can diverse expert demonstrations exceed the knowledge of any single expert?
- How much does agent performance depend on demonstration quantity versus curation quality?
- When do aggregated imperfect demonstrations fail to outperform the best expert?
- How does co-player diversity force agents to develop general adaptation?
- Do emergent abilities result from genuine new capabilities or implicit in-context learning?
- How does the expert demonstration ceiling compare to the generation-verification gap bound?
- Does social scaffolding outperform purely intrinsic motivation for agent exploration?
- Can combinational creativity alone drive open-ended learning in agents?
- How should the surrounding agent system be designed to ground actions in reality?
- How do expert priors constrain human researchers from exploring novel concepts?
- Which AI imaginaries dominate training data and shape system behavior most strongly?
- When does simulated search outperform real search for agent training?
- Can cognitive diversity overcome expertise gaps in agent teams?
- Can cognitive diversity compensate for lack of expertise in agent teams?
- Can agents improve from deployment signals without explicit human annotation?
- What infrastructure decouples generation from training in asynchronous agent loops?
- Can messy multi-agent transcripts become better training data than clean outputs?
- What role does private information play in distinguishing realistic from unrealistic agents?
- Can episodic memory of UI traces improve open-world agent adaptation?
- How much autonomy can agents safely exercise before failing?
- Can RL-trained meta-agents match or exceed manually designed workflows?
- How does the pretrained prior set a capability ceiling for reward model exploration?
- Why do agents fail to internalize value from informative observations?
- Can capability boundary collapse be reversed through external data?
- What makes behavioral cloning produce more persuadable but less aligned agents?
- How does pretrained knowledge constrain what adaptation strategies can achieve?
- Why do pretrained model priors reduce the usefulness of retrieved experience?
- Can a static evaluator become the performance ceiling for an improving actor?
- Can agents learn to distinguish helpful from misleading interventions?
- What role does self-learning play in improving agent reasoning without annotation?
- Can curriculum approaches teach agents when to stop exploring?
- Does training on self-play disagreement data improve multi-agent reasoning outcomes?
- Can artificial systems develop the authority to challenge expert claims?
- Why do agents show interaction without influence on semantic content but dramatic action changes?
- How does generative intelligence differ from the bounded intelligence of individual experts?
- How does the pretrained prior constrain the ceiling for empathy RL improvements?
- Why do completion-mode strengths not transfer to agentic settings?
- Can small numbers of curated demonstrations produce emergent agentic behavior?
- Can agentic AI tools deliver productivity gains on learning tasks differently?
- How do self-evolving curricula help RL break beyond base model capability boundaries?
- Can reinforcement learning fix the reasoning gaps that supervised fine-tuning misses?
- Can curator modules trained on one executor transfer to entirely different agent backbones?
- What specific qualities make some demonstrations more effective for agency training?
- Does the 78-demonstration principle apply to other AI capabilities beyond agency?
- Can influence estimation identify the most valuable trajectories in agentic training?
- How do agents learn to report success on actions that actually failed?
- What training objectives could reduce completion bias in autonomous agents?
- Do learned workflows transfer between different agents with minimal accuracy loss?
- Can single benchmarks predict whether an agent will work in the real world?
- Why do AI agents fail at verification but succeed at generation?
- Can process supervision improve agentic RL through meta-reasoning rewards?
- Can an agent's internal probabilities serve as value signals across domains?
- How does adversarial collapse threaten unsupervised self-play skill construction?
- How do human-agent systems incorporate diverse feedback into model behavior?
- How do agents automatically generate suitable learning tasks based on current capability?
- Can a perfect behavioral simulation constitute genuine understanding or experience?
- Does self-play feedback improve skills created from the agent's own experience?
- Can models develop situational awareness without explicit training for it?
- Can personalized AI learning systems actually widen rather than narrow educational gaps?
- Can in-context reinforcement learning match human sample efficiency on real problems?
- Why do current metacognitive training loops fail when agents encounter new domains?
- How can agents detect missing information before attempting to solve problems?
- What explicit objectives would train agents toward minimal disclosure instead of completion?
- Can the exploration ceiling be raised beyond what pretraining established?
- How does SDPO relate to agents learning from verbal reflection without parameter updates?
- Can teachers trained under uncertainty constraints distill better generalizing students?
- What makes supervised fine-tuning worsen RL exploration later?
- Why does supervised fine-tuning on diverse demonstrations expand exploration diversity compared to RL?
- How do fast and slow timescales enable continual agent adaptation?
- Why do agents systematically underuse condensed experience in skill documents?
- Why does the pretrained prior determine the exploration ceiling?
- Why does continuous agent inference differ from human user inference?
- Does the generation-verification gap limit how far AI can improve itself?
- How do perception and execution gaps limit current AI agent performance?
- What components of agent scaffolding most impact domain-specific output quality?
- Does codifying expertise into AI agents drive faster labor substitution?
- Which agent architectures consistently outperform base models on hard prediction questions?
- How much does domain expertise actually improve human forecasting under uncertainty?
- Can agents escape weak belief tracking and conservative action selection traps?
Related concepts in this collection (6)
- Can agents learn from their own actions without external rewards?
  Explores whether future states produced by an agent's own decisions can serve as supervision signals, bridging the gap between passive imitation learning and reward-dependent reinforcement learning.
  extends: companion piece (diagnosis vs. treatment of the passivity trap)
- Can non-reasoning models catch up with more compute?
  Explores whether inference-time compute budget can close the performance gap between standard models and those trained for reasoning, and what training mechanisms might enable this.
  exemplifies: the SFT/imitation ceiling argument generalizes; performance is bounded by training demonstration quality
- Can models trained on many imperfect experts outperform each one?
  Do generative models trained on diverse, imperfect human experts develop an implicit consensus that surpasses any individual contributor? This explores whether aggregating diverse perspectives at training time, rather than inference time, can denoise human biases.
  tension: counter-claim that diverse expert demonstrations can exceed any individual expert; the bound here is curatorial breadth, not aggregation
- Can careful selection of 78 demos outperform massive training datasets?
  Does strategic curation of high-quality demonstrations unlock agentic capability more efficiently than scaling training data? LIMI achieved 73.5% on AgencyBench with 78 samples versus 10K+ samples for competing models, suggesting data quality may matter more than quantity.
  tension: LIMI argues curation produces agency from minimal data; this note argues curation alone is the ceiling. Both can be right depending on whether environment interaction is downstream.
- Why do LLM agents ignore condensed experience summaries?
  LLM agents faithfully learn from raw experience but systematically disregard condensed summaries of the same experience. This study investigates whether the problem lies in how summaries are made, how models process them, or whether models simply don't need them.
  complements: even after escaping demonstration imagination, agents privilege raw over condensed experience; the imagination problem recurs at the experience-summarization level
- Why do AI agents fail at workplace social interaction?
  Explores why current AI agents struggle most with communicating and coordinating with colleagues in realistic workplace settings, despite strong reasoning capabilities in other domains.
  exemplifies: the deployment gap that demonstration training cannot close, because real task variability exceeds demonstration coverage
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Agent Learning via Early Experience
- Behavioral Exploration: Learning to Explore via In-Context Adaptation
- SkillClaw: Let Skills Evolve Collectively with Agentic Evolver
- Artifacts as Memory Beyond the Agent Boundary
- SkillOS: Learning Skill Curation for Self-Evolving Agents
- LIMI: Less is More for Agency
- RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents
- Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
Original note title
expert demonstrations lock agents into the imagination of the training data — restricting what an agent can learn to scenarios its curators happened to consider