INQUIRING LINE

Showing an AI great examples shapes how it behaves — but its ceiling is the imagination of whoever picked them.

Can curated demonstrations compensate for smaller or simpler training environments?

This explores whether hand-picked example demonstrations can substitute for rich, interactive training environments — and the corpus is mostly a warning that they can't fully, though smart curation extracts more from what you have.

This reads the question as: if you can't give an agent a big, interactive environment to learn in, can you make up the gap with carefully chosen demonstrations instead? The corpus suggests demonstrations are a real lever but a capped one: they shape *how* a model behaves more than they expand *what* it can do. The sharpest statement of the ceiling is that agents trained purely on static expert demonstrations get locked into the imagination of whoever built the dataset. Because they never interact with an environment, they can't learn from their own failures or generalize past the scenarios they were shown (see "Can agents learn beyond what their training data shows?"). So demonstrations don't *replace* a richer environment; they substitute the curator's foresight for the environment's feedback, and competence is bounded by the former.

There's an even more deflating finding about what demonstrations actually teach. Instruction tuning on semantically empty or deliberately wrong instructions performs almost identically to tuning on correct ones; what transfers is knowledge of the output *space*, not task understanding (see "Does instruction tuning teach task understanding or output format?"). Read alongside the locked-in-imagination result, this reframes the question: curated demonstrations are very good at the cheap thing (teaching format, shape, the space of valid answers) and structurally weak at the expensive thing (genuine new capability). If your simpler environment's deficit is 'the model doesn't know the output format yet,' demonstrations compensate beautifully. If the deficit is 'the model needs to discover strategies through trial and error,' they don't.

Where the corpus turns constructive is on *curation as a multiplier*: getting more out of the demonstrations you have rather than needing more of them. Ordering matters enormously. Sequencing demonstrations from harder (sparse representations) to easier (dense representations) yields real gains with no extra labels (see "Can representation sparsity order few-shot demonstrations effectively?"), and something as crude as *where* a demo block sits in the prompt swings accuracy by up to 20%, independent of content (see "How much does demo position alone affect in-context learning accuracy?"). So a small, well-arranged set can outperform a larger careless one. The strongest 'compensation' story is sequencing demonstration-style imitation *before* exploration: supervised imitation first establishes reasonable behaviors, which then makes a thin reward environment informative enough for reinforcement to sharpen, and the combination beats either alone (see "Does sequencing imitation then exploration training improve reasoning?"). Demonstrations here don't replace the environment; they make a weak one usable.

Two cautions complicate any 'just add good demos' instinct. Quality is relative to the learner: teacher-refined, objectively better demonstrations actively *degrade* a student when they exceed its learning frontier, so the right move is filtering refinements against the student's own profile rather than maximizing demonstration quality (see "Does teacher-refined data always improve student model performance?"). And curated data that's too *hard*, the analog of an over-ambitious environment, backfires: near-impossible examples teach degenerate shortcuts that contaminate existing skills (see "Do overly hard RLVR samples actually harm model capabilities?"). So 'better demonstrations' isn't a scalar you turn up; the compensation only works inside the band the model can actually absorb.

The thing you might not have known you wanted: the most interesting alternative in the corpus isn't curating demonstrations harder, it's making the *curator* itself learnable. Decoupling a trainable skill-curator from a frozen executor causes the demonstration/skill library to evolve from generic verbose entries toward strategic, reusable meta-skills, and that learned curator transfers across different model backbones (see "Can a separate trained curator improve skill libraries better than frozen agents?"). That flips the question's premise: instead of asking whether a fixed set of human-curated demonstrations can stand in for a richer environment, it suggests building a small loop that *generates* increasingly strategic demonstrations, recovering some of the open-ended learning that static curation, by definition, can't.

Sources (8 notes)

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions achieve performance comparable to models trained on full, correct instructions, with instruction tuning reaching 43% against 42.6% for a random baseline. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Can representation sparsity order few-shot demonstrations effectively?

Sparsity-Guided Curriculum In-Context Learning uses last-layer activation sparsity to order demonstrations from sparse (harder) to dense (easier), yielding considerable performance improvements. This approach requires no external difficulty labels and works across diverse in-context learning tasks.

How much does demo position alone affect in-context learning accuracy?

Repositioning an identical demo block from prompt start to end shifts accuracy by up to 20% and flips nearly half of predictions. This spatial effect operates independently of demo content and spans multiple task types.
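The two notes above describe curation levers that need no extra labels: ordering demonstrations from sparse to dense, and choosing where the demo block sits in the prompt. A minimal sketch of both follows. It assumes each demonstration already comes with a last-layer activation vector, and it measures sparsity as the fraction of near-zero entries, which is an illustrative choice rather than the metric used in the cited work. The names `sparsity`, `order_sparse_to_dense`, and `build_prompt` are hypothetical.

```python
from typing import List, Sequence, Tuple

Demo = Tuple[str, str]  # (input text, output text)


def sparsity(activations: Sequence[float], eps: float = 1e-6) -> float:
    """Fraction of near-zero entries (assumed sparsity measure, not the paper's)."""
    if len(activations) == 0:
        raise ValueError("activation vector must not be empty")
    near_zero = sum(1 for a in activations if abs(a) <= eps)
    return near_zero / len(activations)


def order_sparse_to_dense(
    demos: Sequence[Demo], activations: Sequence[Sequence[float]]
) -> List[Demo]:
    """Order demos from most sparse (treated as harder) to most dense (easier)."""
    if len(demos) != len(activations):
        raise ValueError("need exactly one activation vector per demo")
    ranked = sorted(
        zip(demos, activations),
        key=lambda pair: sparsity(pair[1]),
        reverse=True,  # highest sparsity first
    )
    return [demo for demo, _ in ranked]


def build_prompt(demos: Sequence[Demo], query: str, demos_first: bool = True) -> str:
    """Place the identical demo block before or after the query."""
    block = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in demos)
    question = f"Input: {query}\nOutput:"
    return f"{block}\n\n{question}" if demos_first else f"{question}\n\n{block}"


if __name__ == "__main__":
    demos = [("2+2", "4"), ("17*3", "51"), ("9-4", "5")]
    # Toy activation vectors standing in for real last-layer activations.
    acts = [[0.0, 0.5, 0.2, 0.1], [0.0, 0.0, 0.0, 0.3], [0.0, 0.0, 0.4, 0.2]]
    ordered = order_sparse_to_dense(demos, acts)
    print(build_prompt(ordered, "6*7", demos_first=True))
    print("---")
    print(build_prompt(ordered, "6*7", demos_first=False))
```

Both prompt variants carry identical demonstration content, so any accuracy gap measured between them isolates the position effect. Which placement is better is an empirical question for each model and task.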

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.
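One way to picture filtering against the student's own profile is a likelihood floor. The sketch below is only an assumption about what such a filter could look like, not the method from the cited note. It takes precomputed scores, such as the student's mean per-token log-likelihood, builds a profile from the student's scores on its own outputs, and keeps a teacher refinement only if the student does not find it far less likely than its own text. The function name and the `k` threshold are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def within_frontier(
    refined_scores: Sequence[float],
    student_own_scores: Sequence[float],
    k: float = 2.0,
) -> List[int]:
    """Return indices of refinements scoring at or above (mean - k * std).

    The mean and std come from the student's scores on its own outputs.
    All scores are assumed to be student log-likelihoods, where higher
    means more familiar to the student.
    """
    if len(student_own_scores) == 0:
        raise ValueError("need at least one score to build the student profile")
    floor = mean(student_own_scores) - k * pstdev(student_own_scores)
    return [i for i, score in enumerate(refined_scores) if score >= floor]


if __name__ == "__main__":
    own = [-1.0, -1.2, -0.8, -1.0]   # student's scores on its own outputs
    refined = [-0.9, -1.25, -3.0]    # student's scores on teacher refinements
    print(within_frontier(refined, own))  # the far-off third refinement is dropped
```

The design choice is that the threshold comes from the student rather than from any absolute notion of quality, which matches the note's point that a better demonstration can still be the wrong one for a given learner.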

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
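The normalization effect described in this note can be checked with simple arithmetic. The sketch below assumes binary 0/1 rewards and standardizes each reward by the group's mean and population standard deviation. Implementations differ on details such as sample versus population variance, so treat this as an illustration, not any specific library's code. The function name is hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Near-impossible prompt: one accidental success in eight rollouts.
    hard = group_relative_advantages([1, 0, 0, 0, 0, 0, 0, 0])
    # Moderately hard prompt: four successes in eight rollouts.
    moderate = group_relative_advantages([1, 1, 1, 1, 0, 0, 0, 0])
    print(f"hard-prompt success advantage:     {hard[0]:.3f}")
    print(f"moderate-prompt success advantage: {moderate[0]:.3f}")
```

With a success rate p in the group, a successful rollout's advantage works out to sqrt((1 - p) / p). That is about 2.65 when p = 1/8 and exactly 1.0 when p = 1/2. The normalization cannot tell whether the lone success came from sound reasoning or from a shortcut, so the rarer the success, the harder whatever produced it gets reinforced.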

Can a separate trained curator improve skill libraries better than frozen agents?

SkillOS shows that separating a trainable curator from a frozen executor, grouped by task streams, causes skill repositories to shift from generic verbose additions toward actionable execution logic and cross-task meta-strategies. The trained curator generalizes across different executor backbones and domains.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning (2.41 match)
SkillOS: Learning Skill Curation for Self-Evolving Agents (1.71 match)
The Invisible Leash: Why RLVR May Not Escape Its Origin (1.68 match)
RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization (1.67 match)
SkillClaw: Let Skills Evolve Collectively with Agentic Evolver (1.66 match)
Where to show Demos in Your Prompt: A Positional Bias of In-Context Learning (1.64 match)
Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs (1.63 match)
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? (1.61 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-evaluating whether curated demonstrations can structurally compensate for smaller or simpler training environments in agent learning. The question remains open; treat the findings below as dated claims to be stress-tested against current capability.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; note that agent architectures, scale, and inference tooling have shifted materially in this window.

• Static expert demonstrations lock agents into the curator's imagination—without environment interaction, models cannot learn from failures or generalize beyond shown scenarios (2023–2024).
• Instruction tuning on semantically empty or wrong instructions performs nearly identically to tuning on correct ones; demonstrations teach the output-format distribution, not task understanding (~2023).
• Demonstration sequencing (a sparse-to-dense, i.e. harder-to-easier, curriculum) and position in the prompt (up to a 20% accuracy swing) yield outsized gains; a small, well-ordered set can outperform a larger careless one (~2025).
• Supervised imitation-then-RL curriculum outperforms either alone; demonstrations make weak reward environments usable rather than replacing them (~2024).
• Teacher-refined demonstrations degrade students when they exceed the learner's frontier; separately, a trainable skill curator decoupled from a frozen executor generalizes across executor backbones and domains (~2026).

Anchor papers (verify; mind their dates):
• arXiv:2305.11383 (2023): Do Models Really Learn to Follow Instructions?
• arXiv:2510.08558 (2025): Agent Learning via Early Experience
• arXiv:2605.06614 (2026): SkillOS: Learning Skill Curation for Self-Evolving Agents
• arXiv:2605.28388 (2026): Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs

Your task:

(1) RE-TEST EACH CONSTRAINT. For the "locked imagination" and "format-not-understanding" findings: have larger models, in-context learning advances, or multi-turn interaction loops since 2024–2025 relaxed these caps? Probe whether curriculum sequencing gains persist with modern instruction-following and whether imitation-then-RL still beats alternatives. Flag which constraints remain hard structural limits.

(2) Surface the strongest SUPERSEDING or CONTRADICTING work from the last ~6 months. Look for papers challenging the "demonstrations don't scale capability" thesis, or showing demonstration-free environment interaction now outperforms curated+weak-env pipelines.

(3) Propose 2 research questions that ASSUME the regime may have shifted: (a) Can learned curators (SkillOS-style) now replace environment interaction entirely given sufficient scale? (b) Do frontier models exhibit qualitatively different demonstration-absorption patterns—e.g., learning strategies, not just formats—compared to 2024–2025 baselines?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
