SYNTHESIS NOTE

Can careful selection of 78 demos outperform massive training datasets?

Does strategic curation of high-quality demonstrations unlock agentic capability more efficiently than scaling training data? LIMI achieved 73.5% on AgencyBench with 78 samples versus 10K+ samples for competing models, suggesting data quality may matter more than quantity.

Synthesis note · 2026-02-23 · sourced from Agents

The LIMI paper challenges the core assumption that agentic capability scales with training data volume. Using only 78 carefully designed training samples, each capturing a complete multi-turn interaction sequence including tool use, reasoning, and environmental feedback, LIMI achieves 73.5% on AgencyBench, well ahead of Kimi-K2-Instruct (24.1%), DeepSeek-V3.1 (11.9%), Qwen3-235B-A22B-Instruct (27.5%), and GLM-4.5 (45.1%). Most strikingly, LIMI shows a 53.7% improvement over models trained on 10,000 samples.
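The comparison can be checked with a few lines of arithmetic. The sketch below uses only the scores and sample counts quoted in this note; the variable names are illustrative. Reading the 53.7% figure as a relative improvement is an assumption, since the note does not say whether it is relative or absolute, and the implied baseline it produces is derived, not reported here.

```python
# Scores (percent on AgencyBench) exactly as quoted in this note.
limi_score = 73.5
baselines = {
    "Kimi-K2-Instruct": 24.1,
    "DeepSeek-V3.1": 11.9,
    "Qwen3-235B-A22B-Instruct": 27.5,
    "GLM-4.5": 45.1,
}

# Absolute gap in percentage points between LIMI and each baseline.
gaps = {name: round(limi_score - score, 1) for name, score in baselines.items()}

# How many times fewer samples LIMI uses than a 10,000-sample training set.
sample_ratio = 10_000 / 78

# ASSUMPTION: if "53.7% improvement" is relative, the implied baseline
# score is limi_score / 1.537. The note does not state this baseline.
implied_baseline = limi_score / 1.537

for name, gap in gaps.items():
    print(f"{name}: LIMI leads by {gap} points")
print(f"Sample ratio: about {sample_ratio:.0f}x fewer samples")
print(f"Implied baseline under the relative reading: {implied_baseline:.1f}%")
```

Under these numbers the smallest lead is 28.4 points (over GLM-4.5), and 78 samples is roughly 128 times fewer than 10,000.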

Three innovations drive this:

1. Agentic query synthesis: human-AI collaborative collection from real-world scenarios plus systematic GitHub PR-based synthesis, ensuring ecological validity
2. Complete trajectory collection: full multi-turn sequences from task understanding through tool utilization to successful completion, not isolated demonstrations
3. The Agency Efficiency Principle: machine autonomy emerges from strategic curation, not data accumulation
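
To make "complete trajectory" concrete, here is a minimal sketch of how one such sample might be represented and checked. The `Turn` and `Trajectory` classes, their field names, the role labels, and the example task are all hypothetical and chosen for illustration; they are not LIMI's actual data format.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical role labels for illustration; not LIMI's actual schema.
REQUIRED_ROLES = {"reasoning", "tool_call", "environment"}

@dataclass
class Turn:
    role: str      # e.g. "user", "reasoning", "tool_call", "environment"
    content: str   # text, a tool invocation, or feedback from the environment

@dataclass
class Trajectory:
    query: str
    turns: List[Turn] = field(default_factory=list)
    completed: bool = False   # True only if the task finished successfully

    def is_complete(self) -> bool:
        """Complete means: reasoning, tool use, and environmental feedback
        all appear, and the task ended in successful completion."""
        seen = {turn.role for turn in self.turns}
        return self.completed and REQUIRED_ROLES <= seen

# A full multi-turn sequence, from task understanding to completion.
sample = Trajectory(
    query="Fix the failing unit test in the repo",
    turns=[
        Turn("user", "Fix the failing unit test in the repo"),
        Turn("reasoning", "Run the tests first to see the failure."),
        Turn("tool_call", "run_tests()"),
        Turn("environment", "1 failed: test_parse_date"),
        Turn("reasoning", "The parser ignores time zones; patch it."),
        Turn("tool_call", "apply_patch('parser.py')"),
        Turn("environment", "All tests passed"),
    ],
    completed=True,
)

# An isolated demonstration: one step, no tool use, no feedback.
isolated = Trajectory(
    query="Fix the failing unit test in the repo",
    turns=[Turn("reasoning", "Patch the parser.")],
)

print(sample.is_complete())    # True
print(isolated.is_complete())  # False
```

The check is deliberately crude: it only tests that every kind of turn is present and the task succeeded, which is the contrast the second item draws between full sequences and isolated demonstrations.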

This extends a pattern now documented across three capability domains: reasoning (LIMO reached complex mathematical reasoning with 817 samples), instruction-following (LIMA reached alignment with 1,000 examples), and now agency. Because base models appear to already contain hidden reasoning ability (see "Do base models already contain hidden reasoning ability?"), the mechanism is likely the same: curated demonstrations activate latent agentic patterns already embedded through pretraining on code, documentation, and workflow descriptions. On this view, the training data doesn't teach agency; it triggers the phase transition from passive language model to active agent.

The practical implication challenges the resource-intensive approach to building agentic systems. If 78 demonstrations outperform 10,000, the bottleneck is data quality and trajectory design, not data volume. Taken together with the related note "Can models improve themselves on tasks without verifiable answers?", there appears to be a consistent principle: capability activation requires showing the model what using a capability looks like, not exhaustive training.

Inquiring lines that read this note 4

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Each line of inquiry below is followed by the specific question that draws on this note.

How does example difficulty affect learning efficiency in language models?

Does selecting examples from multiple complexity levels outperform selecting only high-quality examples?

What drives capability and cost efficiency in agent systems?

How much does agent performance depend on demonstration quantity versus curation quality?

How can AI agents autonomously learn and transfer skills across tasks?

What specific qualities make some demonstrations more effective for agency training?

What makes weaker teacher models effective for stronger student training?

Can we cheaply estimate which samples are currently most informative?

Related concepts in this collection 5

Do base models already contain hidden reasoning ability? Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
extends: the same minimal-activation principle applies to agency, not just reasoning
Can models improve themselves on tasks without verifiable answers? Most self-improvement methods require verifiable correctness signals like math or code. Can models improve on open-ended instruction tasks where right answers aren't automatically checkable? And what minimal training is needed to unlock this?
parallel finding: 1000 demonstrations activate reasoning; 78 demonstrations activate agency
Can a single training example unlock mathematical reasoning? Explores whether one example is enough to dramatically improve math problem-solving in language models, and whether learning continues after perfect memorization.
strongest version of minimal-data activation; LIMI is the agentic equivalent
Can we train better models on less data? Can gradient-based influence estimation identify which instruction data actually matters most? The research explores whether selecting small subsets of training data by their similarity to target capabilities might outperform training on everything.
complementary mechanism: influence estimation identifies which data matters
Can agents learn continuously from experience without updating weights? This explores whether LLM agents can adapt to new tasks and failures by retrieving past experiences from memory alone, rather than requiring expensive parameter fine-tuning or rigid hardcoded rules.
AgentFly's case bank grows from experience; the 78-demonstration efficiency principle suggests a small number of high-quality cases may suffice for the case bank to bootstrap effective retrieval
