Can careful selection of 78 demos outperform massive training datasets?
Does strategic curation of high-quality demonstrations unlock agentic capability more efficiently than scaling training data? LIMI achieved 73.5% on AgencyBench with 78 samples versus 10K+ samples for competing models, suggesting data quality may matter more than quantity.
The LIMI paper challenges the core assumption that agentic capability scales with training data volume. Using only 78 carefully designed training samples, each capturing a complete multi-turn interaction sequence including tool use, reasoning, and environmental feedback, LIMI achieves 73.5% on AgencyBench, dramatically outperforming GLM-4.5 (45.1%), Qwen3-235B-A22B-Instruct (27.5%), Kimi-K2-Instruct (24.1%), and DeepSeek-V3.1 (11.9%). Most strikingly, LIMI shows a 53.7% improvement over models trained on 10,000 samples.
Three innovations drive this:
- Agentic query synthesis — human-AI collaborative collection from real-world scenarios plus systematic GitHub PR-based synthesis, ensuring ecological validity
- Complete trajectory collection — full multi-turn sequences from task understanding through tool utilization to successful completion, not isolated demonstrations
- The Agency Efficiency Principle — machine autonomy emerges from strategic curation, not data accumulation
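To make "complete trajectory" concrete, here is a minimal sketch of how one training sample might be represented and checked. The class names (`Step`, `Trajectory`), the field names, and the `is_complete` check are all hypothetical illustrations; the note does not describe LIMI's actual data schema or filtering criteria.

```python
from dataclasses import dataclass, field
from typing import List, Literal

# Hypothetical step kinds, named after the components the note lists:
# reasoning, tool use, and environmental feedback.
StepKind = Literal["reasoning", "tool_call", "environment_feedback"]
REQUIRED_KINDS = {"reasoning", "tool_call", "environment_feedback"}


@dataclass
class Step:
    kind: StepKind
    content: str


@dataclass
class Trajectory:
    query: str                                   # the agentic task as posed
    steps: List[Step] = field(default_factory=list)
    completed: bool = False                      # did the task end successfully?


def is_complete(traj: Trajectory) -> bool:
    """Assumed curation rule: the task finished successfully and the
    trajectory contains every kind of step, not an isolated fragment."""
    kinds = {step.kind for step in traj.steps}
    return traj.completed and REQUIRED_KINDS <= kinds


# A full multi-turn sequence, from task understanding to completion.
full = Trajectory(
    query="Fix the failing unit test in the repository",
    steps=[
        Step("reasoning", "The test fails on an off-by-one index."),
        Step("tool_call", "run_tests(path='tests/')"),
        Step("environment_feedback", "1 failed, 12 passed"),
        Step("reasoning", "Patch the loop bound and re-run."),
        Step("tool_call", "run_tests(path='tests/')"),
        Step("environment_feedback", "13 passed"),
    ],
    completed=True,
)

# An isolated demonstration: reasoning only, never brought to completion.
fragment = Trajectory(query="Fix the failing unit test", steps=[Step("reasoning", "...")])

curated = [t for t in (full, fragment) if is_complete(t)]
```

Under this assumed rule only `full` survives curation, which mirrors the note's distinction between full multi-turn sequences and isolated demonstrations.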
LIMI extends a pattern now documented across three capability domains: reasoning (LIMO achieved complex math with 817 samples), instruction-following (LIMA achieved alignment with 1,000 examples), and now agency. Because base models appear to already contain latent reasoning ability (see "Do base models already contain hidden reasoning ability?"), the mechanism is likely the same: curated demonstrations activate latent agentic patterns already embedded through pretraining on code, documentation, and workflow descriptions. On this view, the training data doesn't teach agency; it triggers the phase transition from passive language model to active agent.
The practical implication challenges the resource-intensive approach to building agentic systems. If 78 demonstrations outperform 10K, the bottleneck is data quality and trajectory design, not data volume. Taken together with the parallel finding in "Can models improve themselves on tasks without verifiable answers?", there appears to be a consistent principle: capability activation requires showing the model what it looks like to use a capability, not exhaustive training.
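The size of the gap behind this claim can be worked out directly from the figures quoted in this note. The sketch below only does arithmetic on those reported numbers. It does not try to reproduce the 53.7% figure, since the note does not give the score of the 10,000-sample baseline or say whether the improvement is relative or in percentage points.

```python
# AgencyBench scores as reported in this note (percent).
scores = {
    "LIMI": 73.5,
    "GLM-4.5": 45.1,
    "Qwen3-235B-A22B-Instruct": 27.5,
    "Kimi-K2-Instruct": 24.1,
    "DeepSeek-V3.1": 11.9,
}

limi = scores["LIMI"]

# Absolute lead of LIMI over each baseline, in percentage points.
gaps = {
    name: round(limi - score, 1)
    for name, score in scores.items()
    if name != "LIMI"
}

# Data-volume ratio: 78 curated samples versus a 10,000-sample training set.
sample_ratio = 10_000 / 78

for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"LIMI leads {name} by {gap} points")
print(f"LIMI uses roughly {sample_ratio:.0f}x fewer samples")
```

Even the closest baseline, GLM-4.5, trails by 28.4 points, while the curated set is about 128 times smaller than a 10,000-sample one.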
Inquiring lines that use this note as a source 4
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Does selecting examples from multiple complexity levels outperform selecting only high-quality examples?
- How much does agent performance depend on demonstration quantity versus curation quality?
- What specific qualities make some demonstrations more effective for agency training?
- Can we cheaply estimate which samples are currently most informative?
Related concepts in this collection 5
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  extends: the same minimal-activation principle applies to agency, not just reasoning
- Can models improve themselves on tasks without verifiable answers?
  Most self-improvement methods require verifiable correctness signals like math or code. Can models improve on open-ended instruction tasks where right answers aren't automatically checkable? And what minimal training is needed to unlock this?
  parallel finding: 1000 demonstrations activate reasoning; 78 demonstrations activate agency
- Can a single training example unlock mathematical reasoning?
  Explores whether one example is enough to dramatically improve math problem-solving in language models, and whether learning continues after perfect memorization.
  strongest version of minimal-data activation; LIMI is the agentic equivalent
- Can we train better models on less data?
  Can gradient-based influence estimation identify which instruction data actually matters most? The research explores whether selecting small subsets of training data by their similarity to target capabilities might outperform training on everything.
  complementary mechanism: influence estimation identifies which data matters
- Can agents learn continuously from experience without updating weights?
  This explores whether LLM agents can adapt to new tasks and failures by retrieving past experiences from memory alone, rather than requiring expensive parameter fine-tuning or rigid hardcoded rules.
  AgentFly's case bank grows from experience; the 78-demonstration efficiency principle suggests a small number of high-quality cases may suffice for the case bank to bootstrap effective retrieval
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- LIMI: Less is More for Agency
- What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity
- ToolFlow: Boosting LLM Tool-Calling Through Natural and Coherent Dialogue Synthesis
- LLMs Corrupt Your Documents When You Delegate
- LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
- Measuring Agents in Production
- Tree Search for LLM Agent Reinforcement Learning
- TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
Original note title
agency emerges from strategic curation of 78 demonstrations not data abundance — challenging scaling paradigms for agentic intelligence