Why do production AI agents stay deliberately simple?
Production AI agents are far simpler than research suggests: most execute at most 10 steps and avoid third-party frameworks. What explains this gap between research ambition and deployment reality?
"Measuring Agents in Production" (2024) presents the first large-scale systematic study of AI agents deployed in real production environments — 306 practitioners surveyed, 20 in-depth case studies via interviews, across 26 domains.
The findings directly challenge the complexity narrative in agent research:
Simple methods dominate. 70% of deployed agents use off-the-shelf models without weight tuning, relying entirely on prompting. Teams select the most capable, expensive frontier models available because cost and latency remain favorable compared to human baselines. 79% rely heavily on manual prompt construction, and production prompts can exceed 10,000 tokens.
Autonomy is deliberately constrained. 68% of production agents execute at most 10 steps before requiring human intervention. 47% execute fewer than 5 steps. This is not a capability limitation — it is a design choice. Organizations constrain autonomy to maintain reliability, the top development challenge.
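The step cap described above can be sketched as a bounded loop that hands control back to a person once the budget is spent. This is a minimal illustration, not the study's or any team's implementation: `run_bounded`, `RunResult`, and the `step_fn` callback are hypothetical names, and the default budget of 10 simply mirrors the figure reported above.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RunResult:
    status: str                      # "done" or "needs_human"
    steps_taken: int                 # number of actions the agent performed
    history: List[str] = field(default_factory=list)


def run_bounded(
    step_fn: Callable[[List[str]], Optional[str]],
    max_steps: int = 10,             # assumed budget, echoing the 10-step figure
) -> RunResult:
    """Call step_fn at most max_steps times.

    step_fn receives the action history and returns the next action,
    or None when the task is finished. If the budget runs out first,
    the run is flagged for human intervention instead of continuing.
    """
    history: List[str] = []
    for _ in range(max_steps):
        action = step_fn(history)
        if action is None:
            return RunResult("done", len(history), history)
        history.append(action)
    return RunResult("needs_human", len(history), history)


# Hypothetical step functions standing in for a model call.
def three_step_task(history: List[str]) -> Optional[str]:
    return None if len(history) >= 3 else f"action-{len(history) + 1}"


def never_done(history: List[str]) -> Optional[str]:
    return "retry"


finished = run_bounded(three_step_task)
escalated = run_bounded(never_done, max_steps=5)
print(finished.status, finished.steps_taken)
print(escalated.status, escalated.steps_taken)
```

The design choice is that the loop never decides on its own to keep going: exhausting the budget is treated as a normal outcome that routes to a human, which is one way to read "autonomy as a design choice" rather than a capability limit.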
Custom builds over frameworks. 85% of the detailed case studies forgo third-party agent frameworks, building custom agent applications from scratch. This suggests that current frameworks do not match production requirements. Given the reliability problems explored in "Why do protocol-based tool integrations fail in production workflows?", the preference for custom builds plausibly reflects a reliability imperative.
Human evaluation persists. 74% depend primarily on human evaluation. Automated evaluation has not displaced human judgment in production. This is consistent with "Does setting temperature to zero actually make LLM outputs reliable?": single automated evaluations are insufficient for reliability-critical deployment.
The gap between research and production is stark. Research pushes toward multi-agent systems, complex reasoning chains, and autonomous tool use. Production gravitates toward well-scoped, static workflows with a human in the loop. As "Why do AI agents fail at workplace social interaction?" explores, agents struggle with communication and coordination in realistic workplace settings; the production community has learned this lesson and constrains autonomy accordingly.
The practical implication: "simple yet effective methods already enable agents to deliver impact across diverse industries." Complexity is not required for production value — and may be counterproductive when reliability is the binding constraint.
Related concepts in this collection 4
- Why do AI agents fail at workplace social interaction?
  Explores why current AI agents struggle most with communicating and coordinating with colleagues in realistic workplace settings, despite strong reasoning capabilities in other domains.
  Connection: benchmark evidence for why production constrains autonomy
- Why do protocol-based tool integrations fail in production workflows?
  Explores whether standardized tool protocols like MCP introduce non-determinism that undermines agent reliability, and what causes ambiguous tool selection in production systems.
  Connection: the reliability imperative behind custom builds
- Can small language models handle most agent tasks?
  Explores whether smaller, cheaper models are actually sufficient for the repetitive, scoped work that dominates deployed agent systems, rather than relying on large models by default.
  Connection: production data confirms that most agent work is repetitive and scoped
- Why do capable AI agents still fail in real deployments?
  Explores whether agent failures stem from insufficient capability or from missing ecosystem conditions like user trust, value clarity, and social norms. Understanding this distinction matters for predicting which agents will succeed.
  Connection: production agents succeed by satisfying ecosystem conditions, not by maximizing capability
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Measuring Agents in Production
- What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity
- A Practical Guide for Designing, Developing, and Deploying Production-Grade Agentic AI Workflows
- LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
- TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
- Why Do Multi-agent LLM Systems Fail?
- LIMI: Less is More for Agency
- Exploring Autonomous Agents: A Closer Look at Why They Fail When Completing Tasks
Original note title
production AI agents are deliberately simple and constrained — 68 percent execute at most 10 steps and 85 percent forgo third-party frameworks in favor of custom builds