SYNTHESIS NOTE
Agentic Systems and Tool Use

Why do production AI agents stay deliberately simple?

Production AI agents are far simpler than research suggests: most execute at most 10 steps and avoid third-party frameworks. What explains this gap between research ambition and deployment reality?

Synthesis note · 2026-03-28 · sourced from Agentic Research

"Measuring Agents in Production" (2024) presents the first large-scale systematic study of AI agents deployed in real production environments — 306 practitioners surveyed, 20 in-depth case studies via interviews, across 26 domains.

The findings directly challenge the complexity narrative in agent research:

Simple methods dominate. 70% of deployed agents use off-the-shelf models without weight tuning, relying entirely on prompting. Teams select the most capable, expensive frontier models available because cost and latency remain favorable compared to human baselines. 79% rely heavily on manual prompt construction, and production prompts can exceed 10,000 tokens.

Autonomy is deliberately constrained. 68% of production agents execute at most 10 steps before requiring human intervention. 47% execute fewer than 5 steps. This is not a capability limitation — it is a design choice. Organizations constrain autonomy to maintain reliability, the top development challenge.
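The step cap described above can be sketched as a bounded loop that hands control back to a person once the budget is spent. This is a minimal illustration, not a method taken from the study: `run_bounded_agent`, `RunResult`, `step_fn`, and the default of 10 steps are all hypothetical names and values chosen to mirror the reported figures.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RunResult:
    status: str                      # "done" or "needs_human"
    steps_taken: int                 # number of actions the agent executed
    history: List[str] = field(default_factory=list)


def run_bounded_agent(
    step_fn: Callable[[List[str]], Optional[str]],
    max_steps: int = 10,             # assumed cap, mirroring the "at most 10 steps" finding
) -> RunResult:
    """Run a hypothetical agent step function under a hard step budget.

    step_fn receives the actions taken so far and returns the next action,
    or None to signal that the task is finished.
    """
    history: List[str] = []
    for _ in range(max_steps):
        action = step_fn(history)
        if action is None:
            return RunResult("done", len(history), history)
        history.append(action)
    # Budget exhausted without a completion signal: escalate instead of continuing.
    return RunResult("needs_human", len(history), history)


def toy_step(history: List[str]) -> Optional[str]:
    """Toy agent that finishes after three actions."""
    return None if len(history) >= 3 else f"action-{len(history) + 1}"


finished = run_bounded_agent(toy_step)
stuck = run_bounded_agent(lambda history: "retry", max_steps=5)
```

In this sketch `finished` completes inside the budget, while `stuck` never signals completion and is returned with status `needs_human` after five actions. The escalation path is an ordinary return value rather than an exception, which makes the handoff to a reviewer a normal, expected outcome of a run.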

Custom builds over frameworks. 85% of detailed case studies forgo third-party agent frameworks, building custom agent applications from scratch. This suggests that current frameworks do not match production requirements. Read alongside the related note "Why do protocol-based tool integrations fail in production workflows?", the preference for custom builds reflects a reliability imperative.

Human evaluation persists. 74% depend primarily on human evaluation. Automated evaluation has not displaced human judgment in production. This is consistent with the related note "Does setting temperature to zero actually make LLM outputs reliable?": single automated evaluations are insufficient for reliability-critical deployment.

The gap between research and production is stark. Research pushes toward multi-agent systems, complex reasoning chains, and autonomous tool use. Production gravitates toward well-scoped, static workflows with a human in the loop. As the related note "Why do AI agents fail at workplace social interaction?" discusses, agents remain unreliable in open-ended settings; the production community has learned this lesson and constrains autonomy accordingly.

The practical implication: "simple yet effective methods already enable agents to deliver impact across diverse industries." Complexity is not required for production value — and may be counterproductive when reliability is the binding constraint.

Original note title

production AI agents are deliberately simple and constrained — 68 percent execute at most 10 steps and 85 percent forgo third-party frameworks in favor of custom builds