INQUIRING LINE


The biggest cost of running an AI agent might not be the AI itself, but all the outside services it has to call.

How much does external API latency dominate total agent execution cost?

This explores whether the slow round-trips to outside services (search engines, tool/function APIs, UI calls) are the thing that actually drives an agent's cost — and the corpus answers sideways: it treats external calls as a dominant cost driver, but measures that cost more in tokens and task-completion time than in raw network latency.

The corpus doesn't measure wall-clock latency head-on, but it converges on a related claim: external calls are the expensive part, and a surprising amount of agent research is really about getting rid of them. The strongest single signal is the note "Do efficiency techniques across agent components reveal shared structural constraints?": independent lines of work on memory, tool use, and planning all rediscover the same principle of 'minimizing external calls,' which suggests round-trips to outside systems are a structural cost pressure, not an incidental one.

Where the corpus gets concrete, the cost shows up as task-completion time rather than network milliseconds. The AXIS framework in "Can API-first agents outperform UI-based agent interaction?" cuts task completion time by 65–70% specifically by replacing long sequences of UI interactions with direct API calls. The twist is that the slow path isn't the API; it's the chatty back-and-forth that the API lets you skip. The bottleneck is the *number* of interaction steps, and a single well-chosen call collapses many slow ones.
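A back-of-the-envelope sketch of why step count can matter more than per-call speed. The step counts and per-step latencies are illustrative assumptions, chosen so the result lands inside the reported 65–70% range; they are not figures from the AXIS paper, and `task_time` is a hypothetical helper.

```python
def task_time(steps: int, seconds_per_step: float) -> float:
    """Total wall-clock time for a task made of sequential steps."""
    return steps * seconds_per_step

# Assumed numbers for illustration only.
ui_time = task_time(steps=12, seconds_per_step=2.0)  # many quick UI interactions
api_time = task_time(steps=1, seconds_per_step=8.0)  # one slower direct API call

reduction = 1 - api_time / ui_time
print(f"UI path: {ui_time:.1f}s, API path: {api_time:.1f}s, reduction: {reduction:.0%}")
# prints "UI path: 24.0s, API path: 8.0s, reduction: 67%"
```

Even with the single API call assumed to be four times slower than any one UI step, the shorter sequence wins, because the total is driven by the number of sequential round-trips.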

Two training-side notes attack external-call cost so aggressively that they delete the API entirely: "Can LLMs replace search engines during agent training?" shows that a 14B model can generate search results from internal knowledge well enough to skip real search APIs during RL, and "Can simulated APIs and token-level credit assignment train better tool-using agents?" replaces costly real-API interactions with LLM-simulated ones. You don't simulate away something that's cheap and fast. That 'fake the API with another model' is a winning move suggests real external calls were the dominant burden in that training loop.
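The general pattern behind both notes is an interface swap: the training loop depends on a search function, and a model-backed stand-in takes the place of the real API. The sketch below shows only that swap, under loud assumptions: it is not the ZeroSearch, SSRL, or ToolPO implementation, and `simulated_search`, `rollout`, and `fake_llm` are hypothetical names.

```python
from typing import Callable, List

# A search backend is anything that maps a query to a list of documents.
SearchFn = Callable[[str], List[str]]

def simulated_search(generate: Callable[[str], str], n_docs: int = 3) -> SearchFn:
    """Build a search backend that asks a model to write documents
    instead of calling a real search API."""
    def search(query: str) -> List[str]:
        return [generate(f"Write search result {i + 1} for: {query}")
                for i in range(n_docs)]
    return search

def rollout(question: str, search: SearchFn) -> str:
    """One toy rollout step: retrieve documents and join them into a context."""
    return "\n".join(search(question))

# Stand-in for a real model call so the sketch runs offline (assumption).
def fake_llm(prompt: str) -> str:
    return f"[generated] {prompt}"

search = simulated_search(fake_llm, n_docs=2)
context = rollout("who proposed the transformer?", search)
print(context)
```

Because `rollout` only sees the `SearchFn` signature, a real search client could be passed in at evaluation time without changing the loop.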

The quieter counterpoint is that for *inference*, the corpus keeps pointing at tokens, not latency, as the cost denominator. "How does test-time scaling work at the agent level?" finds that 80% of multi-agent performance variance comes from token budget, and "Do persistent agents really cost less per token?" argues the right unit is completed artifacts (in a case study where 82.9% of tokens were cache reads). So 'cost' splits in two: the compute/token bill, which the corpus measures carefully, and external-call latency, which it treats as a thing to engineer around rather than a thing to quantify.
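To make the "completed artifacts as denominator" idea concrete, here is a minimal sketch. The prices, token counts, and artifact count are invented for illustration (only the 82.9% cache share mirrors the note), and `RunStats` and `cost_per_artifact` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class RunStats:
    fresh_tokens: int         # tokens processed at full price
    cached_tokens: int        # tokens served as cache reads
    artifacts_completed: int  # finished deliverables, the proposed cost unit

def cost_per_artifact(stats: RunStats, price_fresh: float, price_cached: float) -> float:
    """Token bill divided by completed artifacts (prices are per token)."""
    bill = stats.fresh_tokens * price_fresh + stats.cached_tokens * price_cached
    return bill / stats.artifacts_completed

# Assumed run: 1M tokens, 82.9% of them cache reads, 10 artifacts.
stats = RunStats(fresh_tokens=171_000, cached_tokens=829_000, artifacts_completed=10)

# Assumed prices: cache reads billed at one tenth of the fresh-token price.
with_cache = cost_per_artifact(stats, price_fresh=3e-6, price_cached=3e-7)
no_cache = cost_per_artifact(stats, price_fresh=3e-6, price_cached=3e-6)

print(f"per artifact: {with_cache:.4f} with cache vs {no_cache:.4f} without")
```

Under these assumed prices, the same token count yields a very different bill per artifact depending on how much of the context is reused, which is why a flat per-token figure can mislead.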

The useful surprise: nobody in this collection has actually published 'external API latency is X% of total cost.' What they've published is a stack of techniques whose entire reason for existing is to avoid, batch, cache, or simulate external calls, alongside using cheaper small models for the repetitive tool-shaped work, as in "Can small language models handle most agent tasks?". The dominance of external calls is the unstated premise behind much of this efficiency literature, even where no one stops to put a number on it.
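If someone did want to put a number on the latency share, one plausible starting point is to time model calls and external calls separately inside the agent loop. This is a minimal sketch, not a method from any of the cited notes: `LatencyLedger` is a hypothetical class, and the `time.sleep` calls stand in for real inference and API round-trips.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class LatencyLedger:
    """Accumulates wall-clock seconds per category of work."""

    def __init__(self):
        self.seconds = defaultdict(float)

    @contextmanager
    def track(self, category: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.seconds[category] += time.perf_counter() - start

    def share(self, category: str) -> float:
        """Fraction of all tracked time spent in one category."""
        total = sum(self.seconds.values())
        return self.seconds[category] / total if total else 0.0

ledger = LatencyLedger()
with ledger.track("model"):
    time.sleep(0.01)  # stand-in for an LLM inference call
with ledger.track("external"):
    time.sleep(0.03)  # stand-in for a search or tool API round-trip

print(f"external share of wall-clock time: {ledger.share('external'):.0%}")
```

This measures only wall-clock share; turning it into a cost share would additionally require pricing for tokens and for each external service, which the notes do not supply.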

Sources (7 notes)

Do efficiency techniques across agent components reveal shared structural constraints?

Techniques for memory, tool learning, and planning independently converge on shared principles: context bounding, minimizing external calls, and controlled search. This convergence suggests these reflect fundamental structural pressures in agentic computation rather than component-specific optimizations.

Can API-first agents outperform UI-based agent interaction?

The AXIS framework shows that prioritizing API calls over sequential UI interactions cuts task completion time by 65–70% while maintaining 97–98% accuracy and reducing cognitive workload by 38–53%. A self-exploration mechanism automatically discovers and constructs APIs from existing applications, solving the bootstrapping problem.

Can LLMs replace search engines during agent training?

ZeroSearch and SSRL demonstrate that LLMs can generate relevant documents and search results from internal knowledge, with 14B simulators matching or exceeding real search engines. Curriculum degradation and test-time scaling optimize this approach for training without API costs.

Can simulated APIs and token-level credit assignment train better tool-using agents?

ToolPO replaces costly real-API interactions with LLM-simulated ones and assigns credit directly to tool-invocation tokens rather than spreading outcome rewards across trajectories. This combination improves training stability and sample efficiency for tool-using agents.

How does test-time scaling work at the agent level?

Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.


Do persistent agents really cost less per token?

A 115-day case study found 82.9% of tokens were cache reads. When context persists and reuses, the meaningful cost denominator becomes completed artifacts, not individual tokens.

Can small language models handle most agent tasks?

SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Towards a Science of Scaling Agent Systems · 5.02 match
Toward Efficient Agents: A Survey of Memory, Tool Learning, and Planning · 2.50 match
Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets · 2.49 match
LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries · 2.49 match
Artifacts as Memory Beyond the Agent Boundary · 2.44 match
From Model Scaling to System Scaling: Scaling the Harness in Agentic AI · 2.44 match
Small Language Models are the Future of Agentic AI · 1.74 match
How we built our multi-agent research system · 1.72 match

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tracking real-time shifts in agent execution economics. The question remains: How much does external API latency dominate total agent execution cost?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026. A library of agent research converges on minimizing external calls as a structural efficiency principle, but does NOT directly measure latency as a % of total cost:
• External API calls are treated as the primary cost pressure; work on memory, tool learning, and planning independently rediscovers 'minimize round-trips' (~2026).
• AXIS-style direct API replacement cuts task completion time 65–70% by collapsing chatty interaction sequences into single calls (~2024-09).
• LLM-simulated APIs (replacing real external calls with internal model generations) and search-from-knowledge approaches eliminate API costs entirely, suggesting real calls were the dominant burden (~2025-05).
• For inference cost, token budget (not latency) explains 80% of multi-agent performance variance; 82.9% of tokens served from cache in persistent environments (~2026-05).
• Cost unit is fragmenting: compute/token measured precisely; external-call latency engineered around rather than quantified.

Anchor papers (verify; mind their dates):
• arXiv:2601.14192 (2026-01): Toward Efficient Agents — convergence of memory, tool, planning constraints.
• arXiv:2505.04588 (2025-05): ZeroSearch — LLM internal knowledge replaces search APIs.
• arXiv:2409.17140 (2024-09): AXIS — 65–70% task time reduction via API-first interaction.
• arXiv:2605.26112 (2026-05): System Scaling — cache prevalence and economic units.

Your task:
(1) RE-TEST EACH CLAIM. For every efficiency technique (simulation, batching, caching, small-model routing), determine whether newer inference engines (speculative decoding, continuous batching, GPU-accelerated orchestration) or cheaper external APIs (cost/latency ratio) have REVERSED the dominance of external calls. Is 'minimize API calls' still the primary lever, or has latency become negligible relative to token compute? Separate the durable question (are external calls expensive?) from the perishable constraint (is minimizing them the #1 optimization?). Ground your answer in recent arXiv work.
(2) Surface the strongest CONTRADICTING work from the last 6 months: any paper showing external latency is NOT dominant, or that token cost now overwhelms call overhead.
(3) Propose 2 research questions that ASSUME the regime has shifted—e.g., 'If GPU-batch inference and cached APIs make latency sub-1ms, does the *variance* in latency (jitter) now dominate cost?', or 'Does multi-agent orchestration (concurrent API calls) change the cost equation?'

Cite arXiv IDs; flag anything you cannot ground in a real paper.
