Do persistent agents really cost less per token?
When AI agents reuse cached context across tasks, does the standard cost-per-token metric still reveal true economic efficiency? A case study suggests the answer may be no.
A 115-day case study of one physician-scientist running a persistent agentic research environment found that 82.9% of the tokens recorded in May were cache reads. The workflow was cache-dominant: the agent increasingly reasoned over reused, accumulated context rather than relying on fresh inference. The author infers that persistent agentic environments may shift the economic unit from cost per token to cost per completed artifact.
This matters because cost per token is the native unit for pricing and benchmarking, and it gives a systematically misleading picture of persistent agents. When most tokens are cheap cache reads against a durable memory layer, the marginal token says almost nothing about the cost of getting useful work done. The expensive resource is the accumulated context and the reusable procedures that make each new task cheap. Two agents with identical token counts can differ enormously in the artifacts they produce.
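The arithmetic behind that last claim can be sketched in a few lines. The prices, the token split, and the artifact counts below are hypothetical values chosen for illustration; only the 82.9% cache-read share echoes the case study, and nothing here reproduces the author's own accounting.

```python
from dataclasses import dataclass

# Hypothetical prices in dollars per million tokens (assumptions, not from the case study).
PRICE_FRESH_INPUT = 3.00
PRICE_CACHE_READ = 0.30
PRICE_OUTPUT = 15.00


@dataclass
class Usage:
    fresh_input: int  # uncached input tokens
    cache_read: int   # input tokens served from cache
    output: int       # generated tokens
    artifacts: int    # completed artifacts attributed to this usage

    @property
    def total_tokens(self) -> int:
        return self.fresh_input + self.cache_read + self.output

    @property
    def cost(self) -> float:
        # Total spend in dollars across the three token classes.
        return (
            self.fresh_input * PRICE_FRESH_INPUT
            + self.cache_read * PRICE_CACHE_READ
            + self.output * PRICE_OUTPUT
        ) / 1_000_000

    @property
    def cost_per_token(self) -> float:
        return self.cost / self.total_tokens

    @property
    def cost_per_artifact(self) -> float:
        return self.cost / self.artifacts


# Two hypothetical agents with identical token usage (82.9% cache reads)
# that differ only in how many artifacts they completed.
agent_a = Usage(fresh_input=121_000, cache_read=829_000, output=50_000, artifacts=2)
agent_b = Usage(fresh_input=121_000, cache_read=829_000, output=50_000, artifacts=10)

for name, agent in (("A", agent_a), ("B", agent_b)):
    print(name, f"{agent.cost_per_token:.2e} $/token", f"{agent.cost_per_artifact:.3f} $/artifact")
```

Under these assumed numbers both agents report the same cost per token, while the cost per artifact differs by a factor of five. The token-level metric cannot distinguish them.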
The counterpoint is that cost per artifact is hard to standardize: "artifact" is fuzzy (a paragraph? a paper? a repository?), and reproducible artifact-level denominators barely exist, which is exactly why the field defaults to tokens. But defaulting to a measurable but wrong unit is still wrong. The methodological recommendation that follows is concrete: future evaluations should adopt artifact-level denominators and cost-per-artifact estimates, because the economics of a stateful, cache-dominant agent live at the artifact level, not the token level.
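One possible way to make an artifact-level denominator less fuzzy is to declare the artifact type explicitly and report cost per artifact within each type, rather than pooling paragraphs with repositories. The sketch below assumes a hypothetical log of `(artifact_type, cost)` records; the type names and dollar figures are invented for illustration and are not drawn from the case study.

```python
from collections import defaultdict

# Hypothetical log: one (artifact_type, cost_in_dollars) record per completed artifact.
completed = [
    ("paragraph", 0.04),
    ("paragraph", 0.06),
    ("manuscript", 12.50),
    ("repository", 30.00),
]


def cost_per_artifact_by_type(records):
    """Return the mean cost per completed artifact, grouped by declared type."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for artifact_type, cost in records:
        totals[artifact_type] += cost
        counts[artifact_type] += 1
    return {t: totals[t] / counts[t] for t in totals}


by_type = cost_per_artifact_by_type(completed)
for artifact_type, mean_cost in by_type.items():
    print(f"{artifact_type}: {mean_cost:.2f} $/artifact")
```

Grouping by type keeps each denominator comparable across evaluations, at the price of requiring every evaluation to state which artifact types it counts.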
Inquiring lines that use this note as a source 45
This note is a source for these synthesized inquiries.
- How does token-based production differ from digital file production?
- How does token generation as flow differ from print's archival storage?
- How does tokenization differ from commodity production in capitalism?
- What is craft-residue and why does its loss matter?
- How does the token frame predict different economic outcomes than commodity framing?
- Why do tokens need validators while commodities need standardization?
- Does distributed serving defeat the identity of a single virtual instance?
- Why does distributed serving infrastructure defeat hardware-instance accounts of the interlocutor?
- Why do persistent chatbot companions face novelty decay that ad-hoc supporters avoid?
- What happens to token value when populations surrender cognitively at different rates?
- How do virtual model instances preserve identity through load-balancing and failover?
- Are threads or virtual instances better candidates than hardware for the interlocutor?
- Why would compute-replacement cost determine wages instead of productivity?
- Do multi-agent systems justify their token costs with genuine quality gains?
- Why do multi-agent systems use 15 times more tokens than chat interactions?
- How does test-time scaling relate to token budget in agentic deep research?
- Does upgrading model capability improve token efficiency in agentic systems?
- Do latent communication approaches truly escape token economics constraints?
- What production costs does personalization infrastructure impose on AI systems?
- How should token budgets be allocated when prompt-inference coupling matters?
- Can prompt optimization for clarity automatically improve token efficiency?
- What is the cost difference between filtering context versus attending to everything?
- Why does recomputing weights cost less than moving them on phones?
- Can latent communication reduce the token cost of multi-agent systems?
- When is 15x token overhead actually worth the compute cost?
- How should we measure context efficiency and verification cost in agents?
- How much does external API latency dominate total agent execution cost?
- How should benchmarks measure agent efficiency across all three cost dimensions?
- How do agents decide which created code should persist versus disappear?
- Can one-off agent code be safely promoted to durable infrastructure?
- What metrics replace throughput per token for agent deployment?
- How do tool invocations drive agentic cost beyond token consumption?
- Should artifact-level benchmarks replace token counts for agent evaluation?
- How do cache-dominant workflows change the marginal cost of agent tasks?
- Can two agents with identical token counts produce vastly different outputs?
- Why do frontier models remain cost-effective despite higher token prices in production?
- How much does shared-prefix sampling reduce token redundancy empirically?
- What is the relationship between prefix sharing and speculative decoding?
- Can KV cache pruning serve as an alternative to consolidation?
- How does durable memory quality shape agent performance over time?
- How will the agent economy reshape compute infrastructure design?
- How does external context control compare to agents managing their own state internally?
- What mechanisms enable some firms to adopt AI more cheaply than others?
- Which firms capture the cost advantages from labor-to-AI substitution?
- What makes persistent, shared code artifacts from agents hard to manage at scale?
Related concepts in this collection 4
- Why does agent efficiency differ from model size reduction?
  Explores why making models smaller doesn't solve agent cost problems. Agents loop recursively, compounding costs multiplicatively, so efficiency requires system-level design, not just parameter reduction.
  extends: both reject per-token accounting for agents, this note via cache-dominant economics, that note via the success-versus-cost frontier as the right metric
- What should we actually measure in agent evaluation?
  Current agent benchmarks reduce performance to a single success metric, potentially hiding critical differences in how agents operate. What dimensions beyond task accuracy should evaluation frameworks capture?
  synthesizes: cost-per-artifact is the economic counterpart to the trajectory-level evaluation this note's denominator demands
- What makes agent-created code artifacts so hard to manage?
  Agent-authored code that persists and is shared across systems raises difficult questions about what should be kept versus discarded, and how to maintain consistent state when multiple agents collaborate on the same artifacts.
  grounds the artifact unit: the persistent, reusable artifacts that make each new task cheap are exactly the cache-dominant durable layer driving the cost shift
- Will agents compete for attention just like users do?
  As autonomous agents take over user tasks, will the Web's economic competition shift from human clicks to agent invocations? This explores whether existing ad-market mechanisms could scale to agent decision-making.
  synthesizes: both relocate the economic unit away from human-facing metrics (clicks, tokens) toward agent-completed work
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Persistent AI Agents in Academic Research: A Single-Investigator Implementation Case Study
- How we built our multi-agent research system
- Toward Efficient Agents: A Survey of Memory, Tool Learning, and Planning
- Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets
- Artifacts as Memory Beyond the Agent Boundary
- Towards a Science of Scaling Agent Systems
- From Model Scaling to System Scaling: Scaling the Harness in Agentic AI
- LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
Original note title
persistent agentic environments shift the economic unit from cost per token to cost per completed artifact