SYNTHESIS NOTE

Do persistent agents really cost less per token?

When AI agents reuse cached context across tasks, does the standard cost-per-token metric still reveal true economic efficiency? A case study suggests the answer may be no.

Synthesis note · 2026-05-28 · sourced from Work Application Use Cases

A 115-day case study of one physician-scientist running a persistent agentic research environment found that 82.9% of the tokens recorded in May were cache reads. The workflow was cache-dominant: the agent increasingly reasoned over reused, accumulated context rather than performing fresh inference. From this, the author infers that persistent agentic environments may shift the economic unit from cost per token to cost per completed artifact.
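
To make "cache-dominant" concrete, the following is a minimal sketch of how a cache-read share could be computed from a usage log. The four token categories and the counts are hypothetical, chosen only so that the share matches the 82.9% figure above; the case study's actual token breakdown is not given in this note.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    """Token counts for one reporting period (categories are an assumption)."""
    cache_read: int
    cache_write: int
    fresh_input: int
    output: int

    @property
    def total(self) -> int:
        return self.cache_read + self.cache_write + self.fresh_input + self.output


def cache_read_share(usage: TokenUsage) -> float:
    """Fraction of all recorded tokens that were served from cache."""
    if usage.total == 0:
        raise ValueError("no tokens recorded")
    return usage.cache_read / usage.total


# Hypothetical counts, not data from the case study.
usage = TokenUsage(
    cache_read=829_000,
    cache_write=51_000,
    fresh_input=80_000,
    output=40_000,
)
share = cache_read_share(usage)
print(f"cache-read share: {share:.1%}")
```

A workflow would count as cache-dominant under this sketch when the share is well above one half, although any such threshold is a judgment call rather than something the case study defines.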

This matters because cost per token is the native unit for pricing and benchmarking, and it systematically misleads about persistent agents. When most tokens are cheap cache reads against a durable memory layer, the marginal token says almost nothing about the cost of getting useful work done. The expensive resource is the accumulated context and the reusable procedures that make each new task cheap. Two agents with identical token counts can differ enormously in the number of artifacts they produce.
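
The gap between the two units can be shown with a small sketch. Everything in it is an illustrative assumption: the price table, the token mix, and the artifact counts are hypothetical and do not come from the case study or from any provider's price list. The two runs use identical tokens, so cost per token cannot tell them apart, while cost per artifact can.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens (assumed for illustration only).
PRICE_PER_MTOK = {"cache_read": 0.30, "fresh_input": 3.00, "output": 15.00}


@dataclass
class Run:
    """One agent's usage over a period, with a count of completed artifacts."""
    name: str
    tokens: dict[str, int]
    artifacts: int

    def cost(self) -> float:
        # Total spend: each token category billed at its own rate.
        return sum(PRICE_PER_MTOK[kind] * n / 1_000_000
                   for kind, n in self.tokens.items())

    def cost_per_token(self) -> float:
        return self.cost() / sum(self.tokens.values())

    def cost_per_artifact(self) -> float:
        if self.artifacts == 0:
            raise ValueError("no completed artifacts to divide by")
        return self.cost() / self.artifacts


# Same token mix for both runs; only the number of finished artifacts differs.
mix = {"cache_read": 8_000_000, "fresh_input": 1_500_000, "output": 500_000}
run_a = Run("agent A", dict(mix), artifacts=3)
run_b = Run("agent B", dict(mix), artifacts=12)

for run in (run_a, run_b):
    print(f"{run.name}: {run.cost_per_token():.2e} USD/token, "
          f"{run.cost_per_artifact():.2f} USD/artifact")
```

Under these assumed numbers both runs cost the same per token, yet agent B's cost per artifact is a quarter of agent A's, which is the distinction a token-level benchmark would miss.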

The counterpoint is that cost per artifact is hard to standardize. "Artifact" is a fuzzy unit (a paragraph? a paper? a repository?), and reproducible artifact-level denominators barely exist, which is exactly why the field defaults to tokens. But a unit that is easy to measure and wrong is still wrong. The methodological recommendation that follows is concrete: future evaluations should adopt artifact-level denominators and report cost-per-artifact estimates, because the economics of a stateful, cache-dominant agent live at the artifact level, not the token level.

Inquiring lines that read this note (52)

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Does tokenized intelligence retain genuine value through exchange-based systems?

How do prompt structure and constraints affect model instruction reliability?

What factors beyond surface content determine how readers extract meaning differently?

What is craft-residue and why does its loss matter?

Why do multi-turn conversations degrade AI intent and coherence?

Does distributed serving defeat the identity of a single virtual instance?

How can LLM user simulators model realistic goal-driven conversation?

How do chatbots affect human self-disclosure and emotional engagement?

Why do persistent chatbot companions face novelty decay that ad-hoc supporters avoid?

Does externalizing cognitive work and state improve agent reliability?

How does AI adoption affect human skill development and labor equality?

What drives capability and cost efficiency in agent systems?

Can inference-time compute substitute for scaling up model parameters?

How does test-time scaling relate to token budget in agentic deep research?

How do multi-agent systems achieve genuine cooperation and reasoning?

How should personalization be implemented to improve AI assistant effectiveness?

What production costs does personalization infrastructure impose on AI systems?

How should inference compute be adaptively allocated based on prompt difficulty?

How does sequence length affect sparsity tolerance in models?

What is the cost difference between filtering context versus attending to everything?

Why do continual learning scenarios trigger catastrophic forgetting and interference?

Why does recomputing weights cost less than moving them on phones?

How should agents balance memory condensation to optimize context efficiency?

Can single-axis benchmarks accurately predict agent deployment success?

How should benchmarks measure agent efficiency across all three cost dimensions?

How should systems govern persistent agent-generated code in shared infrastructure?

When does architectural design matter more than raw model capacity?

Why do frontier models remain cost-effective despite higher token prices in production?

Which computational strategies best support reasoning in language models?

What is the relationship between prefix sharing and speculative decoding?

What role does compression play in language model capability and generalization?

How should memory consolidation strategies shape agent performance over time?

How does durable memory quality shape agent performance over time?

Why do self-improving systems struggle without clear external performance metrics?

Why do persistent AI systems require fundamentally different design than ad-hoc supporters?

What memory architectures best support persistent reasoning across extended interactions?

How should we measure operational cost of memory systems in production?

How does memorization interact with learning and generalization?

How much improvement comes from caching versus actual capability gain?

Related concepts in this collection (4)

Why does agent efficiency differ from model size reduction?
Explores why making models smaller doesn't solve agent cost problems. Agents loop recursively, compounding costs multiplicatively, so efficiency requires system-level design, not just parameter reduction.
extends: both reject per-token accounting for agents, this note via cache-dominant economics, that note via the success-versus-cost frontier as the right metric

Should agent evaluation measure more than task success?
Current benchmarks reduce agents to a single success score, but agents emerge from multiple interacting systems. What dimensions of agent behavior should builders actually measure to predict deployment readiness?
synthesizes: cost-per-artifact is the economic counterpart to the trajectory-level evaluation this note's denominator demands

What makes agent-authored code worth persisting and sharing?
Agent-created artifacts like patches, tests, and skill libraries outlive single tasks, but we lack guidance on what should persist, how to maintain consistency across agents, and when persistence is worth the engineering effort.
grounds the artifact unit: the persistent, reusable artifacts that make each new task cheap are exactly the cache-dominant durable layer driving the cost shift

Will agents compete for attention just like users do?
As autonomous agents take over user tasks, will the Web's economic competition shift from human clicks to agent invocations? This explores whether existing ad-market mechanisms could scale to agent decision-making.
synthesizes: both relocate the economic unit away from human-facing metrics (clicks, tokens) toward agent-completed work
