INQUIRING LINE

What metrics replace throughput per token for agent deployment?

This explores what we should measure instead of tokens-per-second once agents run as persistent, long-horizon systems rather than single prompt-response calls.


The corpus suggests that the denominator itself changes, not just the metric. The most direct answer is that the meaningful unit shifts from the token to the completed artifact. A 115-day case study found that 82.9% of tokens were cache reads, so counting raw tokens badly misrepresents what the work actually cost; when context persists and is reused, the honest denominator is finished pieces of work, not individual tokens (Do persistent agents really cost less per token?).
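To make the change of denominator concrete, here is a minimal Python sketch that prices one deployment window and divides by finished artifacts. Only the 82.9% cache-read share comes from the case study cited above; the token counts, the artifact count, the per-million-token prices, and names such as `UsageLog` and `cost_per_artifact` are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class UsageLog:
    """Token counts for one deployment window (hypothetical values)."""
    cache_read_tokens: int
    fresh_input_tokens: int
    output_tokens: int
    artifacts_completed: int


# Hypothetical dollar prices per million tokens; substitute your provider's rates.
PRICE_PER_MTOK = {"cache_read": 0.30, "fresh_input": 3.00, "output": 15.00}


def total_tokens(log: UsageLog) -> int:
    return log.cache_read_tokens + log.fresh_input_tokens + log.output_tokens


def cache_read_share(log: UsageLog) -> float:
    """Fraction of all tokens that were served from cache."""
    return log.cache_read_tokens / total_tokens(log)


def total_cost(log: UsageLog) -> float:
    """Dollar cost with each token class priced at its own rate."""
    return (
        log.cache_read_tokens * PRICE_PER_MTOK["cache_read"]
        + log.fresh_input_tokens * PRICE_PER_MTOK["fresh_input"]
        + log.output_tokens * PRICE_PER_MTOK["output"]
    ) / 1_000_000


def cost_per_artifact(log: UsageLog) -> float:
    """Dollar cost divided by finished pieces of work."""
    if log.artifacts_completed == 0:
        raise ValueError("no completed artifacts in this window")
    return total_cost(log) / log.artifacts_completed


# Illustrative window shaped to match the 82.9% cache-read share.
window = UsageLog(
    cache_read_tokens=829_000,
    fresh_input_tokens=121_000,
    output_tokens=50_000,
    artifacts_completed=4,
)
print(f"cache-read share: {cache_read_share(window):.1%}")
print(f"cost per artifact: ${cost_per_artifact(window):.4f}")
```

Two windows with the same token total can differ widely in cost per artifact, depending on how many tokens were cheap cache reads and how much finished work came out. That gap is what a raw token count hides.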

But cost-per-artifact is only one axis. A recurring theme is that a single number, whether throughput or task success, hides the multidimensional behavior that actually determines whether an agent is deployable. One line of work argues capability is a *vector* across separable axes: task success, privacy compliance, long-horizon retention, mode-shift behavior, and ecosystem readiness. Models that top one axis often rank low on another, which makes single-score rankings systematically misleading (Does a single benchmark score actually predict agent readiness?). A closely related argument reframes evaluation around the *trajectory* rather than the endpoint, proposing benchmarks for trajectory quality, memory hygiene, context efficiency, and verification cost (What should we actually measure in agent evaluation?). Notice the overlap: "context efficiency" and "verification cost" are throughput-adjacent metrics, but normalized against useful progress rather than raw generation speed.
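A small Python sketch can show how a collapsed score hides per-axis weaknesses. The five axis names follow the list above; the model names, the scores, and the helper names are invented for illustration and are not measurements from any benchmark.

```python
AXES = (
    "task_success",
    "privacy_compliance",
    "long_horizon_retention",
    "mode_shift_behavior",
    "ecosystem_readiness",
)

# Hypothetical scores in [0, 1]; not measurements from any benchmark.
SCORES = {
    "model_a": dict(zip(AXES, (0.92, 0.40, 0.55, 0.60, 0.70))),
    "model_b": dict(zip(AXES, (0.80, 0.85, 0.60, 0.65, 0.50))),
    "model_c": dict(zip(AXES, (0.70, 0.75, 0.80, 0.55, 0.65))),
}


def mean_score(vector):
    """Collapse the vector to one number, as a single-score leaderboard would."""
    return sum(vector[axis] for axis in AXES) / len(AXES)


def rank_on_axis(scores, axis):
    """Model names ordered best to worst on one axis."""
    return sorted(scores, key=lambda name: scores[name][axis], reverse=True)


def weakest_axis(vector):
    """The axis on which this model scores lowest."""
    return min(AXES, key=lambda axis: vector[axis])


for name, vector in SCORES.items():
    print(name, f"mean={mean_score(vector):.2f}", "weakest:", weakest_axis(vector))
for axis in AXES:
    print(axis, "->", rank_on_axis(SCORES, axis))
```

In this made-up data the model that leads on task success is last on privacy compliance, so a ranking by task success alone would promote the model least suited to a privacy-sensitive deployment.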

The reason these replacements matter becomes sharp when you look at multi-agent systems, where token spending is exposed as a confound rather than a virtue. Several notes converge on the finding that roughly 80% of multi-agent performance variance is explained by token budget alone, not coordination intelligence: systems can burn 15× more tokens than a single agent, with coordination yielding negative returns above 45% accuracy (Does token spending drive multi-agent research performance?; Are multi-agent systems actually intelligent coordination or just token spending?; How does test-time scaling work at the agent level?). If more tokens almost mechanically buy more performance, then throughput-per-token tells you nothing about whether the architecture is good; it only tells you how hard you stepped on the gas. The useful metric becomes performance *per dollar* or *per artifact* with token spend held constant, which is exactly why heterogeneous designs that route most subtasks to small models at 10–30× lower cost look economically rational (Can small language models handle most agent tasks?).
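One way to check whether token spend is acting as a confound in your own evaluation logs is sketched below in Python. It computes the share of score variance explained by a straight-line fit on tokens, then compares architectures only among runs near the same budget. The notes do not say how the 80% figure was computed, so the linear fit is an assumption, and the run data, the 10% tolerance, and the function names are hypothetical.

```python
from statistics import mean


def r_squared(xs, ys):
    """Share of variance in ys explained by a least-squares line on xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("need variation in both tokens and scores")
    return (sxy * sxy) / (sxx * syy)


def mean_score_at_budget(runs, budget, tolerance=0.1):
    """Average score per architecture, using only runs near one token budget."""
    by_arch = {}
    for arch, tokens, score in runs:
        if abs(tokens - budget) <= tolerance * budget:
            by_arch.setdefault(arch, []).append(score)
    return {arch: mean(scores) for arch, scores in by_arch.items()}


# Hypothetical runs: (architecture, tokens spent, task score).
RUNS = [
    ("single_agent", 100_000, 0.52),
    ("single_agent", 400_000, 0.61),
    ("multi_agent", 400_000, 0.60),
    ("multi_agent", 1_500_000, 0.74),
]

tokens = [t for _, t, _ in RUNS]
scores = [s for _, _, s in RUNS]
print(f"R^2 of score on tokens: {r_squared(tokens, scores):.2f}")
print(mean_score_at_budget(RUNS, budget=400_000))
```

If the first number is high and the per-architecture means at a matched budget are close, most of the apparent architectural advantage is likely token spend.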

There's a more surprising candidate metric hiding here, too: how much *learning signal* the deployment generates. One line argues every agent action emits a next-state signal (a user reply, a tool output, an error, a changed GUI) that can train the policy directly, turning deployment itself into a training loop (Can agent deployment itself generate training signals automatically?). Under that lens, a deployed agent's value isn't only artifacts produced but usable signal per interaction. And on the efficiency side, methods like shared-prefix tree rollouts measure *distinct trajectories per token budget*, squeezing more independent learning out of the same compute, while tool-call credit assignment improves *sample efficiency* by attributing reward to the tokens that mattered (Can shared-prefix trees reduce redundancy in agent rollouts?; Can simulated APIs and token-level credit assignment train better tool-using agents?).
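The "distinct trajectories per token budget" idea reduces to simple arithmetic, sketched here in Python under simplifying assumptions that are not from the source: every trajectory has the same length, the tree branches only once after a single shared prefix, and only generated tokens count against the budget. Real tree rollouts may branch at several depths. The function names and the numbers are illustrative.

```python
def independent_chain_count(budget, trajectory_len):
    """Trajectories affordable when every chain regenerates its own prefix."""
    return budget // trajectory_len


def shared_prefix_count(budget, trajectory_len, prefix_len):
    """Trajectories affordable when one prefix is generated once, then branched."""
    if not 0 <= prefix_len < trajectory_len:
        raise ValueError("prefix_len must be in [0, trajectory_len)")
    if budget < trajectory_len:
        return 0  # not enough tokens to finish even one trajectory
    # Pay for the prefix once; each branch then costs only its own suffix.
    return (budget - prefix_len) // (trajectory_len - prefix_len)


budget, trajectory_len, prefix_len = 10_000, 1_000, 600
print("independent chains:", independent_chain_count(budget, trajectory_len))
print("shared-prefix tree:", shared_prefix_count(budget, trajectory_len, prefix_len))
```

With a prefix length of zero the two counts coincide, and the advantage grows as a larger fraction of each trajectory is shared.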

So the replacement isn't one metric but a small family, organized by what you actually care about: cost-per-artifact (economics), the capability vector and trajectory-quality measures (deployability), performance-at-fixed-token-budget (architecture honesty), and signal-per-interaction (learning value). The thread connecting all of them is that they normalize against *useful work accomplished*, which is precisely what raw throughput-per-token erases.


Sources (10 notes)

Do persistent agents really cost less per token?

A 115-day case study found 82.9% of tokens were cache reads. When context persists and is reused, the meaningful cost denominator becomes completed artifacts, not individual tokens.

Does a single benchmark score actually predict agent readiness?

Capability decomposes into task success, privacy compliance, long-horizon retention, mode-shift behavior, and ecosystem readiness. Models ranked highest on one axis often rank lower on others, making single-score evaluations systematically misleading for real deployment.

What should we actually measure in agent evaluation?

Single-score evaluation collapses multi-dimensional agent behavior and creates false confidence in deployment readiness. Research shows agents need benchmarks for trajectory quality, memory hygiene, context efficiency, and verification cost to reflect actual system performance.

Does token spending drive multi-agent research performance?

Anthropic's internal evals show token spending alone accounts for 80% of performance variance in multi-agent research systems. Model capability upgrades deliver larger gains than doubling token budget, suggesting efficiency matters as much as quantity.

Are multi-agent systems actually intelligent coordination or just token spending?

Research shows token usage explains 80% of multi-agent performance variance, systems use 15× more tokens than single agents, and coordination yields negative returns above 45% accuracy. Performance gains come from token distribution, not coordination sophistication.

How does test-time scaling work at the agent level?

Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.

Can small language models handle most agent tasks?

SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.

Can agent deployment itself generate training signals automatically?

Every agent action produces a next-state signal (user reply, tool output, error, GUI change) that can train the policy directly. This universal signal source eliminates the need for separate training datasets across conversations, terminal tasks, SWE, and tool use.

Can shared-prefix trees reduce redundancy in agent rollouts?

Tree-structured rollouts that branch from shared prefixes produce more distinct trajectories within a fixed token budget than independent chain sampling. This improves advantage estimation statistics and enables longer-horizon tasks within the same compute constraint.

Can simulated APIs and token-level credit assignment train better tool-using agents?

ToolPO replaces costly real-API interactions with LLM-simulated ones and assigns credit directly to tool-invocation tokens rather than spreading outcome rewards across trajectories. This combination improves training stability and sample efficiency for tool-using agents.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a deployment engineer evaluating whether agent systems merit investment. The question remains open: what metrics actually predict whether a deployed agentic system will be useful and economical?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat these as constraints to re-test:
- Cost-per-artifact, not tokens-per-second, becomes the honest denominator when 82.9% of tokens are cache reads in persistent agents (2026-05, arXiv:2605.26870).
- Capability is a vector across task success, privacy, long-horizon retention, mode-shift behavior, and ecosystem readiness; single-axis rankings systematically mislead (synthesis from 2025-03, arXiv:2503.16416).
- ~80% of multi-agent performance variance is token-budget alone; coordination adds noise past accuracy thresholds; single-agent systems outperform multi-agent on equal token budget (2026-04, arXiv:2604.02460).
- Small models (10–30× cheaper) suffice for most agentic subtasks; routing heterogeneously is economically rational (2025-06, arXiv:2506.02153).
- Learning signal per interaction—next-state feedback from user replies, tool outputs, errors—is a live training loop metric (synthesis from path).

Anchor papers (verify; mind their dates):
- arXiv:2605.26870 (Persistent AI Agents, 2026-05): empirical case study on cache efficiency and cost-per-artifact.
- arXiv:2604.02460 (Single-Agent LLMs Outperform Multi-Agent, 2026-04): contradicts multi-agent hype; token-budget parity method.
- arXiv:2506.02153 (Small Language Models Future, 2025-06): cost-performance frontier for agentic subtasks.
- arXiv:2509.21240 (Tree Search for LLM Agent RL, 2025-09): trajectory-quality and sample-efficiency metrics.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, ask: have newer model architectures (e.g., MoE, on-device), inference methods (speculative decoding, continuous batching), caching layers (semantic or prefix trees), or live evaluation harnesses since ~Dec 2026 relaxed or overturned these ceilings? Separate the durable question—what *should* we measure for deployable agents?—from the perishable limitation (e.g., "small models can't do X"). If a constraint still holds, say plainly; if it's been resolved, cite what resolved it.

(2) Surface the strongest contradicting or superseding work from the last ~6 months. Look for papers that claim single-agent systems are no longer optimal, that multi-agent coordination has gained signal-to-token ratio, or that per-artifact metrics miss a critical dimension.

(3) Propose 2 research questions that assume the regime may have moved: e.g., "If cache efficiency has reached saturation, is verification cost the new honest denominator?" or "Does learning-signal-per-interaction scale with model scale and deployment duration?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
