Does upgrading model capability improve token efficiency in agentic systems?
This explores whether a smarter model gets more done per token in agent systems — or whether token efficiency comes from somewhere other than raw model capability.
This question reads model capability and token efficiency as a trade-off, and the corpus has a surprisingly direct answer to it. Anthropic's internal evals on multi-agent research found that token spending alone explains roughly 80% of performance variance, but, crucially, that upgrading the model delivered *larger* gains than doubling the token budget (Does token spending drive multi-agent research performance?). So yes: a more capable model is the more efficient way to buy performance, because it converts each token into more progress than simply spending more tokens does. The same line of work frames multi-agent performance as fundamentally a function of token spending, while pointing to approaches such as LatentMAS and shared-KV-cache architectures that try to decouple the gains from the costs (How does test-time scaling work at the agent level?).
But the more interesting move in the corpus is to question the premise that capability is where efficiency lives at all. A recurring finding is that reliability and efficiency come from the *harness* around the model: memory, skills, and protocols are externalized into system structure so the model doesn't re-solve the same problems on every call (Where does agent reliability actually come from?). Turning a language model into a capable agent isn't just a matter of a better model either; it requires transforming the whole pipeline (data, action grounding, infrastructure, and safety), not just retraining (Can you turn an LLM into an agent by just fine-tuning?).
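To make the harness idea concrete, here is a minimal sketch of externalized memory, skills, and protocol. The `Harness` class, its field names, and the two-action protocol are hypothetical illustrations; they are not taken from any of the cited notes or from a real agent framework.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class Harness:
    """Hypothetical harness: state lives here, not in the model's context."""
    # Memory: results persisted across calls (state persistence).
    memory: Dict[str, str] = field(default_factory=dict)
    # Skills: reusable procedures the model can invoke by name.
    skills: Dict[str, Callable[[str], str]] = field(default_factory=dict)
    # Protocol: the only actions the harness will accept from the model.
    allowed_actions: Tuple[str, ...] = ("recall", "run_skill")

    def step(self, action: str, arg: str) -> str:
        """Validate one model-proposed action and execute it."""
        if action not in self.allowed_actions:
            raise ValueError(f"action {action!r} violates the protocol")
        if action == "recall":
            # Return a stored result, or an empty string if nothing is stored.
            return self.memory.get(arg, "")
        # run_skill: arg has the assumed form "skill_name:payload".
        name, _, payload = arg.partition(":")
        result = self.skills[name](payload)
        # Persist the result so later calls can recall it instead of redoing it.
        self.memory[arg] = result
        return result
```

In this sketch, a second request for the same work becomes a cheap `recall` rather than another model call, which is the sense in which structure, not capability, saves tokens.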
That reframing flips the economics. One line of research argues small language models are *sufficient* for most agentic subtasks, meaning the repetitive, well-defined work that makes up the bulk of an agent's job, at 10–30× lower cost. That makes a heterogeneous design (small models by default, large models only when needed) the rational pattern (Can small language models handle most agent tasks?). Here, efficiency comes not from upgrading capability but from *matching* capability to each subtask. And once context persists across long runs, the meaningful denominator stops being the token entirely: a 115-day study found 82.9% of tokens were cache reads, shifting the economic unit from cost-per-token to cost-per-completed-artifact (Do persistent agents really cost less per token?).
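The arithmetic behind both claims can be sketched in a few lines. Everything below is an illustrative assumption: the `Pricing` fields, the prices used in the usage note, the share of subtasks routed to the small model, and the simplification that all tokens are billed at one of two input rates. Only the 82.9% cache-read share and the 10–30× cost range come from the notes above.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    """Hypothetical prices in USD per million tokens (not a real price list)."""
    fresh_input: float  # tokens processed without a cache hit
    cache_read: float   # tokens served from cache


def run_cost(total_tokens: int, cache_read_share: float, p: Pricing) -> float:
    """Cost of a run in which a fraction of all tokens are cache reads."""
    cached = total_tokens * cache_read_share
    fresh = total_tokens - cached
    return (fresh * p.fresh_input + cached * p.cache_read) / 1_000_000


def cost_per_artifact(total_tokens: int, cache_read_share: float,
                      artifacts: int, p: Pricing) -> float:
    """Divide run cost by completed artifacts instead of by tokens."""
    return run_cost(total_tokens, cache_read_share, p) / artifacts


def blended_cost_ratio(small_share: float, cost_reduction: float) -> float:
    """Cost of a small-by-default mix relative to using the large model only.

    small_share: fraction of subtasks sent to the small model (assumed).
    cost_reduction: how many times cheaper the small model is (e.g. 10 to 30).
    """
    return small_share / cost_reduction + (1.0 - small_share)
```

With made-up prices of 10.0 (fresh) and 1.0 (cache read) per million tokens, a one-million-token run at an 82.9% cache-read share costs about 2.54 instead of 10.0. Routing an assumed 80% of subtasks to a model that is 10× cheaper brings the blended cost to roughly 0.28 of the all-large baseline.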
The corpus also shows efficiency gains that have nothing to do with the model's intelligence. Shared-prefix tree rollouts produce more distinct trajectories per fixed token budget than independent sampling, improving long-horizon work under the same compute ceiling (Can shared-prefix trees reduce redundancy in agent rollouts?). Autonomous memory folding compresses interaction history into structured schemas, cutting token overhead while letting agents pause and rethink (Can agents compress their own memory without losing critical details?). Both buy efficiency through structure, not scale.
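A rough token count shows why shared prefixes help. This is a simplified sketch, not the cited method: it assumes a uniform tree with a fixed branching factor, a fixed number of generated tokens per segment, and that a shared prefix is generated once rather than once per trajectory. All function names are hypothetical.

```python
def independent_tokens(n_trajectories: int, depth: int, segment_len: int) -> int:
    """Tokens generated when every trajectory is sampled from scratch."""
    return n_trajectories * depth * segment_len


def tree_tokens(branching: int, depth: int, segment_len: int) -> int:
    """Tokens generated by a uniform tree; each node is one new segment.

    Level k (1..depth) holds branching**k nodes, and siblings share
    everything above them, so prefixes are generated only once.
    """
    nodes = sum(branching ** k for k in range(1, depth + 1))
    return nodes * segment_len


def tree_trajectories(branching: int, depth: int) -> int:
    """Distinct root-to-leaf trajectories in the uniform tree."""
    return branching ** depth


def independent_trajectories_within(budget: int, depth: int, segment_len: int) -> int:
    """How many full independent trajectories fit in a token budget."""
    return budget // (depth * segment_len)
```

For a branching factor of 2, depth 3, and 100 tokens per segment, the tree generates 1,400 tokens for 8 trajectories, while 8 independent samples need 2,400 tokens; under the same 1,400-token budget, independent sampling completes only 4.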
The takeaway you might not have expected: upgrading the model *does* improve token efficiency, and per the data it beats just spending more — but it's one of the weaker levers in the toolkit. Caching, right-sizing the model to the subtask, externalizing memory, and smarter rollout structure each move the efficiency needle without touching capability at all. The agentic-systems literature has quietly relocated efficiency from the model into the harness.
Sources (8 notes)
Does token spending drive multi-agent research performance? Anthropic's internal evals show token spending alone accounts for 80% of performance variance in multi-agent research systems. Model capability upgrades deliver larger gains than doubling token budget, suggesting efficiency matters as much as quantity.
How does test-time scaling work at the agent level? Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.
Where does agent reliability actually come from? Research shows reliable LLM agents externalize three cognitive burdens into a harness layer rather than relying on model scale alone: memory (state persistence), skills (procedural components), and protocols (structured interaction). The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.
Can you turn an LLM into an agent by just fine-tuning? Converting LLMs to action-capable systems requires four distinct stages: curating action-environment-user datasets, training for action grounding, integrating agent infrastructure with memory and tools, and rigorous safety evaluation. The surrounding system and harness determine whether actions are grounded or hallucinated.
Can small language models handle most agent tasks? SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs used selectively) the economically rational design pattern.
Do persistent agents really cost less per token? A 115-day case study found 82.9% of tokens were cache reads. When context persists and is reused, the meaningful cost denominator becomes completed artifacts, not individual tokens.
Can shared-prefix trees reduce redundancy in agent rollouts? Tree-structured rollouts that branch from shared prefixes produce more distinct trajectories within a fixed token budget than independent chain sampling. This improves advantage estimation statistics and enables longer-horizon tasks within the same compute constraint.
Can agents compress their own memory without losing critical details? DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies; the autonomy and structure together avoid the degradation that plagues poorly designed consolidation.