SYNTHESIS NOTE
Agentic Systems and Tool Use

Does token spending drive multi-agent research performance?

Multi-agent systems outperform single agents substantially, but what actually accounts for that improvement? Is it intelligent coordination or simply spending more tokens on the same task?

Synthesis note · 2026-02-23 · sourced from Agents Multi Architecture

Anthropic's internal evaluation of their multi-agent research system reveals a surprising decomposition: on the BrowseComp evaluation, token usage by itself explains 80% of the performance variance, with the number of tool calls and model choice as the remaining two explanatory factors. Together, these three factors explain 95% of variance.
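To make "explains 80% of the variance" concrete, here is a minimal sketch of how such a decomposition could be computed: fit nested least-squares models and record R^2 as each factor is added. Everything in it is an assumption for illustration. The data is synthetic, and the factor names, coefficients, and use of log tokens are hypothetical; this is not Anthropic's method, and the printed values are not their results.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def r_squared(columns, y):
    """R^2 of an ordinary least-squares fit of y on the columns plus an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations: (X^T X) beta = X^T y
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = solve(XtX, Xty)
    pred = [sum(bj * xj for bj, xj in zip(beta, row)) for row in X]
    mean_y = sum(y) / n
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Synthetic evaluation runs (hypothetical, for illustration only).
rng = random.Random(0)
n = 300
log_tokens = [rng.gauss(0, 1) for _ in range(n)]
tool_calls = [0.5 * t + rng.gauss(0, 1) for t in log_tokens]
model_tier = [float(rng.randint(0, 1)) for _ in range(n)]
score = [2.0 * t + 0.6 * c + 0.8 * m + rng.gauss(0, 0.5)
         for t, c, m in zip(log_tokens, tool_calls, model_tier)]

# Add factors one at a time and record the cumulative R^2.
factors = {"log_tokens": log_tokens, "tool_calls": tool_calls, "model_tier": model_tier}
cumulative = {}
used = []
for name, values in factors.items():
    used.append(values)
    cumulative[name] = r_squared(used, score)
    print(f"+ {name}: cumulative R^2 = {cumulative[name]:.2f}")
```

When factors are correlated, as token usage and tool calls plausibly are, the share credited to each one depends on the order in which they enter the model. A figure like "80% from tokens alone" is therefore best read as the R^2 of the tokens-only model.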

The implication is uncomfortable: multi-agent systems work primarily because they spend enough tokens, not because they coordinate intelligently. The architecture matters mainly as a way to spend those tokens productively. Subagents facilitate compression: each operates in parallel with its own context window, exploring a different aspect of the question before condensing the most important tokens for the lead agent. Each subagent also provides separation of concerns (distinct tools, prompts, and exploration trajectories), which reduces path dependency.
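The fan-out and condense pattern described above can be sketched with asyncio. This is a toy illustration under stated assumptions, not Anthropic's implementation: run_subagent and lead_agent are hypothetical names, the "exploration" is a stub list, and no model or tool is actually called.

```python
import asyncio

async def run_subagent(subtask: str, max_summary_chars: int = 80) -> str:
    """Hypothetical subagent: explores in its own context, returns a condensed finding."""
    # Stand-in for a long exploration trajectory kept in the subagent's private context.
    private_context = [f"note {i} about {subtask}" for i in range(100)]
    await asyncio.sleep(0)  # placeholder for model and tool calls
    summary = f"{subtask}: {len(private_context)} notes reviewed"
    # Only the condensed summary leaves the subagent, never the full context.
    return summary[:max_summary_chars]

async def lead_agent(question: str, subtasks: list[str]) -> str:
    """Hypothetical lead agent: fans out subtasks concurrently, sees only summaries."""
    summaries = await asyncio.gather(*(run_subagent(s) for s in subtasks))
    return question + "\n" + "\n".join(f"- {s}" for s in summaries)

report = asyncio.run(
    lead_agent("Which vendors ship feature X?", ["vendor A", "vendor B", "vendor C"])
)
print(report)
```

The design point is that the lead agent's context grows with the number of summaries rather than with the total tokens each subagent consumed, which is what lets total token spend scale beyond a single context window.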

The economics are revealing. A multi-agent system with Claude Opus as the lead agent and Claude Sonnet subagents outperforms single-agent Opus by 90.2% on breadth-first research. The cost is token volume: agents use roughly 4× more tokens than chat interactions, and multi-agent systems use approximately 15× more tokens than chats. Model choice still matters: upgrading to a newer Claude Sonnet yields a larger performance gain than doubling the token budget on the older model, meaning model capability multiplies token efficiency.
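As a worked example of the token multipliers quoted above, the sketch below scales a chat-sized token count to agent and multi-agent usage. The 4× and 15× figures come from this note; the 2,000-token baseline and the function names are hypothetical.

```python
# Token multipliers relative to a chat interaction, as quoted in this note.
TOKEN_MULTIPLIER = {"chat": 1.0, "agent": 4.0, "multi_agent": 15.0}

def estimated_tokens(chat_tokens: float, mode: str) -> float:
    """Scale a chat-sized token count to the given mode."""
    return chat_tokens * TOKEN_MULTIPLIER[mode]

def relative_cost(mode: str, baseline: str) -> float:
    """How many times more tokens `mode` uses than `baseline`."""
    return TOKEN_MULTIPLIER[mode] / TOKEN_MULTIPLIER[baseline]

chat_tokens = 2_000  # hypothetical baseline for one chat interaction
for mode in TOKEN_MULTIPLIER:
    print(mode, estimated_tokens(chat_tokens, mode))

print(relative_cost("multi_agent", "agent"))  # 3.75
```

Read this way, moving from a single agent to a multi-agent system costs roughly 3.75× the tokens, so the extra spend only pays off on tasks where the performance gain is worth that much.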

The practical design principle: multi-agent architectures effectively scale token usage for tasks that exceed the limits of single agents. But they excel specifically at tasks involving heavy parallelization, information that exceeds single context windows, and interfacing with numerous complex tools. Tasks requiring shared context or with many inter-agent dependencies are not a good fit.
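The fit criteria above can be written down as a small checklist. This is only a sketch: the TaskProfile fields, the three labels, and the choice to let the poor-fit conditions take precedence are assumptions made for illustration, not rules stated in the source.

```python
from dataclasses import dataclass

@dataclass
class TaskProfile:
    """Hypothetical description of a task, using the criteria named in this note."""
    heavily_parallelizable: bool = False
    exceeds_single_context: bool = False
    many_complex_tools: bool = False
    needs_shared_context: bool = False
    many_inter_agent_dependencies: bool = False

def multi_agent_fit(task: TaskProfile) -> str:
    """Classify a task as a good or poor fit for a multi-agent architecture."""
    # Assumption: disqualifying traits are checked first and override the benefits.
    if task.needs_shared_context or task.many_inter_agent_dependencies:
        return "poor fit"
    if (task.heavily_parallelizable
            or task.exceeds_single_context
            or task.many_complex_tools):
        return "good fit"
    return "no clear benefit"

print(multi_agent_fit(TaskProfile(heavily_parallelizable=True)))
print(multi_agent_fit(TaskProfile(heavily_parallelizable=True, needs_shared_context=True)))
```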

Read alongside the related note "Does search budget scale like reasoning tokens for answer quality?", the Anthropic finding extends the TTS law from search steps directly to token budget, and reinforces the reading that the scaling mechanism is fundamentally about compute quantity, not coordination quality.

Original note title

multi-agent research performance is primarily a token spending function — token usage explains 80 percent of variance while model choice and tool calls explain the remainder