Does token spending drive multi-agent research performance?

Multi-agent systems outperform single agents substantially, but what actually accounts for that improvement? Is it intelligent coordination or simply spending more tokens on the same task?

Synthesis note · 2026-02-23 · sourced from Agents Multi Architecture

Anthropic's internal evaluation of their multi-agent research system reveals a surprising decomposition: on the BrowseComp evaluation, token usage by itself explains 80% of the performance variance, with the number of tool calls and model choice as the remaining two explanatory factors. Together, these three factors explain 95% of variance.
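To make "explains 80% of the performance variance" concrete: for a single predictor, the share of variance explained is the R² of a least-squares line fit. The sketch below computes that quantity in plain Python. The run data is invented purely for illustration and is not Anthropic's data; `r_squared`, `tokens_k`, and `scores` are hypothetical names, and regressing on log-tokens is an assumption, not something the source states.

```python
import math
from statistics import mean

def r_squared(xs, ys):
    """Share of variance in ys explained by a least-squares line fit on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual sum of squares (unexplained) vs total sum of squares.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Invented example runs (NOT real evaluation data):
# tokens used per run, in thousands, and the score each run achieved.
tokens_k = [10, 20, 40, 80, 160, 320]
scores = [0.12, 0.25, 0.31, 0.48, 0.52, 0.71]

# Assumption: score is modeled as linear in log2(tokens).
log_tokens = [math.log2(t) for t in tokens_k]
r2_tokens = r_squared(log_tokens, scores)
print(f"variance explained by token usage alone: {r2_tokens:.2f}")
```

Extending this to all three factors (tokens, tool calls, model choice) would need a multiple regression; the gap between the single-factor and three-factor R² corresponds to the step from 80% to 95% in the note.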

The implication is uncomfortable: multi-agent systems work primarily because they spend enough tokens, not because they coordinate intelligently. Subagents facilitate compression by operating in parallel with their own context windows, exploring different aspects simultaneously before condensing the most important tokens for the lead agent. Each subagent provides separation of concerns — distinct tools, prompts, and exploration trajectories — which reduces path dependency.
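The fan-out and compression pattern described above can be sketched with `asyncio`. This is a minimal illustration under stated assumptions, not Anthropic's implementation: `run_subagent` and `lead_agent` are hypothetical names, the "exploration" is a stub rather than a model call, and the per-subagent budget of 5,000 tokens is arbitrary.

```python
import asyncio

async def run_subagent(question: str, aspect: str, token_budget: int) -> str:
    """Hypothetical subagent: explores one aspect in its own context."""
    # Stand-in for a long exploration trajectory. A real subagent would
    # call a model and tools here and accumulate a large private context.
    transcript = [f"{aspect} finding {i} for '{question}'"
                  for i in range(token_budget // 1000)]
    await asyncio.sleep(0)  # yield control so subagents can interleave
    # Compression step: only a short summary is returned to the lead,
    # not the full transcript.
    top = transcript[0] if transcript else "none"
    return f"{aspect}: {len(transcript)} findings, top = {top}"

async def lead_agent(question: str, aspects: list[str]) -> list[str]:
    """Hypothetical lead: one subagent per aspect, run concurrently."""
    tasks = [run_subagent(question, a, token_budget=5000) for a in aspects]
    # gather() returns results in the same order as the tasks were given.
    return await asyncio.gather(*tasks)

summaries = asyncio.run(
    lead_agent("example question", ["pricing", "history", "competitors"])
)
for line in summaries:
    print(line)
```

The lead only ever sees the short summaries, which is the sense in which each subagent's separate context window acts as a compression stage.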

The economics are revealing. A multi-agent system with Claude Opus as lead and Claude Sonnet subagents outperforms single-agent Opus by 90.2% on Anthropic's internal research evaluation, with the advantage most pronounced on breadth-first queries. Agents use roughly 4× more tokens than chat interactions, and multi-agent systems use approximately 15× more tokens than chats. Upgrading to a newer Claude Sonnet yields a larger performance gain than doubling the token budget on the older model, which means model capability multiplies token efficiency.
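The token multipliers above imply a relative cost between the two agent setups that the note does not spell out. A small worked calculation, where the 2,000-token chat baseline is a hypothetical figure chosen only to make the numbers concrete:

```python
# Multipliers relative to a chat interaction, as quoted in the note.
AGENT_MULTIPLIER = 4.0         # "roughly 4x"
MULTI_AGENT_MULTIPLIER = 15.0  # "approximately 15x"

def estimated_tokens(chat_tokens: float, multiplier: float) -> float:
    """Scale a chat-sized token count by an approximate multiplier."""
    return chat_tokens * multiplier

chat_tokens = 2000  # hypothetical baseline, not from the source

single_agent = estimated_tokens(chat_tokens, AGENT_MULTIPLIER)
multi_agent = estimated_tokens(chat_tokens, MULTI_AGENT_MULTIPLIER)
ratio = MULTI_AGENT_MULTIPLIER / AGENT_MULTIPLIER

print(single_agent)  # 8000.0
print(multi_agent)   # 30000.0
print(ratio)         # 3.75
```

So, taking both multipliers at face value, a multi-agent run costs on the order of 3.75× the tokens of a single-agent run on the same task.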

The practical design principle: multi-agent architectures effectively scale token usage for tasks that exceed the limits of single agents. But they excel specifically at tasks involving heavy parallelization, information that exceeds single context windows, and interfacing with numerous complex tools. Tasks requiring shared context or with many inter-agent dependencies are not a good fit.
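One way to turn this principle into a checklist is a small fit heuristic. The sketch below is an assumption-laden illustration: `TaskProfile`, `fits_multi_agent`, and every threshold (including the 200,000-token context window) are hypothetical and would need tuning for a real system.

```python
from dataclasses import dataclass

@dataclass
class TaskProfile:
    independent_subtasks: int      # pieces that can run in parallel
    total_context_tokens: int      # information the task must cover
    tool_count: int                # distinct complex tools involved
    needs_shared_context: bool     # agents must see the same state
    cross_agent_dependencies: int  # hand-offs between agents

def fits_multi_agent(task: TaskProfile,
                     context_window: int = 200_000,  # assumed limit
                     min_subtasks: int = 3,          # assumed threshold
                     min_tools: int = 5,             # assumed threshold
                     max_dependencies: int = 2) -> bool:
    """Rough screen: poor-fit conditions veto, any good-fit condition passes."""
    # Poor fit: shared context or many inter-agent dependencies.
    if task.needs_shared_context or task.cross_agent_dependencies > max_dependencies:
        return False
    # Good fit: heavy parallelization, oversized context, or many tools.
    return (task.independent_subtasks >= min_subtasks
            or task.total_context_tokens > context_window
            or task.tool_count >= min_tools)

survey = TaskProfile(8, 600_000, 4, False, 0)
refactor = TaskProfile(2, 50_000, 3, True, 6)
print(fits_multi_agent(survey))    # True
print(fits_multi_agent(refactor))  # False
```

Treating the poor-fit conditions as a veto reflects the note's framing that shared context and dense dependencies disqualify a task regardless of how parallel it otherwise looks.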

This connects to the note "Does search budget scale like reasoning tokens for answer quality?": the Anthropic finding extends the test-time scaling (TTS) law from search steps to token budget directly, and confirms that the scaling mechanism is fundamentally about compute quantity, not coordination quality.

Inquiring lines that read this note

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Does tokenized intelligence retain genuine value through exchange-based systems?

What happens to token value when populations surrender cognitively at different rates?

How do multi-agent systems achieve genuine cooperation and reasoning?

How does test-time aggregation affect reasoning correctness and reliability?

When does multi-agent voting help versus hurt performance on tasks?

Does parallel reasoning outperform sequential thinking under fixed compute budgets?

When do multi-agent approaches outperform single model extended thinking?

What drives capability and cost efficiency in agent systems?

Can inference-time compute substitute for scaling up model parameters?

How does test-time scaling relate to token budget in agentic deep research?

Does model scaling alone produce compositional generalization without symbolic mechanisms?

Can token probability distributions extend swarm composition across different model architectures?

When does architectural design matter more than raw model capacity?

Why do frontier models remain cost-effective despite higher token prices in production?

Related concepts in this collection 4

Does search budget scale like reasoning tokens for answer quality? Explores whether the test-time scaling law that applies to reasoning tokens also governs search-based retrieval in agentic systems. Understanding this relationship could reshape how we allocate inference compute between thinking and searching.
TTS law generalizes from search steps to token budget; both show monotonic-to-diminishing-returns curves
Why does parallel reasoning outperform single chain thinking? Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
multi-agent research is the system-level analog of parallel thinking; breadth-first search via parallel agents
What makes deep research fundamentally different from RAG? Explores whether current systems using the label 'deep research' actually meet a rigorous three-component definition involving multi-step gathering, cross-source synthesis, and iterative refinement, or if they're performing something narrower.
multi-agent research meets all three components; subagents handle multi-step gathering, lead handles synthesis
Do hierarchical retrieval architectures outperform flat ones on complex queries? Explores whether separating query planning from answer synthesis into distinct architectural components improves performance on multi-hop retrieval tasks compared to unified single-pass approaches.
Anthropic's lead/subagent hierarchy is an instantiation of this pattern
