INQUIRING LINE

Does horizontal coordination improve with stronger individual agents?

This explores whether peer-to-peer ('horizontal') coordination among AI agents gets better when each agent is individually smarter — and the corpus answer is mostly counterintuitive: stronger individuals shrink the payoff from coordinating, and the failures that remain aren't fixed by raw capability.


Reading the question as 'do agents coordinate better as peers when each one is more capable?', the corpus pushes back on the intuition. The most direct finding is that multi-agent advantages actually *diminish* as single-agent capability improves (see: When do multi-agent systems actually outperform single agents?). As models get stronger, the performance gap between a lone agent and a coordinating team narrows, and a single agent often wins outright. So stronger individuals don't supercharge horizontal coordination; they erode the reason to coordinate in the first place. There's even a measured ceiling: across 180 configurations, coordination stops helping once a task is already being solved above ~45% accuracy, and topology (not agent count or smarts) is what controls whether errors get amplified or damped (see: When does adding more agents actually help systems?).

The deeper reason is that the things that break horizontal coordination are *structural*, not capability-bound. Agents in a peer network fail by agreeing too late or by adopting a strategy without telling their neighbors. Crucially, they accept neighbor information without verifying it, so one error propagates across the network even though each agent is individually capable of spotting a direct conflict (see: Why do multi-agent systems fail to coordinate at scale?). A smarter agent that still trusts its neighbors uncritically doesn't fix that. The same pattern shows up in consensus: LLM-agent groups fail mostly through 'liveness loss' (timeouts and stalled convergence) rather than getting the answer wrong, and agreement degrades as the group grows even with no malicious agents present (see: Can LLM agent groups reliably reach consensus together?). These are coordination-protocol problems, not intelligence problems.
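The unverified-trust failure described above can be illustrated with a toy simulation. This is a minimal sketch under loud assumptions: the ring topology, the one-hop-per-round spread rule, and the names `spread_error` and `verify` are all hypothetical and are not taken from AgentsNet or any cited paper. Nothing in the model represents agent capability; the only switch is whether incoming information is checked.

```python
def spread_error(n_agents: int, rounds: int, verify: bool) -> int:
    """Count agents holding an erroneous value after `rounds` exchanges.

    Toy model (assumption): agents sit on a ring and talk only to their
    two neighbors. Agent 0 starts with a single seeded error.
    """
    wrong = [False] * n_agents
    wrong[0] = True  # the one initial error

    for _ in range(rounds):
        nxt = wrong[:]  # update synchronously from the previous round's state
        for i in range(n_agents):
            left = wrong[(i - 1) % n_agents]
            right = wrong[(i + 1) % n_agents]
            # Uncritical trust: adopt a neighbor's erroneous value as-is.
            # With verification on, the agent checks and rejects it.
            if (left or right) and not verify:
                nxt[i] = True
        wrong = nxt

    return sum(wrong)


if __name__ == "__main__":
    print(spread_error(10, 3, verify=False))  # 7
    print(spread_error(10, 3, verify=True))   # 1
```

Under these assumptions, one seeded error reaches 7 of 10 agents after three rounds of uncritical exchange and stays contained at 1 when each agent checks what it receives. The outcome changes with the protocol, not with the agent.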

There's also a sobering deflation of what 'coordination intelligence' even contributes. One analysis finds ~80% of multi-agent performance variance comes from token budget (how much the system spends), not from clever coordination (see: How does test-time scaling work at the agent level?). That reframes a lot of apparent 'better coordination' as simply 'more compute,' which means making individual agents stronger may just be buying the same thing through a different door.
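To make 'share of variance' concrete: with a single predictor, the fraction of variance in performance explained by token budget is the R² of a linear fit, which equals the squared Pearson correlation. The sketch below computes it. The function name `r_squared` and the sample numbers are made up for illustration; they are not data or method from the cited analysis.

```python
def r_squared(xs, ys):
    """R^2 of a one-variable least-squares line (squared Pearson correlation)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("R^2 is undefined when either variable is constant")
    return (sxy * sxy) / (sxx * syy)


if __name__ == "__main__":
    # Hypothetical placeholder values: token budget (arbitrary units) vs. score.
    budgets = [1, 2, 3, 4]
    scores = [1, 3, 2, 4]
    print(r_squared(budgets, scores))  # 0.64
```

A result of 0.64 would read as '64% of the variance in score is accounted for by budget'; the ~80% figure in the text is a claim of the same kind.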

Where coordination *does* improve, the corpus suggests the lever is design, not individual horsepower. Structured artifact-sharing (agents producing standardized documents and pulling from a shared environment) beats free-form conversational exchange (see: Does structured artifact sharing outperform conversational coordination?). Hybrid protocols with fixed external ordering but autonomous internal role-selection outperform both centralized hierarchies and fully autonomous swarms, by 14% and 44% respectively in a 25,000-task experiment. Notably, agents in those systems self-abstain when they're incompetent, which is a coordination behavior, not a capability one (see: Do self-organizing agent teams outperform rigid hierarchies?). And teams can lift the floor by deactivating their weakest members at inference time via contribution scoring (see: Can multi-agent teams automatically remove their weakest members?), which improves the *group* by editing composition rather than upgrading every agent.
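The idea of editing team composition by contribution score can be sketched as follows. This is a deliberately crude stand-in, not DyLAN's propagation, aggregation, and selection mechanism: here an agent's score is simply the share of tasks on which its answer matched the team majority. The names `agreement_scores` and `prune_weakest` and the sample votes are hypothetical.

```python
from collections import Counter


def agreement_scores(votes):
    """Score each agent by how often it agrees with the team's majority answer.

    votes: dict mapping agent name -> list of answers, one per task.
    Assumption: every agent answered every task.
    """
    agents = list(votes)
    n_tasks = len(votes[agents[0]])
    hits = {agent: 0 for agent in agents}
    for task in range(n_tasks):
        answers = [votes[agent][task] for agent in agents]
        # On a tie, most_common keeps the answer that was encountered first.
        winner = Counter(answers).most_common(1)[0][0]
        for agent in agents:
            if votes[agent][task] == winner:
                hits[agent] += 1
    return {agent: hits[agent] / n_tasks for agent in agents}


def prune_weakest(votes, keep):
    """Return the names of the `keep` highest-scoring agents."""
    scores = agreement_scores(votes)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:keep])


if __name__ == "__main__":
    # Made-up answers for four agents on four tasks.
    votes = {
        "a": ["x", "y", "z", "x"],
        "b": ["x", "y", "z", "x"],
        "c": ["x", "y", "w", "x"],
        "d": ["p", "q", "r", "x"],
    }
    print(agreement_scores(votes))
    print(prune_weakest(votes, keep=3))
```

In this example agent `d` agrees with the majority on only one task in four and is the one dropped. Majority agreement is a weak proxy (it penalizes a lone correct dissenter), which is one reason a real system would want a more careful contribution measure.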

The thing you might not have expected to want to know: the field increasingly argues that as agents become economic actors (holding credentials, transacting, leaving auditable records), raw model capability stops being the binding constraint entirely, and reliable coordination, settlement, and accountability become the bottleneck (see: When do agents need coordination more than raw capability?). In that frame the question almost inverts: it's not 'does coordination improve with stronger agents?' but 'once agents are strong enough, coordination is the only thing left to improve.'


Sources (9 notes)

When do multi-agent systems actually outperform single agents?

Empirical analysis shows MAS performance gaps narrow with stronger models, with SAS outperforming in many cases. Three formal defect types—node-level bottlenecks, edge-level overwhelm, and path-level error propagation—explain when single agents win.

When does adding more agents actually help systems?

Across 180 configurations, three dominant effects predict multi-agent success: tool-coordination trade-offs harm complex tasks, coordination stops helping above 45% accuracy, and topology choice controls error amplification by 4–17×. Architecture-task alignment, not agent count, determines outcomes.

Why do multi-agent systems fail to coordinate at scale?

AgentsNet benchmark shows agents fail to coordinate strategies either by agreeing too late or adopting strategies without informing neighbors. Agents accept neighbor information without verification, enabling error propagation while remaining capable of detecting direct conflicts.

Can LLM agent groups reliably reach consensus together?

Across hundreds of simulations, LLM-agent groups frequently fail to reach valid agreement due to timeouts and stalled convergence rather than subtle value corruption. Agreement degrades with group size even without Byzantine agents present.

How does test-time scaling work at the agent level?

Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.

Does structured artifact sharing outperform conversational coordination?

MetaGPT demonstrates that agents producing standardized engineering documents achieve superior coordination compared to conversational exchange. Active information pulling from shared environments eliminates noise and mirrors efficient human workplace infrastructure.

Do self-organizing agent teams outperform rigid hierarchies?

A 25,000-task experiment across 8 models and multiple agent counts showed that sequential protocols with external ordering but internal role selection outperform centralized systems by 14% and fully autonomous systems by 44%. Agents spontaneously invented specialized roles and self-abstained when incompetent.

Can multi-agent teams automatically remove their weakest members?

DyLAN's three-step importance scoring mechanism (propagation, aggregation, selection) quantifies individual agent contributions and automatically removes uninformative agents during inference, optimizing team composition without task-specific tuning.

When do agents need coordination more than raw capability?

Once agents hold credentials, transact value, and interact with other agents, raw model capability stops being the limiting factor. The real bottleneck becomes whether agents can coordinate reliably, settle accounts, and leave auditable evidence of their actions.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether horizontal coordination improves when individual agents become more capable—a question the field has been probing since ~2023. A curated library spanning 2023–2026 found something counterintuitive:

**What a curated library found — and when (dated claims, not current truth):**
- Multi-agent advantages *diminish* as single-agent LLM capability improves; coordination stops helping once tasks reach ~45% accuracy (2024–2025).
- Coordination failures are structural (late agreement, uncritical trust, error propagation) not capability-bound; smarter agents don't fix information verification defects (2025–2026).
- LLM-agent consensus fails primarily via liveness loss (timeouts, stalled convergence), not subtle value corruption; even with no malicious agents, agreement degrades with group size (2026).
- ~80% of multi-agent performance variance comes from token budget (compute), not coordination design (2025).
- Coordination improves when the lever is *protocol design* (standardized artifacts, hybrid fixed–autonomous ordering, contribution scoring), not individual horsepower (2025–2026).

**Anchor papers (verify; mind their dates):**
- arXiv:2604.02460 (2026-04): Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking.
- arXiv:2603.28990 (2026-03): Drop the Hierarchy and Roles: How Self-Organizing LLM Agents Outperform Designed Structure.
- arXiv:2603.01213 (2026-03): Can AI Agents Agree?
- arXiv:2605.23218 (2026-05): Foundation Protocol: A Coordination Layer for Agentic Society.

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, judge whether newer models (o1, o3, etc.), training methods (RLHF variants, scaffolding), orchestration (memory/caching innovations, multi-turn reasoning harnesses), or evaluation protocols have since RELAXED the 45%-accuracy ceiling, made verification-in-coordination tractable, or moved the compute/design tradeoff. Separate the durable question ('does capability solve coordination?') from the perishable limitation ('current topologies amplify unchecked errors'). Cite what resolved each.
(2) **Surface the strongest CONTRADICTING work from the last ~6 months** — papers claiming coordination *does* scale with agent strength, or that capability unlocks new coordination regimes.
(3) **Propose 2 research questions that ASSUME the regime may have shifted:** e.g., 'Do reasoning-intensive agents (o1-class) escape the 45% coordination ceiling?' or 'Can formal verification protocols embedded in agent communication solve the trust-propagation problem?'

**Cite arXiv IDs; flag anything you cannot ground in a real paper.**
