INQUIRING LINE

Why does capability discovery become the bottleneck in large agent systems?

This explores why, once you have many agents, the hard part stops being building capable agents and becomes *finding* the right one for a job — and whether the corpus treats that 'discovery bottleneck' as real or as a symptom of something deeper.


This reads the question as being about scale. In a system with a handful of agents you wire them together by hand, but past some threshold the limiting factor becomes knowing which agent can actually do what. The corpus suggests that bottleneck is real, but that it is a consequence of two things growing at once: the *number* of agents and the *fluidity* of what each one can do.

Start with heterogeneity. The economically rational way to build agent systems isn't one big model; it's many small specialized ones, with large models called selectively (Can small language models handle most agent tasks?). That design choice is what creates the problem: the more varied your fleet, the less any central router can hold a hand-maintained map of who does what. This is the explicit pitch behind treating capability matching as a first-class, indexed operation: versioned capability vectors in an HNSW index let discovery scale sub-linearly, which matters *precisely because* manual wiring breaks down as agent heterogeneity rises (Can semantic capability vectors replace manual agent routing?). Discovery becomes the bottleneck because the thing you're searching over got too big and too diverse to enumerate.
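To make the indexed-matching idea concrete, here is a minimal sketch in Python. It is not the cited system's implementation: a brute-force cosine scan stands in for the HNSW index (which would return approximate nearest neighbours without scanning every entry), and the `AgentCard` fields, the `discover` function, the toy three-dimensional vectors, the cost figures, and the policy tags are all hypothetical, invented for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCard:
    """Hypothetical registry entry; field names are assumptions, not a real schema."""
    name: str
    version: int        # bumped whenever the agent's capabilities change
    vector: tuple       # toy capability embedding, hand-written here
    cost: float         # assumed cost per call, arbitrary units
    tags: frozenset     # assumed policy tags the agent satisfies


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def discover(query, registry, *, budget, required_tags=frozenset(), k=1):
    """Filter by budget and policy first, then rank the survivors by similarity.

    The linear scan below is the part an HNSW index would replace at scale.
    """
    eligible = [c for c in registry
                if c.cost <= budget and required_tags <= c.tags]
    ranked = sorted(eligible, key=lambda c: cosine(query, c.vector), reverse=True)
    return ranked[:k]


REGISTRY = [
    AgentCard("sql-writer", 3, (0.9, 0.1, 0.0), 0.2, frozenset({"internal"})),
    AgentCard("big-generalist", 1, (0.6, 0.6, 0.5), 5.0, frozenset({"internal", "pii-ok"})),
    AgentCard("summarizer", 2, (0.1, 0.9, 0.1), 0.1, frozenset({"internal", "pii-ok"})),
]

query = (1.0, 0.0, 0.0)  # toy embedding of "write a SQL query"
print(discover(query, REGISTRY, budget=1.0)[0].name)  # sql-writer
```

The same query returns a different agent once constraints change: requiring the `pii-ok` tag under the same budget leaves only `summarizer` eligible, which is one way to read the claim that semantic matching is coupled with policy and budget rather than run on its own.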

The second pressure is that capabilities don't sit still. Agents accumulate reusable sub-task routines from past experience (Can agents learn reusable sub-task routines from past experience?), build executable skill libraries that compose simple skills into complex ones (Can agents learn new skills without forgetting old ones?), and in shared ecosystems those skills evolve across users through centralized aggregation (How can agent systems share learned skills across users?). So what an agent *can* do is a moving target. You're not indexing a fixed catalog; you're tracking a continuously changing one. That's why discovery is a standing bottleneck rather than a one-time setup cost.
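A small Python sketch can show why a changing catalog defeats a one-time index. It assumes a hypothetical `SkillLibrary` class with a version counter; none of these names come from the cited systems, and the string-cleaning skills are placeholders for real agent skills.

```python
class SkillLibrary:
    """Hypothetical skill store: every change bumps a version counter."""

    def __init__(self):
        self._skills = {}
        self.version = 0

    def add(self, name, fn):
        """Register (or overwrite) a skill and mark the catalog as changed."""
        self._skills[name] = fn
        self.version += 1

    def compose(self, name, steps):
        """Build a new skill by chaining existing ones in order."""
        fns = [self._skills[s] for s in steps]  # binds the current implementations

        def pipeline(x):
            for f in fns:
                x = f(x)
            return x

        self.add(name, pipeline)

    def names(self):
        return sorted(self._skills)

    def run(self, name, x):
        return self._skills[name](x)


lib = SkillLibrary()
lib.add("strip", str.strip)
lib.add("lower", str.lower)

# What a discovery index would have recorded at indexing time.
indexed_version, indexed_names = lib.version, lib.names()

# The agent then learns a composite skill from the two simple ones.
lib.compose("normalize", ["strip", "lower"])

print(lib.run("normalize", "  Hello "))    # hello
print("normalize" in indexed_names)        # False: the index never saw it
print(indexed_version != lib.version)      # True: the snapshot is stale
```

The version comparison is the point of the sketch: any index built from a snapshot needs some signal like this to know when to re-embed an agent's capabilities.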

But here's the turn the corpus offers: capability discovery may be the *visible* bottleneck without being the *binding* one. Several notes argue that once agents become social and economic actors, raw capability stops being the constraint, and coordination, settlement, and auditable trust take over (When do agents need coordination more than raw capability?). Historical analysis finds that capable agents stall for want of ecosystem conditions such as trustworthiness, standardization, and social acceptability, rather than because of capability gaps (Why do capable AI agents still fail in real deployments?). And even when agents find each other, coordination degrades predictably with network scale: they agree too late or adopt strategies without telling neighbors, and they accept information from peers without verification (Why do multi-agent systems fail to coordinate at scale?). In that light, 'finding the right capability' is the easy half; trusting and orchestrating it is the hard half.

The thing you might not have known you wanted: the deepest framing here is that all of these (discovery, memory, coordination) may be the same structural pressure wearing different masks. Reliability comes from externalizing memory, skills, and protocols into a harness layer instead of cramming everything into the model (Where does agent reliability actually come from?), and efficiency techniques across memory, tool use, and planning independently converge on the same principles: bound your context, minimize external calls, control your search (Do efficiency techniques across agent components reveal shared structural constraints?). Capability discovery is exactly a 'minimize and control the search' problem. So it becomes the bottleneck not because matching is uniquely hard, but because scaling agents turns *everything* into a search-and-coordination problem the model can't hold in its head, and discovery is where that pressure surfaces first.
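As a rough illustration of those three principles applied to discovery, the Python sketch below wraps a lookup in a bounded cache and caps the number of candidates returned. The `registry_lookup` function and its catalog are hypothetical stand-ins for an external registry call; only `functools.lru_cache` is a real library feature.

```python
from functools import lru_cache

CALLS = {"registry": 0}  # counts how often the "external" registry is hit


def registry_lookup(task):
    """Hypothetical stand-in for an external registry call."""
    CALLS["registry"] += 1
    catalog = {
        "translate": ["mt-small", "mt-large"],
        "summarize": ["sum-small"],
    }
    return tuple(catalog.get(task, ()))


@lru_cache(maxsize=128)              # bound the context: at most 128 cached answers
def candidates(task, k=1):
    """Return at most k candidates (controlled search); repeated calls
    with the same arguments are served from the cache (fewer external calls)."""
    return registry_lookup(task)[:k]


for _ in range(3):
    picked = candidates("translate")

print(picked)             # ('mt-small',)
print(CALLS["registry"])  # 1
```

Three identical requests cost one registry call here. A real system would also need an invalidation rule, since a cached answer goes stale as soon as agent capabilities change.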


Sources (10 notes)

Can small language models handle most agent tasks?

SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.

Can semantic capability vectors replace manual agent routing?

Versioned Capability Vectors embedded in HNSW indices couple semantic matching with policy and budget constraints, making capability discovery a first-class operation that scales sub-linearly as agent heterogeneity increases.

Can agents learn reusable sub-task routines from past experience?

Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces a 24.6% relative gain on Mind2Web and a 51.1% relative gain on WebArena, with larger gains as train-test gaps widen.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

How can agent systems share learned skills across users?

SkillClaw aggregates interaction trajectories across users, processes them through an autonomous evolver that identifies patterns and refines skills, then synchronizes updates system-wide. This converts siloed individual learning into shared capability improvement without manual curation.

When do agents need coordination more than raw capability?

Once agents hold credentials, transact value, and interact with other agents, raw model capability stops being the limiting factor. The real bottleneck becomes whether agents can coordinate reliably, settle accounts, and leave auditable evidence of their actions.

Why do capable AI agents still fail in real deployments?

Historical analysis from GPS to modern AI shows agent failures consistently result from absent ecosystem conditions—value generation, personalization, trustworthiness, social acceptability, and standardization—rather than capability gaps. Even highly capable systems stall without these five conditions.

Why do multi-agent systems fail to coordinate at scale?

The AgentsNet benchmark shows agents fail to coordinate strategies, either by agreeing too late or by adopting strategies without informing neighbors. Agents accept neighbor information without verification, which lets errors propagate, even though they remain capable of detecting direct conflicts.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

Do efficiency techniques across agent components reveal shared structural constraints?

Techniques for memory, tool learning, and planning independently converge on shared principles: context bounding, minimizing external calls, and controlled search. This convergence suggests these reflect fundamental structural pressures in agentic computation rather than component-specific optimizations.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing constraints in multi-agent LLM systems. The question: Does capability discovery remain the binding bottleneck as agent fleets scale, or have newer models, coordination frameworks, and evaluation methods since shifted the constraint elsewhere?

What a curated library found — and when (dated claims, not current truth):
These findings span 2024–2026; treat them as perishable snapshots:
- Heterogeneous agent fleets (small specialized models + selective large-model calls) make hand-wired routing unscalable; capability vectors in HNSW indices restore sub-linear discovery (2025–26).
- Agent capabilities are fluid: they accumulate reusable sub-task routines, build executable skill libraries through synthesis, and evolve across users via centralized aggregation (2024–2026).
- Once agents become social/economic actors, coordination, settlement, and auditable trust displace raw capability matching as the binding constraint (2026).
- Distributed multi-agent coordination degrades predictably with network scale; agents agree late or adopt strategies without peer notification (2025).
- Reliability externalizes cognitive burdens into memory, skills, and protocols; efficiency across memory, tool use, and planning converges on the same principles (2026).

Anchor papers (verify; mind their dates):
- 2506.02153: Small Language Models are the Future of Agentic AI
- 2509.20175: Federation of Agents
- 2604.08224: Externalization in LLM Agents
- 2605.06614: SkillOS

Your task:
(1) RE-TEST THE CONSTRAINT SHIFT. The library claims discovery is the *visible* bottleneck but coordination/trust is the *binding* one. For each: judge whether 2025–26 harnesses, standardized MCP protocols, or new trust/audit methods have relaxed the coordination bottleneck, or whether it's sharpened. Does discovery still surface first, or do newer systems bypass it entirely (e.g., via pre-wired capability registries or zero-shot delegation)? Cite what relaxed or deepened each constraint.
(2) Surface the strongest CONTRADICTING work from the last 6 months. The library notes 2604.02460 claims single-agent LLMs outperform multi-agent systems on multi-hop reasoning—does that invalidate the scaling-discovery narrative, or does it apply only to reasoning tasks?
(3) Propose 2 research questions assuming the regime has moved: (a) If externalizing memory/skills/protocols is now the consensus, does capability discovery become a *protocol-conformance* problem (matching agents to interface specs) rather than a *semantic matching* problem? (b) As skill ecosystems mature (SkillClaw, SkillOS), does discovery shift from agent-to-agent to skill-to-task, flattening the hierarchy?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
