INQUIRING LINE


Should an AI agent use the same horsepower on every subtask, or match its effort to what each step actually needs?

How should proportionality constraints be implemented in agentic systems?

This explores how agentic systems should match the resources they spend — compute, model size, search depth, coordination overhead — to what a task actually demands, rather than over- or under-provisioning by default.

The corpus doesn't use the phrase "proportionality constraints," but it circles the same idea from several directions. The most striking finding is that in agent systems, capability is largely a *spending* decision. Research shows roughly 80% of multi-agent performance variance comes from token budget, not coordination intelligence [How does test-time scaling work at the agent level?], and that search steps scale much like reasoning tokens, which makes retrieval just another compute axis you dial up or down [How does test-time scaling work for individual research agents?]. If performance tracks spend this directly, then deciding *how much* to spend on a given subtask isn't a tuning detail; it's the core design lever. Proportionality is what stops you from paying frontier-model prices for clerical work.

The sharpest concrete implementation is heterogeneous model routing: use small language models (SLMs) by default and reserve large ones for the moments that genuinely need them. SLMs handle the repetitive, well-defined subtasks that make up most agent work at 10–30× lower cost, which makes "small by default, large selectively" the economically rational pattern rather than a compromise [Can small language models handle most agent tasks?]. That's proportionality at the model-selection layer, and the same logic shows up independently in other components. Techniques for memory, tool use, and planning all converge on the same three moves: bound the context, minimize external calls, control the search [Do efficiency techniques across agent components reveal shared structural constraints?]. When unrelated subsystems independently discover "spend less unless the task earns more," that's a sign proportionality reflects a structural pressure in agentic computation, not a per-component hack.
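The "small by default, large selectively" pattern can be sketched as a tiny router. This is a minimal illustration, not an implementation from any of the cited notes: the tier names, the `Subtask` fields, the escalation rule, and the 20× cost ratio (one arbitrary value inside the 10–30× range quoted above) are all assumptions.

```python
from dataclasses import dataclass

# Hypothetical tier names and relative costs. The 20x ratio is an illustrative
# value inside the 10-30x range quoted in the text, not a measured figure.
SMALL = "small-model"
LARGE = "large-model"
RELATIVE_COST = {SMALL: 1.0, LARGE: 20.0}


@dataclass(frozen=True)
class Subtask:
    name: str
    well_defined: bool     # clear input/output contract (extraction, formatting)
    novel_reasoning: bool  # needs open-ended, multi-step reasoning


def route(task: Subtask) -> str:
    """Small by default; escalate only when the subtask earns it."""
    if task.novel_reasoning or not task.well_defined:
        return LARGE
    return SMALL


def plan_cost(tasks: list[Subtask]) -> float:
    """Total relative cost of a plan under the routing rule above."""
    return sum(RELATIVE_COST[route(t)] for t in tasks)


tasks = [
    Subtask("parse_invoice", well_defined=True, novel_reasoning=False),
    Subtask("format_json", well_defined=True, novel_reasoning=False),
    Subtask("design_migration_plan", well_defined=False, novel_reasoning=True),
]
routed = {t.name: route(t) for t in tasks}
all_large_cost = RELATIVE_COST[LARGE] * len(tasks)
```

Under these assumed numbers, the routed plan costs 22 relative units against 60 for sending everything to the large tier. In practice the escalation signal would more likely come from a classifier or a failed first attempt than from hand-set flags.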

Where should the constraint actually live? The corpus suggests: not inside the model, but in the harness around it. Reliable agents externalize their cognitive burdens (state, procedural skills, structured protocols) into a harness layer rather than leaning on raw model scale [Where does agent reliability actually come from?]. That's the natural home for a proportionality policy too: a routing and budgeting layer that decides which model, how many search steps, and how much coordination each task gets. Representing agents as optimizable computational graphs makes this even more concrete. If nodes (operations) and edges (information flow) are explicit, you can automatically tune both the prompts *and* the connectivity, which suggests budget allocation can become something you optimize rather than guess [Can we automatically optimize both prompts and agent coordination?].
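One way to picture a proportionality policy living in the harness is a small budgeting layer that hands each task a bounded allowance and refuses overspend. The sketch below is hypothetical throughout: the `Demand` levels, the `Budget` fields, and every number in `POLICY` are invented for illustration and do not come from the cited work.

```python
from dataclasses import dataclass
from enum import Enum


class Demand(Enum):
    ROUTINE = 1
    MODERATE = 2
    HARD = 3


@dataclass(frozen=True)
class Budget:
    model: str
    max_tokens: int
    max_search_steps: int
    max_agents: int  # coordination is budgeted too, not just compute


# Hypothetical policy table; all values are placeholders to be tuned.
POLICY = {
    Demand.ROUTINE: Budget("small-model", 2_000, 0, 1),
    Demand.MODERATE: Budget("small-model", 8_000, 3, 1),
    Demand.HARD: Budget("large-model", 32_000, 10, 3),
}


class BudgetExceeded(Exception):
    """Raised when a step tries to spend more than its allowance."""


class Harness:
    """Holds the policy outside the model and enforces it per task."""

    def __init__(self, policy: dict[Demand, Budget]) -> None:
        self.policy = policy
        self.spent_tokens = 0

    def allocate(self, demand: Demand) -> Budget:
        return self.policy[demand]

    def charge(self, budget: Budget, tokens_used: int) -> None:
        if tokens_used > budget.max_tokens:
            raise BudgetExceeded(
                f"{tokens_used} tokens exceeds cap of {budget.max_tokens}"
            )
        self.spent_tokens += tokens_used


harness = Harness(POLICY)
routine = harness.allocate(Demand.ROUTINE)
harness.charge(routine, 1_500)
```

Because the table is plain data rather than model behavior, it is the kind of thing an optimizer could adjust, which is the sense in which budget allocation becomes tunable.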

The part you might not expect: proportionality matters *most* precisely where adding more agents stops helping. Multi-agent coordination degrades predictably as the network grows: agents agree too late or adopt strategies without telling their neighbors [Why do multi-agent systems fail to coordinate at scale?], and consensus tends to fail through liveness loss (timeouts, stalled convergence) rather than corrupted values, with agreement getting worse as group size grows even with no bad actors present [Can LLM agent groups reliably reach consensus together?]. So throwing more coordinating agents at a problem has a real ceiling. A proportionality constraint should therefore govern not just compute-per-task but *number of participants*, and it should bias toward composing existing protocols rather than building heavier new ones, since coordination layers win by wrapping standards like MCP rather than replacing them [Should coordination protocols wrap existing systems or replace them?].

One honest gap worth naming: the corpus is rich on proportioning *machine* resources but thin on the hardest case, which is when to spend a *human's* attention. The closest material reframes the problem: rather than solving optimal hand-off timing directly (there's no ground truth for it), distribute the decision across six interaction mechanisms such as action guards and verification [When should human-agent systems ask for human help?]. And as agents start holding credentials and moving value, the binding constraint shifts from capability to governance: can they coordinate, settle, and leave auditable evidence [When do agents need coordination more than raw capability?]. That suggests the next frontier for proportionality isn't cost-per-token at all; it's matching the *level of oversight* to the stakes of the action.
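Matching oversight to stakes can be sketched as an action guard that picks a tier before an action runs. This is an assumption-heavy illustration: the three tiers, the `HIGH_VALUE` threshold, and the rule itself are invented here, and the source does not describe how Magentic-UI's action guards are actually implemented.

```python
from enum import IntEnum
from typing import Callable, Optional


class Oversight(IntEnum):
    NONE = 0      # act autonomously
    LOG_ONLY = 1  # act, but leave an auditable record
    CONFIRM = 2   # ask a human before acting


HIGH_VALUE = 100.0  # hypothetical threshold, in whatever unit value is measured


def required_oversight(
    reversible: bool, value_at_risk: float, uses_credentials: bool
) -> Oversight:
    """Spend human attention only where the stakes justify it."""
    if not reversible or value_at_risk >= HIGH_VALUE:
        return Oversight.CONFIRM
    if uses_credentials or value_at_risk > 0:
        return Oversight.LOG_ONLY
    return Oversight.NONE


def guarded_execute(
    action: Callable[[], str],
    level: Oversight,
    approve: Callable[[], bool],
    audit_log: list[str],
) -> Optional[str]:
    """Run `action` under the given oversight tier; return None if blocked."""
    if level is Oversight.CONFIRM and not approve():
        audit_log.append("blocked: human declined")
        return None
    result = action()
    if level >= Oversight.LOG_ONLY:
        audit_log.append(f"executed: {result}")
    return result


log: list[str] = []
level = required_oversight(reversible=False, value_at_risk=10.0, uses_credentials=True)
outcome = guarded_execute(lambda: "payment sent", level, approve=lambda: False, audit_log=log)
```

Here an irreversible action is escalated to `CONFIRM`, the simulated human declines, and the guard records the block instead of executing. Cheap, reversible actions would pass through without consuming any human attention.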

Sources 11 notes

How does test-time scaling work at the agent level?

Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.

How does test-time scaling work for individual research agents?

Research shows that deep research agents exhibit test-time scaling laws where search steps scale similarly to reasoning tokens, and live search outperforms memorized retrieval on knowledge-intensive tasks. Data efficiency is extreme—78 curated demonstrations outperform 10K samples for agency.

Can small language models handle most agent tasks?

SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.

Do efficiency techniques across agent components reveal shared structural constraints?

Techniques for memory, tool learning, and planning independently converge on shared principles: context bounding, minimizing external calls, and controlled search. This convergence suggests these reflect fundamental structural pressures in agentic computation rather than component-specific optimizations.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.


Can we automatically optimize both prompts and agent coordination?

Language agents represented as computational graphs—where nodes are operations and edges define information flow—reveal that CoT, ToT, and Reflexion are formally equivalent structures. This unified view enables automatic optimization of both node prompts and edge connectivity without manual redesign.

Why do multi-agent systems fail to coordinate at scale?

AgentsNet benchmark shows agents fail to coordinate strategies either by agreeing too late or adopting strategies without informing neighbors. Agents accept neighbor information without verification, enabling error propagation while remaining capable of detecting direct conflicts.

Can LLM agent groups reliably reach consensus together?

Across hundreds of simulations, LLM-agent groups frequently fail to reach valid agreement due to timeouts and stalled convergence rather than subtle value corruption. Agreement degrades with group size even without Byzantine agents present.

Should coordination protocols wrap existing systems or replace them?

Research shows that agent coordination standards achieve adoption by composing existing protocols like MCP and DIDComm under a shared substrate, rather than competing to replace them. Bridging lets value accrue incrementally without forcing ecosystem-wide rewrites.

When should human-agent systems ask for human help?

Magentic-UI identifies co-planning, co-tasking, action guards, verification, memory, and multitasking as mechanisms that work around the lack of ground truth for optimal deferral timing. Rather than solving the timing problem directly, these mechanisms distribute decision-making across multiple touchpoints.

When do agents need coordination more than raw capability?

Once agents hold credentials, transact value, and interact with other agents, raw model capability stops being the limiting factor. The real bottleneck becomes whether agents can coordinate reliably, settle accounts, and leave auditable evidence of their actions.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Towards a Science of Scaling Agent Systems (5.92 match, arXiv)
Scaling Behavior of Single LLM-Driven Multi-Agent Systems (4.17 match, arXiv)
Drop the Hierarchy and Roles: How Self-Organizing LLM Agents Outperform Designed Structures (4.17 match, arXiv)
From Model Scaling to System Scaling: Scaling the Harness in Agentic AI (3.34 match, arXiv)
LLMs Corrupt Your Documents When You Delegate (3.31 match, arXiv)
Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets (2.54 match, arXiv)
AgentsNet: Coordination and Collaborative Reasoning in Multi-Agent LLMs (2.50 match, arXiv)
Artifacts as Memory Beyond the Agent Boundary (2.47 match, arXiv)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about proportionality constraints in agentic systems. The question remains: *how should agentic systems match resource spend to task demand?* This is still open.

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat all as perishable:
• ~80% of multi-agent performance variance comes from token budget, not coordination logic (2025–2026).
• Search steps and reasoning tokens follow similar scaling curves; retrieval is a compute axis you dial (2025).
• Heterogeneous model routing (small by default, large selectively) reduces cost 10–30× on routine subtasks (2026).
• Coordination degrades predictably with network scale; agents fail through liveness loss (timeouts), not value corruption (2026).
• Proportionality constraints live best in the *harness* (routing/budgeting layer), not inside the model; they're optimizable via computational graphs (2024–2026).

Anchor papers (verify; mind their dates):
• arXiv:2506.02153 (Small Language Models, 2025)
• arXiv:2604.08224 (Externalization in LLM Agents, 2026)
• arXiv:2603.01213 (Can AI Agents Agree?, 2026)
• arXiv:2605.23218 (Foundation Protocol, 2026)

Your task:
(1) RE-TEST the token-budget dominance and scaling parallelism claims. Have newer SLM orchestration methods, post-training techniques, or multi-agent harnesses since relaxed the cost gap or changed which layers benefit most from scaling? Isolate what *still* holds (scaling curves likely durable; cost ratios may have shifted).
(2) Surface the strongest work contradicting the multi-agent ceiling claim — any papers from late 2025 or 2026 showing coordination *does* scale, or that single-agent reasoning isn't uniformly superior. Identify the key disagreement (task structure? protocol design?).
(3) Propose 2 research questions assuming the regime has moved: (a) How do proportionality constraints interact with credential-holding agents in economic settings? (b) Can proportionality be *learned* end-to-end rather than hand-tuned per harness?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
