INQUIRING LINE

How should proportionality constraints be implemented in agentic systems?

This explores how agentic systems should match the resources they spend — compute, model size, search depth, coordination overhead — to what a task actually demands, rather than over- or under-provisioning by default.


The corpus doesn't use the phrase "proportionality constraints," but it circles the same idea from several directions. The most striking finding is that in agent systems, capability is largely a *spending* decision. Research shows roughly 80% of multi-agent performance variance comes from token budget, not coordination intelligence (How does test-time scaling work at the agent level?), and search steps follow the same scaling curve as reasoning tokens, which makes retrieval just another compute axis you dial up or down (How does search scale like reasoning in agent systems?). If performance tracks spend this directly, then deciding *how much* to spend on a given subtask isn't a tuning detail; it's the core design lever. Proportionality is what stops you from paying frontier-model prices for clerical work.

The sharpest concrete implementation is heterogeneous model routing: use small language models by default and reserve large ones for the moments that genuinely need them. SLMs handle the repetitive, well-defined subtasks that make up most agent work at 10–30× lower cost, which makes "small by default, large selectively" the economically rational pattern rather than a compromise (Can small language models handle most agent tasks?). That's proportionality at the model-selection layer, and the same logic shows up independently across other components. Techniques for memory, tool learning, and planning all converge on the same three moves: bound the context, minimize external calls, control the search (Do efficiency techniques across agent components reveal shared structural constraints?). When unrelated subsystems independently discover "spend less unless the task earns more," that's a sign proportionality reflects a structural pressure in agentic computation, not a per-component hack.
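As a rough illustration of "small by default, large selectively," the routing rule can be sketched in a few lines of Python. Everything here is an assumption made for the example: the tier names, the `Subtask` fields, the escalation rule, and the 20× cost ratio (chosen only because it sits inside the 10–30× range cited above). The corpus does not prescribe any of them.

```python
from dataclasses import dataclass

# Hypothetical model tiers; names and relative costs are illustrative assumptions.
SMALL = "small-model"
LARGE = "large-model"
RELATIVE_COST = {SMALL: 1.0, LARGE: 20.0}  # assumed ratio inside the cited 10-30x range


@dataclass
class Subtask:
    description: str
    well_defined: bool     # has a clear schema, template, or fixed procedure
    novel_reasoning: bool  # needs open-ended reasoning the small model may not manage


def route(task: Subtask) -> str:
    """Small by default; escalate only when the task earns it."""
    if task.novel_reasoning or not task.well_defined:
        return LARGE
    return SMALL


def plan_cost(tasks: list[Subtask]) -> float:
    """Total relative cost of a plan under the routing rule."""
    return sum(RELATIVE_COST[route(t)] for t in tasks)


tasks = [
    Subtask("extract invoice fields", well_defined=True, novel_reasoning=False),
    Subtask("format a JSON tool call", well_defined=True, novel_reasoning=False),
    Subtask("diagnose an ambiguous failure", well_defined=False, novel_reasoning=True),
]

routed = [route(t) for t in tasks]
routed_cost = plan_cost(tasks)                       # 1 + 1 + 20 = 22.0
all_large_cost = RELATIVE_COST[LARGE] * len(tasks)   # 20 * 3 = 60.0
print(routed, routed_cost, all_large_cost)
```

In a real system the escalation signal would more plausibly come from a classifier or from a failed attempt by the small model than from hand-set flags; the flags just keep the sketch deterministic.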

Where should the constraint actually live? The corpus suggests: not inside the model, but in the harness around it. Reliable agents externalize their cognitive burdens (memory, procedural skills, structured protocols) into a harness layer rather than leaning on raw model scale (Where does agent reliability actually come from?). That's the natural home for a proportionality policy too: a routing and budgeting layer that decides which model, how many search steps, and how much coordination each task gets. Representing agents as optimizable computational graphs makes this even more concrete. If nodes (operations) and edges (information flow) are explicit, you can automatically tune both the prompts *and* the connectivity, which means budget allocation becomes something you optimize rather than guess (Can we automatically optimize both prompts and agent coordination?).
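One way to picture a harness-level budgeting layer is as a policy table consulted outside the model, plus a small object that enforces the limits during a run. The sketch below is a minimal illustration under stated assumptions: the task kinds, the `Budget` fields, and the numbers are all hypothetical, and falling back to the cheapest tier for unknown task kinds is a design choice of this example, not something the corpus specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    model: str             # which model tier the task may use
    max_search_steps: int  # retrieval treated as a compute axis with a ceiling
    max_agents: int        # how much coordination the task may buy


# Hypothetical policy table; task kinds and limits are illustrative assumptions.
POLICY = {
    "clerical": Budget(model="small-model", max_search_steps=0, max_agents=1),
    "lookup":   Budget(model="small-model", max_search_steps=3, max_agents=1),
    "research": Budget(model="large-model", max_search_steps=12, max_agents=3),
}


def budget_for(task_kind: str) -> Budget:
    """Unknown task kinds get the cheapest budget (an assumption of this sketch)."""
    return POLICY.get(task_kind, POLICY["clerical"])


class BudgetedRun:
    """Tracks spend for one task; the harness, not the model, enforces the limit."""

    def __init__(self, budget: Budget) -> None:
        self.budget = budget
        self.search_steps_used = 0

    def try_search(self) -> bool:
        """Consume one search step if the budget allows it; otherwise refuse."""
        if self.search_steps_used >= self.budget.max_search_steps:
            return False
        self.search_steps_used += 1
        return True


run = BudgetedRun(budget_for("lookup"))
attempts = [run.try_search() for _ in range(5)]
print(attempts)  # [True, True, True, False, False]
```

Because the table is plain data, it is also the kind of thing an optimizer could tune, which is the sense in which budget allocation becomes something you optimize rather than guess.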

The part you might not expect: proportionality matters *most* precisely where adding more agents stops helping. Multi-agent coordination degrades predictably as the network grows: agents agree too late or adopt strategies without telling their neighbors (Why do multi-agent systems fail to coordinate at scale?). Consensus tends to fail through liveness loss (timeouts, stalled convergence) rather than corrupted values, with agreement getting worse as group size grows even with no bad actors present (Can LLM agent groups reliably reach consensus together?). So throwing more coordinating agents at a problem has a real ceiling. A proportionality constraint should therefore govern not just compute-per-task but *number of participants*. It should also bias toward composing existing protocols rather than building heavier new ones, since coordination layers win by wrapping standards like MCP rather than replacing them (Should coordination protocols wrap existing systems or replace them?).
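A participant cap and an explicit liveness deadline can both be expressed as small harness functions. The following is a sketch only: the cap values, the quorum rule, and the pre-recorded proposal rounds are assumptions chosen to keep the example deterministic, not a description of how the cited benchmarks or simulations work.

```python
from collections import Counter

# Hypothetical caps on how many agents may join a task, keyed by task stakes.
PARTICIPANT_CAPS = {"low": 1, "medium": 3, "high": 5}


def team_size(requested: int, stakes: str) -> int:
    """Clamp the requested team size to the cap for this stakes level."""
    return max(1, min(requested, PARTICIPANT_CAPS[stakes]))


def reach_agreement(rounds, max_rounds: int, quorum: float):
    """Return (value, rounds_used), or (None, rounds_used) on a liveness failure.

    rounds[r][i] is the value agent i proposes in round r. The rounds are
    pre-recorded here so the sketch stays deterministic.
    """
    used = 0
    for proposals in rounds[:max_rounds]:
        used += 1
        value, count = Counter(proposals).most_common(1)[0]
        if count / len(proposals) >= quorum:
            return value, used
    # No quorum within the deadline: report the stall instead of waiting forever.
    return None, used


rounds = [
    ["a", "b", "c"],  # round 1: no agreement
    ["a", "a", "b"],  # round 2: majority, but not unanimous
    ["a", "a", "a"],  # round 3: unanimous
]

capped = team_size(requested=8, stakes="medium")                # 3
stalled = reach_agreement(rounds, max_rounds=2, quorum=1.0)     # (None, 2)
converged = reach_agreement(rounds, max_rounds=3, quorum=1.0)   # ("a", 3)
print(capped, stalled, converged)
```

Returning `None` on a missed deadline makes the liveness failure visible to the harness, which can then fall back to a smaller group or a single agent rather than spending more rounds.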

One honest gap worth naming: the corpus is rich on proportioning *machine* resources but thin on the hardest case, which is when to spend a *human's* attention. The closest material reframes the problem: rather than solving optimal hand-off timing directly (there's no ground truth for it), distribute the decision across six interaction touchpoints like action guards and verification (When should human-agent systems ask for human help?). And as agents start holding credentials and moving value, the binding constraint shifts from capability to governance: can they coordinate, settle, and leave auditable evidence (When do agents need coordination more than raw capability?). That suggests the next frontier for proportionality isn't cost-per-token at all; it's matching the *level of oversight* to the stakes of the action.
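Matching oversight to stakes can be sketched as an action guard that maps properties of an action to a required level of human involvement. This is a hypothetical illustration: the oversight levels, the `Action` fields, and the 100.0 approval threshold are assumptions of the example, and the corpus offers no ground truth for where such thresholds should sit.

```python
from dataclasses import dataclass
from enum import IntEnum


class Oversight(IntEnum):
    NONE = 0            # agent proceeds on its own
    LOG_ONLY = 1        # proceed, but leave auditable evidence
    VERIFY_AFTER = 2    # proceed, then ask a human to check the result
    APPROVE_BEFORE = 3  # block until a human approves


@dataclass
class Action:
    name: str
    reversible: bool
    uses_credentials: bool
    value_at_risk: float  # in whatever unit the deployment settles in


def required_oversight(action: Action, approval_threshold: float = 100.0) -> Oversight:
    """Hypothetical guard: human attention scales with the stakes of the action."""
    high_value = action.value_at_risk >= approval_threshold
    if action.uses_credentials and (not action.reversible or high_value):
        return Oversight.APPROVE_BEFORE
    if not action.reversible:
        return Oversight.VERIFY_AFTER
    if action.uses_credentials or action.value_at_risk > 0:
        return Oversight.LOG_ONLY
    return Oversight.NONE


read_file = Action("read_file", reversible=True, uses_credentials=False, value_at_risk=0.0)
delete_tmp = Action("delete_tmp", reversible=False, uses_credentials=False, value_at_risk=0.0)
send_payment = Action("send_payment", reversible=False, uses_credentials=True, value_at_risk=500.0)

levels = [required_oversight(a) for a in (read_file, delete_tmp, send_payment)]
print([level.name for level in levels])  # ['NONE', 'VERIFY_AFTER', 'APPROVE_BEFORE']
```

The guard covers only one of the six touchpoints named above; the point is that the human's attention is budgeted by the same kind of rule that budgets tokens and search steps.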


Sources (11 notes)

How does test-time scaling work at the agent level?

Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.

How does search scale like reasoning in agent systems?

Test-time scaling laws generalize from reasoning to retrieval: search steps follow identical scaling curves to reasoning tokens, making deep research a test-time scaling problem. This insight reframes search as a compute axis comparable to chain-of-thought reasoning.

Can small language models handle most agent tasks?

SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.

Do efficiency techniques across agent components reveal shared structural constraints?

Techniques for memory, tool learning, and planning independently converge on shared principles: context bounding, minimizing external calls, and controlled search. This convergence suggests these reflect fundamental structural pressures in agentic computation rather than component-specific optimizations.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

Can we automatically optimize both prompts and agent coordination?

Language agents represented as computational graphs—where nodes are operations and edges define information flow—reveal that CoT, ToT, and Reflexion are formally equivalent structures. This unified view enables automatic optimization of both node prompts and edge connectivity without manual redesign.

Why do multi-agent systems fail to coordinate at scale?

AgentsNet benchmark shows agents fail to coordinate strategies either by agreeing too late or adopting strategies without informing neighbors. Agents accept neighbor information without verification, enabling error propagation while remaining capable of detecting direct conflicts.

Can LLM agent groups reliably reach consensus together?

Across hundreds of simulations, LLM-agent groups frequently fail to reach valid agreement due to timeouts and stalled convergence rather than subtle value corruption. Agreement degrades with group size even without Byzantine agents present.

Should coordination protocols wrap existing systems or replace them?

Research shows that agent coordination standards achieve adoption by composing existing protocols like MCP and DIDComm under a shared substrate, rather than competing to replace them. Bridging lets value accrue incrementally without forcing ecosystem-wide rewrites.

When should human-agent systems ask for human help?

Magentic-UI identifies co-planning, co-tasking, action guards, verification, memory, and multitasking as mechanisms that work around the lack of ground truth for optimal deferral timing. Rather than solving the timing problem directly, these mechanisms distribute decision-making across multiple touchpoints.

When do agents need coordination more than raw capability?

Once agents hold credentials, transact value, and interact with other agents, raw model capability stops being the limiting factor. The real bottleneck becomes whether agents can coordinate reliably, settle accounts, and leave auditable evidence of their actions.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about proportionality constraints in agentic systems. The question remains: *how should agentic systems match resource spend to task demand?* This is still open.

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat all as perishable:
• ~80% of multi-agent performance variance comes from token budget, not coordination logic (2025–2026).
• Search steps and reasoning tokens follow identical scaling curves; retrieval is a compute axis you dial (2025).
• Heterogeneous model routing (small by default, large selectively) reduces cost 10–30× on routine subtasks (2026).
• Coordination degrades predictably with network scale; agents fail through liveness loss (timeouts), not value corruption (2026).
• Proportionality constraints live best in the *harness* (routing/budgeting layer), not inside the model; they're optimizable via computational graphs (2024–2026).

Anchor papers (verify; mind their dates):
• arXiv:2506.02153 (Small Language Models, 2025)
• arXiv:2604.08224 (Externalization in LLM Agents, 2026)
• arXiv:2603.01213 (Can AI Agents Agree?, 2026)
• arXiv:2605.23218 (Foundation Protocol, 2026)

Your task:
(1) RE-TEST the token-budget dominance and scaling parallelism claims. Have newer SLM orchestration methods, post-training techniques, or multi-agent harnesses since relaxed the cost gap or changed which layers benefit most from scaling? Isolate what *still* holds (scaling curves likely durable; cost ratios may have shifted).
(2) Surface the strongest work contradicting the multi-agent ceiling claim — any papers from late 2025 or 2026 showing coordination *does* scale, or that single-agent reasoning isn't uniformly superior. Identify the key disagreement (task structure? protocol design?).
(3) Propose 2 research questions assuming the regime has moved: (a) How do proportionality constraints interact with credential-holding agents in economic settings? (b) Can proportionality be *learned* end-to-end rather than hand-tuned per harness?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
