INQUIRING LINE

Can the scaling law for discovery extend beyond architectures to agentic systems?

This explores whether the finding that scientific discovery (like architecture search) scales predictably with compute — the way model performance does — also holds for agentic systems that search, coordinate, and reason at test time.


The corpus says: largely yes, with one sharp caveat about coordination. The anchor result is that autonomous architecture discovery follows an empirical scaling law: ASI-ARCH found 106 state-of-the-art designs across 1,773 experiments, and breakthroughs scaled with GPU compute rather than with human cleverness (see "Can computational power accelerate scientific discovery itself?"). The interesting move is to ask whether that compute-scales-discovery pattern is a quirk of architecture search or a general property of search itself.
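One way to make "scales predictably with compute" concrete is to fit a curve to (compute, discoveries) pairs. The sketch below assumes a power-law form, which the notes do not specify, and uses synthetic placeholder numbers rather than ASI-ARCH data; `fit_power_law` and the variable names are hypothetical.

```python
import math

def fit_power_law(compute, discoveries):
    """Least-squares fit of log(discoveries) = log(a) + b * log(compute)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(d) for d in discoveries]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # exponent of the power law
    a = math.exp(mean_y - b * mean_x)   # multiplicative constant
    return a, b

# Synthetic placeholder data (NOT from the source), generated exactly
# from discoveries = 2 * compute ** 0.5.
compute_hours = [100, 400, 1600, 6400]
discoveries = [20, 40, 80, 160]

a, b = fit_power_law(compute_hours, discoveries)
print(f"discoveries ~ {a:.2f} * compute^{b:.2f}")
```

An exponent near 1 would indicate roughly linear returns to compute, while an exponent well below 1 would indicate diminishing returns; which regime real discovery systems fall into is an empirical question the fit alone cannot settle.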

Several notes suggest it's general. Deep research agents exhibit a test-time scaling law in which search budget (the number of retrieval steps) follows the same diminishing-returns curve as reasoning tokens, making search a compute axis you can trade against reasoning (see "How does search scale like reasoning in agent systems?", "Does search budget scale like reasoning tokens for answer quality?", and "Do search steps follow the same scaling rules as reasoning tokens?"). So 'spend more compute, find more' isn't unique to discovering architectures; it reappears whenever an agent is exploring a solution space. Reasoning systems can even scale in width, sampling parallel latent trajectories, to cover that space without paying the serial latency of going deeper (see "Can reasoning systems scale wider instead of only deeper?").
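The idea of trading search budget against reasoning budget can be sketched with a toy model. Everything here is an assumption for illustration: the logarithmic quality function, the equal weights, and the cost of 1,000 tokens per search step are hypothetical and do not come from the notes.

```python
import math

def answer_quality(search_steps, reasoning_tokens, w_search=1.0, w_reason=1.0):
    """Hypothetical quality model: each axis gives logarithmic (diminishing) returns."""
    return (w_search * math.log1p(search_steps)
            + w_reason * math.log1p(reasoning_tokens / 1000))

def best_split(total_tokens, tokens_per_search=1000):
    """Try every search-step count that fits the budget; the remainder goes to reasoning."""
    best = None
    max_steps = total_tokens // tokens_per_search
    for steps in range(max_steps + 1):
        reasoning = total_tokens - steps * tokens_per_search
        quality = answer_quality(steps, reasoning)
        if best is None or quality > best[2]:
            best = (steps, reasoning, quality)
    return best

steps, reasoning, quality = best_split(10_000)
print(steps, reasoning)  # with equal weights the budget splits evenly: 5 5000
```

Because both axes saturate in this toy model, neither extreme (all search or all reasoning) is optimal; changing `w_search` or `w_reason` shifts the best split toward the axis with the higher marginal return.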

There's a deeper reason discovery keeps paying off rather than saturating. Agentic graph reasoning self-organizes into a critical state where roughly 12% of connections stay 'semantically surprising' even after they're structurally linked, meaning the system keeps generating novelty as it runs, fueling continuous discovery instead of collapsing into repetition (see "Why do reasoning systems keep discovering new connections?"). That's a mechanism for why a discovery scaling law could persist at the agent level: the search frontier doesn't go dry.
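To give a feel for what a "semantically surprising" connection might look like in practice, here is one possible operationalization: count the edges whose endpoints have low embedding similarity. This is an assumption for illustration, not the metric used in the underlying analysis; the threshold, the toy two-dimensional embeddings, and the function names are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def surprising_fraction(edges, embeddings, threshold=0.5):
    """Share of edges linking nodes whose embeddings are dissimilar (assumed proxy)."""
    surprising = [
        (a, b) for a, b in edges
        if cosine(embeddings[a], embeddings[b]) < threshold
    ]
    return len(surprising) / len(edges)

# Toy embeddings and edges, invented purely for illustration.
embeddings = {
    "attention": [1.0, 0.0],
    "transformer": [0.9, 0.1],
    "softmax": [0.8, 0.2],
    "protein folding": [0.0, 1.0],
}
edges = [
    ("attention", "transformer"),
    ("attention", "softmax"),
    ("transformer", "softmax"),
    ("attention", "protein folding"),  # structurally linked, semantically distant
]

fraction = surprising_fraction(edges, embeddings)
print(fraction)  # 0.25 for this toy graph
```

Tracking a quantity like this over successive reasoning iterations would show whether the surprising share stays roughly flat or decays toward zero as the graph grows.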

But here's the thing the question doesn't anticipate. When you scale agentic systems by adding *agents* rather than compute-per-agent, the curve bends the wrong way. Multi-agent coordination degrades predictably with network scale: agents commit too late or adopt strategies without telling neighbors, and they accept information without verifying it, so errors propagate (see "Why do multi-agent systems fail to coordinate at scale?"). And most multi-agent performance gains turn out to be a token-spending function: ~80% of the variance is budget, not coordination intelligence (see "How does test-time scaling work at the agent level?"). So scaling 'works' largely because you're spending more compute, not because the collective gets smarter. That is consistent with the discovery scaling law, but it is also a warning that adding bodies is not the same axis as adding compute.
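A claim like "~80% of the variance is budget" is typically a statement about explained variance. The sketch below shows how such a share could be computed as the R² of performance regressed on log token budget. The numbers are toy values constructed so that the result comes out at 0.8 to mirror the note's figure; they are not measurements, and the regression form is an assumption.

```python
import math

def r_squared(xs, ys):
    """R^2 of a simple linear regression of ys on xs (squared Pearson correlation)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)

# Toy runs (NOT real measurements): token budget per run and a task score.
token_budgets = [1000, 2000, 4000, 8000]
scores = [1.0, 1.0, 3.0, 3.0]

log_budgets = [math.log2(b) for b in token_budgets]
explained = r_squared(log_budgets, scores)
print(f"share of score variance explained by log budget: {explained:.2f}")
```

Whatever variance is left after conditioning on budget is the most that coordination strategy could account for, which is why comparisons between multi-agent designs are more informative when token budgets are matched.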

What this leaves you with: the scaling law for discovery does extend to agentic search and reasoning, but the gains live in the compute axis (search budget, trajectory width, self-organized novelty), not in coordination. Reliability at scale comes from externalizing memory, skills, and protocols into a harness rather than from model or population size (see "Where does agent reliability actually come from?"), and efficiency techniques across memory, tools, and planning all converge on the same structural constraints: context bounding, fewer external calls, and controlled search (see "Do efficiency techniques across agent components reveal shared structural constraints?"). The honest synthesis is that discovery scales with compute even for agents; coordination is the part that resists the curve.


Sources (10 notes)

Can computational power accelerate scientific discovery itself?

ASI-ARCH discovered 106 state-of-the-art architectures through 1,773 autonomous experiments, revealing that architectural breakthroughs scale predictably with GPU compute. This transforms research from human-limited to computation-scalable.

How does search scale like reasoning in agent systems?

Test-time scaling laws generalize from reasoning to retrieval: search steps follow identical scaling curves to reasoning tokens, making deep research a test-time scaling problem. This insight reframes search as a compute axis comparable to chain-of-thought reasoning.

Does search budget scale like reasoning tokens for answer quality?

Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.

Do search steps follow the same scaling rules as reasoning tokens?

Deep research agents improve with more search steps in a pattern mirroring the reasoning-token relationship, with both exhibiting diminishing returns. This reveals a new inference-compute axis beyond model capability alone.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Why do reasoning systems keep discovering new connections?

Analysis shows iterative graph reasoning evolves toward a stable phase where semantic entropy persistently dominates structural entropy, with ~12% of edges remaining semantically surprising despite structural connection, fueling ongoing discovery.

Why do multi-agent systems fail to coordinate at scale?

AgentsNet benchmark shows agents fail to coordinate strategies either by agreeing too late or adopting strategies without informing neighbors. Agents accept neighbor information without verification, enabling error propagation while remaining capable of detecting direct conflicts.

How does test-time scaling work at the agent level?

Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

Do efficiency techniques across agent components reveal shared structural constraints?

Techniques for memory, tool learning, and planning independently converge on shared principles: context bounding, minimizing external calls, and controlled search. This convergence suggests these reflect fundamental structural pressures in agentic computation rather than component-specific optimizations.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher re-testing whether discovery scaling laws extend to agentic systems. The question remains: does compute-scales-discovery hold beyond architecture search, and if so, where does it break?

What a curated library found — and when (dated claims, not current truth): Research spanning Feb 2025–Apr 2026 identifies these constraints:
• Autonomous architecture discovery follows predictable scaling: 106 SOTA designs across 1,773 experiments, breakthroughs correlate with GPU compute, not human cleverness (2025).
• Deep research agents exhibit test-time scaling: search budget (retrieval steps) follows the same diminishing-returns curve as reasoning tokens — a compute axis you can trade against reasoning (2025–2026).
• Multi-agent coordination degrades predictably with network scale; ~80% of multi-agent performance variance is budget-driven, not coordination intelligence (2026).
• Agent reliability scales via externalization (memory, skills, protocols into harness) rather than model or population size; efficiency gains converge on context bounding and controlled search (2026).
• Self-organized graph reasoning sustains ~12% 'semantically surprising' connections even post-linking, fueling continuous discovery (2025).

Anchor papers (verify; mind their dates):
• arXiv:2507.18074 — AlphaGo Moment for Model Architecture Discovery (2025)
• arXiv:2506.18959 — From Web Search towards Agentic Deep Research (2025)
• arXiv:2604.02460 — Single-Agent LLMs Outperform Multi-Agent Systems (2026)
• arXiv:2604.08224 — Externalization in LLM Agents (2026)

Your task:
(1) RE-TEST EACH CONSTRAINT. For multi-agent coordination degrades predictably (80% budget variance): have orchestration frameworks (e.g., ensemble caching, hierarchical delegation, retrieval-augmented multi-hop) since relaxed this? Where does the constraint still hold—pure emergent reasoning vs. tool-augmented tasks? Separately: does self-organized novelty at 12% semantic surprise persist with larger graphs or longer horizons, or does it saturate?
(2) Surface the strongest CONTRADICTING work in the last 6 months: does any recent paper show coordination *does* add intelligence orthogonal to budget, or show discovery scaling hitting hard limits in agentic regimes (e.g., hallucination under search depth, exploration collapse)?
(3) Propose 2 new questions: (a) Does scaling discovery require *different* compute budgets (search vs. reasoning tokens) for different agent archetypes (reactive retrieval vs. planning-heavy)? (b) Can you predict when to stop scaling search budget *per agent* and start scaling agents, given a task's exploration-exploitation ratio?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
