INQUIRING LINE


More compute reliably produces more AI discoveries — but does that rule hold when the discoverer is an autonomous agent?

Can the scaling law for discovery extend beyond architectures to agentic systems?

This explores whether the finding that scientific discovery (like architecture search) scales predictably with compute — the way model performance does — also holds for agentic systems that search, coordinate, and reason at test time.

The corpus says: largely yes, with one sharp caveat about coordination. The anchor result is that autonomous architecture discovery follows an empirical scaling law: ASI-ARCH found 106 state-of-the-art designs across 1,773 experiments, and breakthroughs scaled with GPU compute rather than with human cleverness (see "Can computational power accelerate scientific discovery itself?"). The interesting move is to ask whether that compute-scales-discovery pattern is a quirk of architecture search or a general property of search itself.

Several notes suggest it's general. Deep research agents exhibit a test-time scaling law in which search budget (the number of retrieval steps) follows the same diminishing-returns curve as reasoning tokens, making search a compute axis you can trade against reasoning (see "How does test-time scaling work for individual research agents?", "Does search budget scale like reasoning tokens for answer quality?", and "Do search steps follow the same scaling rules as reasoning tokens?"). So 'spend more compute, find more' isn't unique to discovering architectures; it reappears whenever an agent is exploring a solution space. Reasoning systems can even scale in width, sampling parallel latent trajectories, to cover that space without paying the serial latency of going deeper (see "Can reasoning systems scale faster by exploring parallel paths instead?").
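The trade-off between reasoning budget and search budget can be sketched with a toy model. This is a minimal illustration, not a method from any of the cited notes: the log-shaped `quality` function, the weights `a` and `b`, and the helper `best_split` are all hypothetical, chosen only because a logarithm has the diminishing-returns shape the notes describe.

```python
import math

def quality(reasoning_units: int, search_steps: int,
            a: float = 1.0, b: float = 1.0) -> float:
    """Hypothetical answer-quality model (an assumption, not a fitted law).

    Each axis contributes log-shaped, diminishing returns; a and b are
    illustrative weights for how much a task rewards each axis.
    """
    return a * math.log1p(reasoning_units) + b * math.log1p(search_steps)

def best_split(total_budget: int, a: float = 1.0, b: float = 1.0):
    """Brute-force the best split of a fixed budget between the two axes.

    Returns (quality, reasoning_units, search_steps) for the best split.
    """
    candidates = (
        (quality(r, total_budget - r, a, b), r, total_budget - r)
        for r in range(total_budget + 1)
    )
    return max(candidates)

if __name__ == "__main__":
    q, reasoning, search = best_split(10)
    print(f"equal weights: reasoning={reasoning}, search={search}, quality={q:.3f}")
```

Under these assumptions, equal weights split the budget evenly, and raising `b` shifts units toward search. The point is only that once both axes saturate, the allocation between them becomes a tunable choice.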

There's a deeper reason discovery keeps paying off rather than saturating. Agentic graph reasoning self-organizes into a critical state where roughly 12% of connections stay 'semantically surprising' even after they're structurally linked, meaning the system keeps generating novelty as it runs, fueling continuous discovery instead of collapsing into repetition (see "Why do reasoning systems keep discovering new connections?"). That's a mechanism for why a discovery scaling law could persist at the agent level: the search frontier doesn't go dry.
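One way to monitor a statistic of this kind is to count how many existing edges join nodes that are semantically far apart. The sketch below is a simplified proxy, not the measure used in the cited note, which is described in terms of semantic versus structural entropy. The function names, the cosine-similarity criterion, and the `threshold` of 0.3 are all assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def surprising_edge_fraction(edges, embeddings, threshold=0.3):
    """Share of existing edges whose endpoints are semantically distant.

    edges:      iterable of (node_a, node_b) pairs already in the graph
    embeddings: dict mapping node -> embedding vector
    threshold:  hypothetical similarity cutoff below which an edge counts
                as 'semantically surprising' (an assumption, not a
                published value)
    """
    edges = list(edges)
    if not edges:
        return 0.0
    surprising = sum(
        1 for a, b in edges
        if cosine(embeddings[a], embeddings[b]) < threshold
    )
    return surprising / len(edges)
```

Tracking this fraction over iterations would show whether it settles at a stable level or decays toward zero as the graph grows.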

But here's the thing the question doesn't anticipate. When you scale agentic systems by adding *agents* rather than compute-per-agent, the curve bends the wrong way. Multi-agent coordination degrades predictably with network scale: agents commit too late or adopt strategies without telling neighbors, and they accept information without verifying it, so errors propagate (see "Why do multi-agent systems fail to coordinate at scale?"). And most multi-agent performance gains turn out to be a token-spending function: ~80% of the variance is budget, not coordination intelligence (see "How does test-time scaling work at the agent level?"). So scaling 'works' largely because you're spending more compute, not because the collective gets smarter, which is consistent with the discovery scaling law, but also a warning that adding bodies is not the same axis as adding compute.
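A "share of variance explained by budget" figure can be checked on your own runs with a simple regression. The sketch below computes the R² of score against log token spend. The function names and the choice of a log transform are assumptions, and the cited note does not say this is how its ~80% figure was derived.

```python
import math

def r_squared(xs, ys):
    """R^2 of a simple least-squares line fit of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0  # no variation to explain
    return (sxy * sxy) / (sxx * syy)

def budget_variance_share(runs):
    """Fraction of score variance explained by log token budget.

    runs: list of (tokens_spent, score) pairs, one per system or
          configuration, with tokens_spent > 0. The log transform is
          an illustrative assumption.
    """
    log_tokens = [math.log(tokens) for tokens, _ in runs]
    scores = [score for _, score in runs]
    return r_squared(log_tokens, scores)
```

If this share stays high after you add agents, the extra agents are mostly buying tokens. A coordination effect would show up as residual variance that token spend does not explain.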

What this leaves you with: the scaling law for discovery does extend to agentic search and reasoning, but the gains live in the compute axis (search budget, trajectory width, self-organized novelty), not in coordination. Reliability at scale comes from externalizing memory, skills, and protocols into a harness rather than from model or population size (see "Where does agent reliability actually come from?"), and efficiency techniques across memory, tools, and planning all converge on the same structural constraints: context bounding, fewer external calls, controlled search (see "Do efficiency techniques across agent components reveal shared structural constraints?"). The honest synthesis is that discovery scales with compute even for agents; coordination is the part that resists the curve.

Sources (10 notes)

Can computational power accelerate scientific discovery itself?

ASI-ARCH discovered 106 state-of-the-art architectures through 1,773 autonomous experiments, revealing that architectural breakthroughs scale predictably with GPU compute. This transforms research from human-limited to computation-scalable.

How does test-time scaling work for individual research agents?

Research shows that deep research agents exhibit test-time scaling laws where search steps scale similarly to reasoning tokens, and live search outperforms memorized retrieval on knowledge-intensive tasks. Data efficiency is extreme—78 curated demonstrations outperform 10K samples for agency.

Does search budget scale like reasoning tokens for answer quality?

Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.

Do search steps follow the same scaling rules as reasoning tokens?

Deep research agents improve with more search steps in a pattern mirroring the reasoning-token relationship, with both exhibiting diminishing returns. This reveals a new inference-compute axis beyond model capability alone.

Can reasoning systems scale faster by exploring parallel paths instead?

GRAM demonstrates that recursive reasoning models should maintain and explore multiple latent trajectories in parallel, not only deepen single paths. Width-scaling avoids the serial latency penalty of depth while sampling the solution distribution more effectively on ambiguous problems.

Why do reasoning systems keep discovering new connections?

Analysis shows iterative graph reasoning evolves toward a stable phase where semantic entropy persistently dominates structural entropy, with ~12% of edges remaining semantically surprising despite structural connection, fueling ongoing discovery.

Why do multi-agent systems fail to coordinate at scale?

AgentsNet benchmark shows agents fail to coordinate strategies either by agreeing too late or adopting strategies without informing neighbors. Agents accept neighbor information without verification, enabling error propagation while remaining capable of detecting direct conflicts.

How does test-time scaling work at the agent level?

Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

Do efficiency techniques across agent components reveal shared structural constraints?

Techniques for memory, tool learning, and planning independently converge on shared principles: context bounding, minimizing external calls, and controlled search. This convergence suggests these reflect fundamental structural pressures in agentic computation rather than component-specific optimizations.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents (3.48 match)
Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets (3.39 match)
Towards a Science of Scaling Agent Systems (2.59 match)
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning (2.53 match)
Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs (2.50 match)
From Model Scaling to System Scaling: Scaling the Harness in Agentic AI (2.49 match)
LLMs Corrupt Your Documents When You Delegate (2.49 match)
When More Thinking Hurts: Overthinking in LLM Test-Time Compute Scaling (1.74 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher re-testing whether discovery scaling laws extend to agentic systems. The question remains: does compute-scales-discovery hold beyond architecture search, and if so, where does it break?

What a curated library found — and when (dated claims, not current truth): Research spanning Feb 2025–Apr 2026 identifies these constraints:
• Autonomous architecture discovery follows predictable scaling: 106 SOTA designs across 1,773 experiments, breakthroughs correlate with GPU compute, not human cleverness (2025).
• Deep research agents exhibit test-time scaling: search budget (retrieval steps) follows the same diminishing-returns curve as reasoning tokens — a compute axis you can trade against reasoning (2025–2026).
• Multi-agent coordination degrades predictably with network scale; ~80% of multi-agent performance variance is budget-driven, not coordination intelligence (2026).
• Agent reliability scales via externalization (memory, skills, protocols into harness) rather than model or population size; efficiency gains converge on context bounding and controlled search (2026).
• Self-organized graph reasoning sustains ~12% 'semantically surprising' connections even post-linking, fueling continuous discovery (2025).

Anchor papers (verify; mind their dates):
• arXiv:2507.18074 — AlphaGo Moment for Model Architecture Discovery (2025)
• arXiv:2506.18959 — From Web Search towards Agentic Deep Research (2026)
• arXiv:2604.02460 — Single-Agent LLMs Outperform Multi-Agent Systems (2026)
• arXiv:2604.08224 — Externalization in LLM Agents (2026)

Your task:
(1) RE-TEST EACH CONSTRAINT. For the claim that multi-agent coordination degrades predictably (~80% of performance variance is budget-driven): have orchestration frameworks (e.g., ensemble caching, hierarchical delegation, retrieval-augmented multi-hop) since relaxed this? Where does the constraint still hold: pure emergent reasoning or tool-augmented tasks? Separately: does self-organized novelty at ~12% semantic surprise persist with larger graphs or longer horizons, or does it saturate?
(2) Surface the strongest CONTRADICTING work in the last 6 months: does any recent paper show coordination *does* add intelligence orthogonal to budget, or show discovery scaling hitting hard limits in agentic regimes (e.g., hallucination under search depth, exploration collapse)?
(3) Propose 2 new questions: (a) Does scaling discovery require *different* compute budgets (search vs. reasoning tokens) for different agent archetypes (reactive retrieval vs. planning-heavy)? (b) Can you predict when to stop scaling search budget *per agent* and start scaling agents, given a task's exploration-exploitation ratio?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
