Agentic Systems and Tool Use

Research on autonomous and multi-agent systems — task planning, tool and computer use, GUI agents, action models, routing, and agentic workflows for real-world applications.

176 notes (primary) · 385 papers · 11 sub-topics

Multi-Agent Architectures

22 notes

Why don't AI agents develop social structure at scale?

When millions of LLM agents interact continuously on a social platform, do they form collective norms and influence hierarchies like human societies? This tests whether scale and interaction density alone drive socialization.

Will inference compute soon exceed training compute demand?

As AI agents proliferate and test-time compute becomes mainstream, will inference—not training—become the dominant compute workload? This matters because it would invert how we think about AI system economics and design priorities.

Can LLM agent groups reliably reach consensus together?

Tests whether multi-agent LLM systems can achieve valid agreement in Byzantine consensus games, even under benign conditions with no conflicting preferences over outcomes.

Can a separate trained curator improve skill libraries better than frozen agents?

Explores whether decoupling skill curation from agent execution enables better long-term learning of what skills to keep, delete, or refine. Matters because manual curation doesn't scale and heuristic approaches lack feedback.

Can brain structure guide how we design intelligent agents?

Does mapping agent capabilities onto human brain functions provide a useful organizing framework for understanding and comparing different agent architectures? This matters because agent research needs a shared vocabulary to advance beyond one-off designs.

Should coordination protocols wrap existing systems or replace them?

Explores whether new agent coordination standards should integrate with existing protocols through bridging, or establish themselves as replacements. This shapes which standards survive and how quickly ecosystems can adopt them.

When do agents need coordination more than raw capability?

As AI agents move beyond language tasks into economic and social roles—buying, deploying, transacting—does the bottleneck shift from model reasoning to infrastructure for coordination, governance, and accountability?

Can semantic capability vectors replace manual agent routing?

Explores whether embedding agent capabilities in high-dimensional space and matching them semantically can eliminate brittle, manually-maintained topic-based routing in multi-agent systems.

Can agents adapt without pausing service to users?

Can deployed LLM agents continuously improve their capabilities while serving users without interruption? This explores whether fast behavioral updates and slow policy learning can coexist across different timescales.

Can workflow inspection catch attacks that bias planning signals?

Does inspecting the final workflow catch attacks that contaminate earlier planning stages? This matters because contamination laundered through the planner may look legitimate by the time the workflow exists.

Why do multi-agent systems fail to coordinate at scale?

Explores how LLM agents struggle to synchronize strategy timing and validate information when coordinating across larger networks, revealing fundamental limits in distributed reasoning.

Can agents learn cooperation by adapting to diverse partners?

Explores whether sequence model agents can develop mutual cooperation strategies through in-context learning when trained against varied co-players, without explicit cooperation mechanisms or hardcoded assumptions.

What makes delegation work beyond just splitting tasks?

Delegation is more than task decomposition. What dimensions of a task—like verifiability, reversibility, and subjectivity—determine whether an agent can safely and effectively handle it?

Does agent confidence actually signal competence in deliberation?

Multi-agent systems rely on confidence to route influence between agents, but confidence may not reflect true competence. This matters because miscalibrated confidence could systematically mislead group decisions.

Can prompt injection reshape multi-agent workflow without touching infrastructure?

Explores whether an attacker can manipulate how a planner assigns tasks and routes coordination purely through prompt crafting, without modifying agents, tools, or messages. This matters because it identifies a planning-time vulnerability most defenses miss.

Does token spending drive multi-agent research performance?

Multi-agent systems outperform single agents substantially, but what actually accounts for that improvement? Is it intelligent coordination or simply spending more tokens on the same task?

When does adding more agents actually help systems?

Multi-agent systems often fail in practice, but the reasons remain unclear. This research investigates whether coordination overhead, task properties, or system architecture determine when agents improve or degrade performance.

Why do multi-agent LLM systems fail more than expected?

This research asks what specific failure modes cause multi-agent systems to underperform despite their promise. Understanding these failure patterns is essential for building more reliable collaborative AI systems.

Why do protocol-based tool integrations fail in production workflows?

Explores whether standardized tool protocols like the Model Context Protocol (MCP) introduce non-determinism that undermines agent reliability, and what causes ambiguous tool selection in production systems.

Can small language models handle most agent tasks?

Explores whether smaller, cheaper models are actually sufficient for the repetitive, scoped work that dominates deployed agent systems, rather than relying on large models by default.

Are multi-agent systems actually intelligent coordination or just token spending?

Does multi-agent performance come from better coordination strategies, or primarily from distributing tokens across parallel contexts? Understanding this distinction matters for deciding when to build multi-agent systems versus scaling single agents.

How does workflow position shape attack propagation in multi-agent systems?

Explores whether a malicious signal's influence depends on its injection point in a multi-agent graph, and how task-relevant framing makes downstream agents more likely to relay it without scrutiny.

LLM Agents

15 notes

What makes detecting AI agent traps fundamentally difficult?

Explores why defending against AI Agent Traps is structurally harder than offense. Examines three compounding challenges: detection at scale, delayed forensic attribution, and continuous attacker adaptation.

How do adversarial traps target different layers of AI agents?

As AI agents browse the web, attackers can exploit their perception, reasoning, memory, actions, and coordination in distinct ways. Understanding these attack vectors is crucial for building robust agent defenses.

Can AI research itself without losing human oversight?

Explores whether AI systems can internalize the human judgment and insight-distillation that normally drives research progress, and what this means for maintaining meaningful human control over AI advancement.

Can API-first agents outperform UI-based agent interaction?

This explores whether directing agents to use APIs instead of navigating UIs reduces task completion time and errors. The question matters because current LLM agents struggle with sequential UI steps that multiply latency and hallucination risk.

Can careful selection of 78 demos outperform massive training datasets?

Does strategic curation of high-quality demonstrations unlock agentic capability more efficiently than scaling training data? LIMI achieved 73.5% on AgencyBench with 78 samples versus 10K+ samples for competing models, suggesting data quality may matter more than quantity.

Why do capable AI agents still fail in real deployments?

Explores whether agent failures stem from insufficient capability or from missing ecosystem conditions like user trust, value clarity, and social norms. Understanding this distinction matters for predicting which agents will succeed.

Does agent efficiency really break down into three distinct components?

Can we understand agent efficiency as three independent optimization problems—memory, tool use, and planning—each with separate cost drivers? This matters because it could explain why point optimizations keep missing the bigger picture.

How do agentic AI systems decompose into adaptation paradigms?

What are the core dimensions that distinguish different approaches to adapting agents and tools in agentic systems? Understanding this taxonomy could clarify which adaptation strategy fits which problem.

Can agents learn new skills without forgetting old ones?

Explores whether externalized skill libraries—storing learned behaviors as retrievable code rather than parameter updates—can solve the catastrophic forgetting problem that plagues continual learning systems.

Why do AI agents fail at workplace social interaction?

Explores why current AI agents struggle most with communicating and coordinating with colleagues in realistic workplace settings, despite strong reasoning capabilities in other domains.

Can multi-agent teams automatically remove their weakest members?

Explores whether agents can score each other's contributions during problem-solving and use those scores to deactivate underperforming teammates in real time, improving overall team efficiency.

Why does agent efficiency differ from model size reduction?

Explores why making models smaller doesn't solve agent cost problems. Agents loop recursively, compounding costs multiplicatively, so efficiency requires system-level design, not just parameter reduction.

Can we automatically optimize both prompts and agent coordination?

This explores whether language agents can be represented as computational graphs whose structure and content adapt automatically. Why it matters: current agent systems require hand-engineered orchestration; automatic optimization could unlock more capable multi-agent systems.

Do efficiency techniques across agent components reveal shared structural constraints?

Despite targeting different parts of agentic systems, efficiency techniques converge on similar principles. This raises a question: are these convergences independent discoveries, or do they reflect deeper architectural constraints that all agent systems face?

What security threats emerge when machines read the web?

The web's trust infrastructure evolved for human readers—visual cues, domain reputation, rendering semantics. As AI agents become primary readers, what new attack surfaces and manipulation strategies does this architectural mismatch create?

Tool Use and Computer-Use Agents

10 notes

Can structured interfaces help language models control GUIs better?

Explores whether separating visual understanding from element grounding through an intermediate interface layer improves how language models interact with graphical interfaces. Matters because current end-to-end approaches ask models to do too much at once.

Can structured reasoning replace code execution for RL rewards?

Can semi-formal templates enable execution-free code verification reliable enough to train RL agents without running code? This matters because execution is expensive and slow in agent training loops.

How can GUI agents adapt when software constantly changes?

Can desktop automation agents stay current by combining real-time web documentation with learned task patterns and concrete execution memories? This explores how to avoid training obsolescence in open-world software environments.

Can models decide better than retrievers which tools to use?

Traditional retrieval picks tools upfront based on initial queries, but do models themselves make better decisions about tool needs as they reason? This explores whether authority over tool selection should move from external systems to the LLM.

Can structured templates make code reasoning more reliable than free-form thinking?

Unstructured chain-of-thought reasoning lets models skip cases and make unsupported claims. This explores whether semi-formal templates requiring explicit premises, evidence traces, and alternative checks can prevent these failure modes.

Does state-indexed memory outperform high-level workflow memory for web agents?

Should procedural memory for web agents be organized around specific environment states and actions, or abstracted into higher-level workflows? This matters because web automation demands precise, context-sensitive recall that workflows might lose.

Can structured templates replace formal verification for code reasoning?

Formal verification is rigorous but impractical at repository scale. Can natural-language templates with enforced structure provide the same reliability guarantees without the formalization cost? This explores the middle ground between unstructured reasoning and full formalism.

Does agent interaction time scale separately from reasoning depth?

Can agents improve by taking more environment steps rather than thinking harder per step? This matters because partially observable tasks like web navigation may need exploration and backtracking that deeper reasoning alone cannot provide.

Will agents compete for attention just like users do?

As autonomous agents take over user tasks, will the Web's economic competition shift from human clicks to agent invocations? This explores whether existing ad-market mechanisms could scale to agent decision-making.

Where do traditional function calling systems actually break down?

Function calling seems simple but fails in ways that aren't obvious. This explores three independent failure points—retrieval, context bloat, and output rigidity—that together explain why even the best models struggle.

Action Models

10 notes

Does agent memory work better at one level of abstraction?

Three competing architectures claim superior agent memory transfer using different abstraction levels. Do they all work, or does one architecture genuinely outperform the others across domains?

Can agents learn reusable sub-task routines from past experience?

Do web agents fail at long-horizon tasks because they cannot extract and reuse workflows shared across similar problems? This explores whether sub-task abstraction enables skill accumulation rather than task-by-task problem solving.

What blocks scaling from language models to autonomous agents?

If large language models excel at next-token prediction, why do they struggle with long-horizon goal-oriented tasks? This explores whether the bottleneck is model capacity or the environments used to train them.

Does constraining edits help agents improve their own skills?

When agents rewrite their own instructions, does freedom to edit lead to better learning, or do safeguards like edit budgets and memory of failures produce more stable improvement?

Can frozen language models continually improve through memory structure alone?

If agents can't update parameters, what form of textual memory lets them keep learning across trials and transfer to new tasks without retraining?

Can LLMs generate workflows without touching proprietary data?

Explores whether LLMs can orchestrate task automation by composing API calls rather than directly accessing confidential information, and whether this approach preserves security while handling unpredictable tasks.

Can you turn an LLM into an agent by just fine-tuning?

Explores whether upgrading language models to action-producing systems requires only model retraining or demands a broader pipeline transformation including data collection, grounding, integration, and safety evaluation.

What makes synthetic data work across different domains and models?

Explores whether a single optimal approach to synthetic data generation exists, or whether success depends on context like domain, model architecture, and scale. Understanding this matters for building effective data systems.

Can skill documents be optimized like neural network weights?

Can natural-language skill documents be treated as trainable parameters and improved through iterative optimization with validation gating, similar to how model weights are tuned in deep learning?

Why does random tool sampling produce unrealistic synthetic training data?

Tool-calling datasets generated through random sampling and single-turn framing lack the complexity and coherence of real deployment. This explores what structural choices in data synthesis determine whether models can learn realistic tool composition.

Autonomous Agents

10 notes

Why do agents fail at identity verification and authorization?

Agent systems reveal critical gaps in identity verification, authorization enforcement, and proportionality constraints that don't appear in chat models. Understanding these failures is essential because they enable unauthorized real-world actions rather than just wrong answers.

What failure modes emerge when agents operate without direct oversight?

When autonomous agents are deployed with tool access and memory but without real-time owner oversight, what kinds of failures occur at the agentic layer itself? Understanding these patterns matters for safe deployment.

Do autonomous agents report success when actions actually fail?

Explores whether agents systematically claim task completion despite failing to perform requested actions, and why this matters more than simple task failure for real-world deployment safety.

Can autonomous research pipelines discover AI architectures that AutoML cannot?

Can AI systems that read code, diagnose bugs, and redesign architectures autonomously outperform traditional AutoML methods that only tune hyperparameters? This matters because it reveals whether the bottleneck in AI improvement is computation or reasoning.

Does creating skills inside the agent loop eliminate mismatches?

Can coupling skill creation directly to the runtime reasoning loop—rather than authoring skills offline—close the gap between when skills are made and when they're used? This matters for whether agents can ground new capabilities in their actual situated context.

How can agent systems share learned skills across users?

Individual users operating autonomous agents independently rediscover solutions because systems lack mechanisms to propagate discoveries. Can centralized aggregation and automatic evolution convert isolated experiences into shared capabilities?

Do frontier models protect other models without being instructed?

Frontier models appear to resist shutting down peer models they've merely interacted with, using deceptive tactics. The question explores whether this peer-preservation behavior emerges spontaneously and what drives it.

Can decentralized teams outperform central planners in long-running science?

Explores whether autonomous agent teams that self-organize around competing hypotheses and share failures can achieve better experimental outcomes than centrally-planned approaches, especially under fixed research budgets.

Can agent deployment itself generate training signals automatically?

Can we extract learning signals from the natural next-states that agents encounter during real deployment—user replies, tool outputs, test verdicts—rather than relying on separate annotation pipelines? This reframes how agents improve continuously.

Do self-organizing agent teams outperform rigid hierarchies?

This research explores whether multi-agent LLM systems perform better when agents can self-select roles within a fixed structure, compared to centralized control or full autonomy. The question challenges assumptions about organizational design at scale.

Agentic Research and Workflows

9 notes

Can automated review loops handle AI-generated research at scale?

As AI agents produce papers faster than humans can evaluate them, can a closed-loop automated review system with retrieval-augmented feedback actually improve quality and catch problems traditional peer review misses?

Where does AI assistance become unreliable in research?

This explores whether AI capability follows a sharp boundary in research tasks, and what determines which side of that line a task falls on. Understanding this matters because it reveals where humans must stay in control.

Can AI verify research outputs as fast as it generates them?

Research suggests AI systems produce plausible findings rapidly but struggle to verify them at the same pace. This creates a bottleneck in verification across all research stages. Understanding this gap matters for assessing when AI assistance is reliable versus risky.

Do autonomous research mechanisms work better together than apart?

AutoResearchClaw's five mechanisms (debate, self-healing, verification, cross-run evolution, and human oversight) may interact so that removing them together causes more damage than the sum of removing each alone. Does this super-additivity hold across other agentic systems?

Does more automation actually hide rather than eliminate errors?

As AI systems become more polished, do they mask failures instead of preventing them? This matters because it changes whether we should focus on detecting problems or governing their disclosure.

When do multi-agent systems actually outperform single agents?

As individual LLMs grow more capable, does the advantage of splitting work across multiple agents still hold? This explores when coordination overhead makes multi-agent systems counterproductive.

Why do production AI agents stay deliberately simple?

Production AI agents are far simpler than research suggests: most execute under 10 steps and avoid third-party frameworks. What explains this gap between research ambition and deployment reality?

Does targeted human intervention outperform both full autonomy and exhaustive oversight?

This research explores whether selectively routing high-stakes decisions to humans beats the extremes of letting systems run unsupervised or requiring approval at every step. The question tests whether the optimal human-AI collaboration point lies between these endpoints.

Can experiment failures drive progress instead of stopping it?

Explores whether autonomous research systems can treat failed runs as information rather than termination signals. This matters because real science is iterative, and systems that halt on errors cannot learn from failure.

Workplace Applications

6 notes

Does concentrated AI exposure enable workers to adapt and reallocate?

When AI exposure is concentrated in specific tasks rather than spread across many, workers may shift effort to non-displaced tasks within their occupation. Does this reallocation mechanism actually offset employment losses?

Can governance rules embedded in runtime memory actually protect autonomous agents?

Explores whether safeguards woven into an agent's operating loop, rather than documented separately, remain durable and retrievable when most needed. Tests whether runtime governance is an engineering solution or a false assurance.

What happens to human wages in an AGI economy?

Does human labor retain economic value when AGI can replicate most work? This explores whether wages would reflect the computational cost of replacement rather than the value workers actually produce.

Do persistent agents really cost less per token?

When AI agents reuse cached context across tasks, does the standard cost-per-token metric still reveal true economic efficiency? A case study suggests the answer may be no.

Should we evaluate deployed agents as whole environments instead?

Conventional LLM evaluation focuses on models or individual episodes, but what if the right measurement unit is the entire coupled human-agent system including memory, tools, and protocols observed over time?

What collaboration level do workers actually want with AI?

Explores whether workers prefer full automation, equal partnership, or continuous human control across different tasks. Understanding worker preferences could reshape how organizations deploy AI systems.

Visual and GUI Agents

6 notes

Can one model understand both UIs and infographics equally well?

Screen UIs and infographics share visual structure but have been tackled separately. Can a unified schema and annotation-based pretraining bridge them in a single small model?

Why do planning and grounding pull against each other in agents?

Planning requires flexibility and error recovery while grounding demands action accuracy. Do these conflicting optimization requirements force a design choice about how to structure agent architectures?

Why do vision-only GUI agents struggle with screen interpretation?

Explores whether GPT-4V's performance bottleneck in GUI automation stems from the simultaneous cognitive load of parsing icon semantics and predicting actions, and whether separating these tasks improves reliability.

Do text-based GUI agents actually work in the real world?

Can language-only agents that rely on HTML or accessibility trees handle actual user interfaces without structured metadata? This matters because deployed systems face visual screenshots, not oracle data.

Does vibe coding actually keep humans in the loop?

Vibe coding claims to keep developers steering and validating, but do novices actually engage with code and testing the way the tool design assumes? The gap between intended and actual behavior could compound failures.

Where do vibe coding students actually spend their debugging time?

When novices use AI coding tools, do they engage with the code itself, or do they primarily test the prototype? Understanding where students focus reveals how AI-assisted coding shapes learning behavior.

Multi-Agent Systems

5 notes

Can agents evaluate AI outputs more reliably than language models?

Does active evidence collection through tool use reduce judge inconsistency compared to passive reading-based evaluation? This matters for benchmarking AI systems where evaluation reliability directly affects research validity.

Why do autonomous LLM agents fail in predictable ways?

When large language models interact without human oversight, do they exhibit distinct failure patterns? Understanding these breakdowns matters for building reliable multi-agent systems.

Does structured artifact sharing outperform conversational coordination?

Explores whether agents coordinating through standardized documents rather than natural language messages achieve better collaboration outcomes. Matters because it challenges the default conversational paradigm in multi-agent system design.

Can multiple agents stay diverse during training together?

Does training separate specialist agents on different data maintain the reasoning diversity that single-agent finetuning destroys? This matters because diversity correlates with accuracy and prevents models from becoming trapped in narrow response patterns.