INQUIRING LINE

How do agents discover and select which tools to invoke?

This explores how agents figure out which tools exist and pick the right one to call — and the corpus reveals a live debate over whether that choice should happen up front, on the fly, or be driven by the model itself.


The corpus frames this less as a solved retrieval problem and more as an open design question with several competing answers, the most interesting split being *when* selection happens. The traditional approach pre-retrieves a fixed tool set before the task starts, but DeepAgent argues that discovering tools dynamically during execution works better for long, multi-step tasks: the agent keeps a global view and can change strategy mid-run instead of being locked into whatever it grabbed at the outset (see "Can agents discover tools dynamically instead of pre-selecting them?"). In such tasks the tool space is often too large to enumerate in advance, so deferring the choice becomes a feature, not a compromise.
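To make the timing difference concrete, here is a minimal sketch, not DeepAgent's actual method. It assumes a hypothetical four-tool registry and uses word overlap as a stand-in for real retrieval; `search_tools`, `pre_retrieve`, and `discover_per_step` are illustrative names only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str


# Hypothetical registry; real tool spaces are far larger.
REGISTRY = [
    Tool("flight_search", "search flights between airports"),
    Tool("hotel_search", "search hotels in a city"),
    Tool("currency_convert", "convert an amount between currencies"),
    Tool("calendar_add", "add an event to the calendar"),
]


def search_tools(query: str, k: int = 1) -> list:
    """Rank tools by word overlap with the query (a stand-in for real retrieval)."""
    words = set(query.lower().split())
    scored = [(len(words & set(t.description.split())), t) for t in REGISTRY]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [tool for score, tool in ranked[:k] if score > 0]


def pre_retrieve(task: str, k: int = 2) -> list:
    """Up-front selection: one retrieval from the task text, fixed for the run."""
    return search_tools(task, k)


def discover_per_step(steps: list) -> list:
    """Dynamic selection: a fresh retrieval for each step as it arises."""
    found = []
    for step in steps:
        found.extend(search_tools(step, k=1))
    return found


task = "plan a trip: search flights and hotels"
steps = [
    "search flights to Lisbon",
    "search hotels in Lisbon",
    "convert the budget between currencies",  # need that only surfaces mid-run
]

fixed = {t.name for t in pre_retrieve(task)}
dynamic = {t.name for t in discover_per_step(steps)}
```

In this toy run the up-front set never contains `currency_convert`, because the task text gave no hint it would be needed, while the per-step search picks it up when the third step appears.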

A second thread shifts *who* does the selecting. Rather than a passive retriever matching the user's phrasing to tool descriptions, MCP-Zero lets the model emit structured tool requests itself, refining what it needs across turns as its reasoning unfolds (see "Can models decide better than retrievers which tools to use?"). This sidesteps a quiet failure mode: the mismatch between how a user casually describes a need and the formal vocabulary a tool is registered under. The model, mid-reasoning, knows better than a one-shot semantic match what it is actually after.
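The vocabulary mismatch can be illustrated with a small sketch. This is not MCP-Zero's request format: the `ToolRequest` fields, the catalog, and the word-overlap matcher are all assumptions made for illustration, and the model's output is hard-coded where a real system would parse it from generation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolRequest:
    """Hypothetical structured request the model emits mid-reasoning."""
    domain: str       # which kind of service it wants
    capability: str   # what the tool should do, in tool-like vocabulary


# Hypothetical catalog rows: (domain, tool name, formal description).
CATALOG = [
    ("filesystem", "read_file", "read the contents of a file"),
    ("filesystem", "list_directory", "list entries in a directory"),
    ("git", "git_log", "show commit history for a repository"),
]


def overlap(a: str, b: str) -> int:
    """Count shared words; a crude stand-in for semantic similarity."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def match_user_text(user_text: str) -> Optional[str]:
    """Passive baseline: one-shot match of the user's own wording."""
    best = max(CATALOG, key=lambda row: overlap(user_text, row[2]))
    return best[1] if overlap(user_text, best[2]) > 0 else None


def match_request(req: ToolRequest) -> Optional[str]:
    """Model-driven: filter by requested domain, then match the capability."""
    rows = [row for row in CATALOG if row[0] == req.domain]
    if not rows:
        return None
    best = max(rows, key=lambda row: overlap(req.capability, row[2]))
    return best[1] if overlap(req.capability, best[2]) > 0 else None


user_text = "what changed recently?"
# Stand-in for the model's emitted request after it has reasoned about the need.
request = ToolRequest(domain="git", capability="show recent commit history")

passive_choice = match_user_text(user_text)
model_choice = match_request(request)
```

Here the user's casual wording shares no words with any description, so the passive match finds nothing, while the restated request lands on `git_log`.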

But more selection freedom cuts both ways. A production-side note pushes back hard: protocol-mediated tool access (like MCP) introduced non-deterministic failures precisely through *ambiguous* tool selection and shaky parameter inference, and replacing it with explicit direct function calls, one tool per agent, restored determinism (see "Why do protocol-based tool integrations fail in production workflows?"). So the corpus contains a genuine tension: the same flexibility that helps long-horizon exploration is the thing production engineers strip out to get predictable behavior.
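A minimal sketch of what the deterministic end of that tension might look like, assuming a hypothetical billing function and agent wrapper (neither comes from the source note): the agent is bound to exactly one function, and arguments are passed explicitly instead of being inferred from free text.

```python
from dataclasses import dataclass
from typing import Callable


def get_invoice_total(invoice_id: str) -> float:
    """Hypothetical business function with an explicit signature and toy data."""
    totals = {"INV-1": 120.0, "INV-2": 75.5}
    return totals[invoice_id]


@dataclass(frozen=True)
class SingleToolAgent:
    """An agent bound to exactly one function, so there is nothing to select."""
    name: str
    tool: Callable[[str], float]

    def run(self, invoice_id: str) -> float:
        # The caller supplies the parameter explicitly; it is validated, not
        # guessed, so the same input always takes the same code path.
        if not invoice_id.startswith("INV-"):
            raise ValueError(f"unexpected invoice id: {invoice_id!r}")
        return self.tool(invoice_id)


billing_agent = SingleToolAgent(name="billing", tool=get_invoice_total)
total = billing_agent.run("INV-2")
```

The trade is visible in the shape of the code: there is no selection step to get wrong, and also no way for this agent to reach a tool nobody wired in ahead of time.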

Two adjacent ideas reframe the question entirely. One is that an agent shouldn't always reach for a tool at all: conversation analysis offers a formal account of when an agent should pause and *ask the user* instead of silently chaining tool calls and drifting from intent (see "When should AI agents ask users instead of just searching?"). The other is memory: agents can learn and store reusable sub-task routines from past runs, so "which tool" becomes "which proven workflow," with measured gains as tasks repeat (see "Can agents learn reusable sub-task routines from past experience?"). Selection here is partly a learning problem, not just a retrieval one.
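The memory idea can be sketched as a store of routines with example-specific values abstracted into slots. This is a toy illustration under stated assumptions, not Agent Workflow Memory's implementation: `WorkflowMemory`, `induce`, and `recall` are hypothetical names, and abstraction here is plain string replacement.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WorkflowMemory:
    """Toy store of sub-task routines; names and structure are illustrative."""
    routines: dict = field(default_factory=dict)

    def induce(self, subtask: str, actions: list, values: dict) -> None:
        """Save a successful action sequence with concrete values abstracted away."""
        abstracted = []
        for action in actions:
            for slot, value in values.items():
                action = action.replace(value, "{" + slot + "}")
            abstracted.append(action)
        self.routines[subtask] = abstracted

    def recall(self, subtask: str, values: dict) -> Optional[list]:
        """Return the stored routine filled in with new values, if one exists."""
        template = self.routines.get(subtask)
        if template is None:
            return None
        return [step.format(**values) for step in template]


memory = WorkflowMemory()
memory.induce(
    "search_hotels",
    ["open hotels page", "type Lisbon into city box", "click search"],
    {"city": "Lisbon"},
)

# A later task reuses the routine with a different city.
replay = memory.recall("search_hotels", {"city": "Porto"})
```

With a routine in memory, the agent's choice shifts from picking individual tools to picking a stored sequence and filling its slots.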

If you want the deeper structural view, two notes zoom out. Decoupling reasoning from tool observations (planning the tool sequence before executing, à la ReWOO) changes the selection dynamic by separating *what to call* from *what came back* (see "Can reasoning and tool execution be truly decoupled?"). Representing agents as optimizable computational graphs suggests tool-routing decisions could be tuned automatically rather than hand-designed (see "Can we automatically optimize both prompts and agent coordination?"). The thing you might not have expected: there is no consensus that more autonomy in tool choice is better. The field is actively pulling between adaptive discovery and deterministic constraint.
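Planning before executing can be sketched as a plan whose later steps refer to results that do not exist yet. This is a simplified illustration in the spirit of ReWOO, not its implementation: the `#E1`-style placeholders, the two lookup tools, and their toy data are assumptions made for the example.

```python
from typing import Callable

# Hypothetical tools with toy data; each takes a string and returns a string.
TOOLS: dict = {
    "lookup_capital": lambda country: {"France": "Paris"}.get(country, "unknown"),
    "lookup_population": lambda city: {"Paris": "2.1 million"}.get(city, "unknown"),
}

# The whole plan is written before any tool runs. "#E1" is a placeholder
# for evidence that will only exist once step one has executed.
PLAN = [
    ("#E1", "lookup_capital", "France"),
    ("#E2", "lookup_population", "#E1"),
]


def execute(plan: list) -> dict:
    """Run steps in order, substituting earlier evidence into later arguments."""
    evidence: dict = {}
    for var, tool_name, arg in plan:
        for ref, value in evidence.items():
            arg = arg.replace(ref, value)
        tool: Callable[[str], str] = TOOLS[tool_name]
        evidence[var] = tool(arg)
    return evidence


evidence = execute(PLAN)
```

Tool choice is fixed at planning time in this arrangement, so no observation can redirect it mid-run; that is the opposite bet from dynamic discovery.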


Sources (7 notes)

Can agents discover tools dynamically instead of pre-selecting them?

DeepAgent demonstrates that discovering tools as needed—rather than pre-retrieving a fixed set—enables agents to maintain global task perspective and adapt strategy mid-execution. This approach scales better for long-horizon tasks where the tool space is too large to enumerate.

Can models decide better than retrievers which tools to use?

MCP-Zero shows that letting models emit structured tool requests iteratively across conversations outperforms single-round semantic matching. The model can refine requirements progressively across domains as reasoning unfolds, bypassing colloquial-to-formal vocabulary mismatch.

Why do protocol-based tool integrations fail in production workflows?

MCP integration caused non-deterministic failures through ambiguous tool selection and parameter inference. Replacing it with explicit direct function calls and single-tool-per-agent design restored determinism. A 306-practitioner survey confirms 85% of production teams build custom agents, forgoing frameworks.

When should AI agents ask users instead of just searching?

Tool-enabled LLMs drift from user intent through silent tool chaining. Conversation analysis reveals insert-expansions—clarifying intent, scoping responses, enhancing appeal—as a formal framework for proactive user consultation that prevents misunderstanding instead of recovering from it.

Can agents learn reusable sub-task routines from past experience?

Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces 24.6% relative gain on Mind2Web and 51.1% on WebArena, with larger gains as train-test gaps widen.

Can reasoning and tool execution be truly decoupled?

ReWOO and Chain-of-Abstraction both decouple reasoning from tool responses through different mechanisms—planning-before-execution and abstract placeholders respectively—eliminating quadratic prompt growth and sequential latency while maintaining reasoning quality.

Can we automatically optimize both prompts and agent coordination?

Language agents represented as computational graphs—where nodes are operations and edges define information flow—reveal that CoT, ToT, and Reflexion are formally equivalent structures. This unified view enables automatic optimization of both node prompts and edge connectivity without manual redesign.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about agent tool discovery and selection. The question remains open: what mechanism—retrieval, learning, reasoning, or user negotiation—best governs which tools an agent invokes?

What a curated library found — and when (dated claims, not current truth): spanning 2023–2026, the corpus identified competing regimes:

• Dynamic tool discovery during execution outperforms pre-retrieved fixed sets for long tasks, enabling mid-run strategy shifts (DeepAgent, ~2025).
• Proactive model-driven tool requests (where the agent itself structures what it needs mid-reasoning) sidestep semantic mismatch between user phrasing and tool vocabularies (MCP-Zero, ~2025).
• Production workflows revert to deterministic single-tool-per-agent direct calls to eliminate non-deterministic failures from ambiguous selection and parameter inference (~2026).
• Formal conversation analysis supplies a framework for when agents should *ask the user* rather than silently chain tools (2023).
• Agent workflow memory treats selection as a learning problem, storing reusable sub-task routines from past runs (2024).

Anchor papers (verify; mind their dates): arXiv:2510.21618 (DeepAgent, 2025), arXiv:2506.01056 (MCP-Zero, 2025), arXiv:2512.08769 (Production Guide, 2026), arXiv:2409.07429 (Workflow Memory, 2024).

Your task:

(1) RE-TEST EACH CONSTRAINT. For dynamic vs. pre-retrieved, proactive vs. passive, and deterministic vs. adaptive regimes: have newer models (o1, o3, Claude 3.7+), improved prompting (chain-of-thought variants), or orchestration tooling (multi-agent loops, real-time memory, caching layers) since relaxed the production determinism–adaptivity tradeoff? Separately, has formal conversation analysis been operationalized in deployed agents, or does it remain theoretical? Flag which constraints still hold.

(2) Surface the strongest *disagreement* in the last 6 months: production engineers demand determinism; adaptive researchers prize discovery. What recent work (if any) reconciles this, or does the split persist by use-case tier?

(3) Propose 2 research questions that assume the regime may have moved: e.g., "Can learned tool-routing policies match hand-designed determinism at scale?" or "Do multi-turn user negotiation protocols outperform silent tool chains in safety-critical domains?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
