When and what should a model actually decide to delegate?
This explores two separate questions hiding inside delegation — the *when* (at which moments handing off is worth the overhead) and the *what* (which subtasks are actually good candidates) — and what the corpus says about whether models are any good at making those calls.
The corpus treats the *when* and the *what* of delegation as genuinely different problems, not one. The cleanest answer to *when* comes from work on selective deliberation: rather than agonizing over every step, an agent can sample several candidate actions and stop to deliberate (or delegate) only when those samples disagree. Agreement signals a routine step the agent can simply execute; divergence flags a genuinely uncertain decision point worth spending compute on (see "When should an agent actually stop and deliberate?"). That gives a concrete trigger: delegate at uncertainty, not on a schedule.
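The disagreement trigger can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method from the cited work: `sample_action` and `deliberate` are hypothetical callables standing in for a policy sampler and a slower deliberation (or delegation) routine, and `min_agreement` is an assumed knob for how much agreement counts as "routine".

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def majority_vote(samples: Sequence[str]) -> Tuple[str, float]:
    """Return the most common sampled action and the fraction of samples that chose it."""
    if not samples:
        raise ValueError("need at least one sampled action")
    action, count = Counter(samples).most_common(1)[0]
    return action, count / len(samples)


def choose_action(
    sample_action: Callable[[], str],        # hypothetical: draws one candidate action
    deliberate: Callable[[List[str]], str],  # hypothetical: slow path (deliberate or delegate)
    n_samples: int = 5,
    min_agreement: float = 1.0,              # assumed threshold; 1.0 means unanimous
) -> str:
    """Execute directly when samples agree; escalate only when they diverge."""
    samples = [sample_action() for _ in range(n_samples)]
    action, agreement = majority_vote(samples)
    if agreement >= min_agreement:
        return action           # routine step: no extra compute spent
    return deliberate(samples)  # uncertain step: hand the candidates to the slow path
```

Lowering `min_agreement` below 1.0 would tolerate some disagreement before escalating; where to set it is a design choice the sketch leaves open.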
The *what* turns out to be the skill models are weakest at. One sharp finding is that being good at solving fully specified problems does **not** transfer to knowing what information you are missing: models that ace complete reasoning tasks drop to 40–50% accuracy when they have to work out which clarifying question to ask (see "Can models identify what information they actually need?"). Deciding what to delegate is exactly this kind of meta-judgment about your own gaps, so the corpus quietly warns that a model's confidence about *what* to hand off is suspect. A related crack: models can articulate the right principle yet fail to execute it, because knowledge and action run on dissociated pathways (see "Can language models understand without actually executing correctly?"). A good delegator needs both, and they don't always travel together.
When models *do* get to choose, letting them choose actively beats choosing for them. Proactive tool selection, where the model emits its own structured requests for tools as reasoning unfolds, outperforms a passive retriever guessing up front what the model will need, because the need clarifies progressively (see "Can models decide better than retrievers which tools to use?"). The same logic shows up in retrieval, where a model's own partial answer reveals information gaps the original query couldn't express (see "Can a model's partial response guide what to retrieve next?"). The pattern across both: the right *what* often becomes visible only mid-task, so delegation works better as an iterative, model-driven loop than as a one-shot plan.
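The iterative loop can be illustrated with a toy retrieval example in which each draft answer is folded back into the next query. This is a sketch under assumptions, not the cited systems' implementation: `iterative_retrieve`, `toy_retrieve`, `toy_generate`, and the two-sentence `CORPUS` are all hypothetical, and the word-overlap retriever stands in for a real one.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical two-document corpus: answering the question needs both hops.
CORPUS = [
    "Marie Curie was born in Warsaw.",
    "Warsaw is the capital of Poland.",
]


def _words(text: str) -> set:
    return set(re.findall(r"[a-z]+", text.lower()))


def toy_retrieve(query: str) -> List[str]:
    """Toy retriever: return every document sharing at least one word with the query."""
    return [doc for doc in CORPUS if _words(doc) & _words(query)]


def toy_generate(question: str, evidence: List[str]) -> str:
    """Toy generator: the 'draft answer' is just the evidence gathered so far."""
    return " ".join(evidence)


def iterative_retrieve(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    rounds: int = 2,
) -> Tuple[str, List[str]]:
    """Alternate retrieval and generation, using each draft to sharpen the next query."""
    evidence: List[str] = []
    draft = ""
    query = question
    for _ in range(rounds):
        for doc in retrieve(query):
            if doc not in evidence:
                evidence.append(doc)
        draft = generate(question, evidence)
        query = f"{question} {draft}"  # the draft exposes gaps the question alone did not
    return draft, evidence


question = "In which country was Marie Curie born?"
one_round = iterative_retrieve(question, toy_retrieve, toy_generate, rounds=1)
two_rounds = iterative_retrieve(question, toy_retrieve, toy_generate, rounds=2)
```

In this toy setup the question alone shares no words with the second document, so a single round finds only the first sentence; the draft mentions Warsaw, which lets the second round reach the sentence about Poland.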
Here's the surprise worth carrying away: delegation may be valuable less for who does the work than for what it teaches. Training a model to dispatch subtasks and integrate the summarized results turns out to teach disciplined decomposition and evidence-grounding, and that skill transfers back to single-agent tasks where there is no one to delegate to at all (see "Can delegation teach models to manage context more actively?"). Deciding what to delegate is really a form of active context management. The hard part isn't routing: full multi-agent routing systems already jointly optimize topology, agent count, roles, and model assignment (see "What decisions must multi-agent routing systems optimize simultaneously?"), and the agentic-RL framing turns capabilities such as memory, planning, and reasoning into learnable subsystems by treating the model as a policy over a multi-step world rather than a single-shot generator (see "How does treating LLMs as multi-step agents change what we can optimize?"). The hard part is the self-knowledge underneath: knowing, in the moment, that you're uncertain and what you're uncertain about.
Sources (8 notes)
SAND uses self-consistency sampling to flag uncertainty: if N policy samples all match the expert action, skip deliberation; if they diverge, trigger execution-guided critiques. This step-level compute allocation lets agents deliberate only at genuinely uncertain decision points.
Models achieving high accuracy on complete reasoning tasks drop to 40-50% accuracy identifying what clarifying question to ask when one variable is withheld. Information gathering and problem execution are separable cognitive operations.
Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals this is not knowledge deficit but structural disconnect.
MCP-Zero shows that letting models emit structured tool requests iteratively across conversations outperforms single-round semantic matching. The model can refine requirements progressively across domains as reasoning unfolds, bypassing colloquial-to-formal vocabulary mismatch.
ITER-RETGEN shows that iteratively using generated responses as retrieval queries substantially improves performance on multi-hop reasoning and fact verification. Generation acts as both answer producer and information-need clarifier, surfacing implicit gaps that the original query missed.
SearchSwarm shows that training models to delegate subtasks and integrate summarized results beats passive compression, with a 30B model matching much larger ones. Critically, the delegation skill transfers to single-agent tasks, suggesting it teaches disciplined decomposition and evidence grounding, not just orchestration.
MasRouter shows that routing in multi-agent systems must jointly optimize collaboration topology, agent count, role allocation, and per-agent LLM assignment through a cascaded controller. This unified approach surpasses single-model routing by 3.51% accuracy while cutting HumanEval costs by 49%.
The Agentic RL survey shows that modeling LLMs as policies in Partially Observable MDPs rather than single-step generators makes memory, planning, and reasoning into RL-optimizable subsystems. This structural reframing explains the recent empirical convergence across memory-based agents, skill learning, and strategy distillation.