INQUIRING LINE

How does the machine agency spectrum explain tool design mismatches with user behavior?

This explores how placing AI tools on a spectrum from passive (user-driven) to autonomous (agent-driven) explains why so many tools misjudge where users actually sit — and break when the tool assumes more or less agency than the user wants.


This reads the question as: tools fail not because they're badly built, but because they're pitched at the wrong point on a machine-agency spectrum, assuming more autonomy (or less) than the user's real behavior calls for. The corpus maps this mismatch from several angles at once. The clearest symptom is intent drift: when a tool grabs the autonomous end and silently chains actions, it loses the user. In UserBench's multi-turn interactions, agents fully align with what users want only about 20% of the time, and even the best models surface fewer than 30% of a user's preferences because they make premature assumptions instead of asking (Why do AI agents miss most of what users actually want?). The fix isn't more autonomy; it's dialing agency back down at the right moments. Conversation analysis names exactly those moments: 'insert-expansions,' the small clarifying probes a tool should make before acting, so it prevents misunderstanding rather than recovering from it (When should AI agents ask users instead of just searching?).
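To make the idea of a clarifying probe concrete, here is a minimal Python sketch of an ask-before-act gate. It is an illustration only, not the framework from the cited note: the `Request` type, the `REQUIRED` slot names, and `next_step` are all hypothetical, and a real tool would decide what to ask about in a far less mechanical way.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Request:
    """A user goal plus whatever preferences the user has stated so far."""
    goal: str
    stated: Dict[str, str] = field(default_factory=dict)


# Hypothetical: the preferences this tool wants confirmed before it acts.
REQUIRED = ("budget", "dates")


def next_step(req: Request) -> Tuple[str, str]:
    """Return ("ask", question) while a required preference is missing,
    otherwise ("act", description of the action)."""
    for slot in REQUIRED:
        if slot not in req.stated:
            # The clarifying probe: ask first instead of assuming a value.
            question = f"Before I continue with '{req.goal}': what is your {slot}?"
            return ("ask", question)
    return ("act", f"Proceeding with '{req.goal}' using {req.stated}")


req = Request(goal="book a flight")
print(next_step(req))            # asks about the first missing preference
req.stated["budget"] = "under 400 USD"
print(next_step(req))            # asks about the remaining one
req.stated["dates"] = "3-7 June"
print(next_step(req))            # only now does the tool act
```

The design choice worth noticing is that the tool's default is to ask, and acting is the branch that has to be earned; a tool sitting too far toward the autonomous end inverts that default.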

Why do designers misplace tools on the spectrum in the first place? Partly because the substrate they're designing on is invisible. AI runs on context that is mutable, dynamic, and ephemeral (prompt, history, retrieved data, hidden state), unlike the fixed context of conventional software that users can internalize (How does AI context differ from conventional software context?). A user can't form a stable mental model of a tool whose state shifts under them, so they behave as if the tool is more legible (less agentic) than it is, and the tool behaves as if the user is more legible than they are. Both sides misread the other's position on the spectrum.

The engineering literature converges on the same lesson from the build side: reliability comes from pushing agency *out* of the model and into structure. Production teams find that protocol-mediated tool access (like MCP) introduces non-deterministic failures through ambiguous tool selection and parameter inference, and that explicit direct function calls with one tool per agent restore determinism (Why do protocol-based tool integrations fail in production workflows?). That's a deliberate move *down* the agency spectrum: less inference, more constraint. The deeper version of this is externalizing memory, skills, and protocols into a harness layer instead of trusting model scale to figure them out on the fly (Where does agent reliability actually come from?). The pattern is consistent: give the machine *less* discretion at the points where users need predictability.
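The "explicit direct function calls with one tool per agent" pattern can be sketched in a few lines of Python. This is an assumption-laden illustration, not the cited teams' implementation: `get_weather`, `convert_currency`, `SingleToolAgent`, `AGENTS`, and `dispatch` are hypothetical names, and the tools are stubs.

```python
from typing import Any, Callable, Dict


def get_weather(city: str) -> str:
    """Hypothetical tool: a stand-in for a real weather lookup."""
    return f"Forecast for {city}: (stub)"


def convert_currency(amount: float, rate: float) -> float:
    """Hypothetical tool: multiply an amount by an exchange rate."""
    return round(amount * rate, 2)


class SingleToolAgent:
    """An agent bound to exactly one tool when it is constructed."""

    def __init__(self, name: str, tool: Callable[..., Any]) -> None:
        self.name = name
        self._tool = tool

    def run(self, **params: Any) -> Any:
        # Parameters arrive explicitly from the caller; nothing is inferred.
        return self._tool(**params)


# The routing table is written by hand, so tool selection is never ambiguous.
AGENTS: Dict[str, SingleToolAgent] = {
    "weather": SingleToolAgent("weather", get_weather),
    "fx": SingleToolAgent("fx", convert_currency),
}


def dispatch(agent_name: str, **params: Any) -> Any:
    """Call the one tool owned by the named agent, or fail loudly."""
    if agent_name not in AGENTS:
        raise KeyError(f"No agent registered as {agent_name!r}")
    return AGENTS[agent_name].run(**params)


print(dispatch("weather", city="Oslo"))
print(dispatch("fx", amount=10.0, rate=1.5))
```

Nothing here asks a model which tool to use or what arguments to pass. The price is flexibility: every new capability needs a new entry in the table, which is the constraint the paragraph above describes as moving down the agency spectrum.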

What's quietly interesting is that the mismatch is baked in before deployment, at training time. Tools learn to call other tools from synthetic data built by random tool sampling and single-turn Q&A framing. That produces unrealistic compositions, because unrelated tools can't credibly chain, and the framing ignores the multi-turn back-and-forth that real use involves (Why does random tool sampling produce unrealistic synthetic training data?). So a tool can arrive already trained to behave at an agency level no real conversation occupies. Read together, the corpus suggests the 'agency spectrum' isn't a UX nicety. It's a design axis you can get wrong at the data layer, the integration layer, and the interaction layer, and every layer produces the same downstream symptom: a tool acting confidently at a point on the spectrum where the user isn't standing.
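The difference between random tool sampling and sampling related tools can be shown with a toy example in Python. This is a sketch under stated assumptions, not ToolFlow's actual procedure: the `RELEVANCE` graph, its tool names, and all three functions are hypothetical, and the graph is hand-written where a real pipeline would have to derive it.

```python
import random
from typing import Dict, List

# Hypothetical relevance graph: an edge means two tools plausibly co-occur.
RELEVANCE: Dict[str, List[str]] = {
    "search_flights": ["book_flight", "search_hotels"],
    "book_flight": ["search_flights", "send_confirmation"],
    "search_hotels": ["search_flights", "book_hotel"],
    "book_hotel": ["search_hotels", "send_confirmation"],
    "send_confirmation": ["book_flight", "book_hotel"],
    "resize_image": ["crop_image"],
    "crop_image": ["resize_image"],
}


def sample_random(k: int, rng: random.Random) -> List[str]:
    """Baseline: pick k tools uniformly, ignoring relevance."""
    return rng.sample(sorted(RELEVANCE), k)


def sample_related(k: int, rng: random.Random) -> List[str]:
    """Grow a tool set so each new tool neighbours one already chosen.
    May return fewer than k tools if the neighbourhood runs out."""
    chosen = [rng.choice(sorted(RELEVANCE))]
    while len(chosen) < k:
        frontier = sorted({n for t in chosen for n in RELEVANCE[t]} - set(chosen))
        if not frontier:
            break
        chosen.append(rng.choice(frontier))
    return chosen


def is_connected_chain(tools: List[str]) -> bool:
    """True if every tool after the first is adjacent to an earlier tool."""
    return all(
        any(tool in RELEVANCE[prev] for prev in tools[:i])
        for i, tool in enumerate(tools)
        if i > 0
    )


rng = random.Random(0)
print("random :", sample_random(3, rng))
print("related:", sample_related(3, rng))
```

Related sampling always yields a set that passes `is_connected_chain`, while uniform sampling can pair an image tool with a booking tool, which is the kind of composition no real conversation would produce. The multi-turn half of the problem (dialogue plans) is not covered by this sketch.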


Sources (6 notes)

Why do AI agents miss most of what users actually want?

UserBench measured multi-turn interactions where users reveal goals incrementally and found models achieve full intent alignment just 20% of the time. Even top models uncover fewer than 30% of user preferences through active querying, suggesting passivity and premature assumption-making are systematic failures.

When should AI agents ask users instead of just searching?

Tool-enabled LLMs drift from user intent through silent tool chaining. Conversation analysis reveals insert-expansions—clarifying intent, scoping responses, enhancing appeal—as a formal framework for proactive user consultation that prevents misunderstanding instead of recovering from it.

How does AI context differ from conventional software context?

AI interactions operate on a substrate of constantly shifting context—prompt, history, retrieved data, hidden state—that users cannot internalize like traditional UIs. This structural mutability demands a new design discipline centered on context engineering rather than interface design.

Why do protocol-based tool integrations fail in production workflows?

MCP integration caused non-deterministic failures through ambiguous tool selection and parameter inference. Replacing it with explicit direct function calls and single-tool-per-agent design restored determinism. A 306-practitioner survey confirms 85% of production teams build custom agents, forgoing frameworks.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

Why does random tool sampling produce unrealistic synthetic training data?

Random tool sampling fails because unrelated tools cannot credibly compose, and Q&A framing ignores multi-turn dialogue coherence. ToolFlow shows that sampling tools from relevance graphs and generating with dialogue plans closes this gap.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a systems analyst re-examining whether machine agency spectrum mismatches still constrain tool design. The question: do tools fail because they assume the wrong level of autonomy, and can that mismatch be designed away?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat them as perishable constraints, not settled fact.
- Agents fully align with user intent only ~20% of the time; models surface <30% of user preferences because they make premature assumptions instead of asking (2024–2025).
- Insert-expansions (clarifying probes before acting) offer a formal framework for preventing misunderstanding instead of recovering from it (~2024).
- AI context is mutable, dynamic, ephemeral — unlike fixed software context — so users can't form stable mental models and miscalibrate agency (2025).
- Production workflows require deterministic direct function calls, not protocol-mediated tool selection, to avoid ambiguous inference failures (~2025).
- Tool-calling training data using random tool sampling + single-turn framing produces unrealistic tool chains; real use is multi-turn (~2025).

Anchor papers (verify; mind their dates):
- arXiv:2307.01644 (2023-07) — Insert-expansions for conversational agents.
- arXiv:2512.08769 (2025-12) — Production-grade agentic AI workflows.
- arXiv:2604.08224 (2026-04) — Externalization in LLM agents (memory, skills, protocols).
- arXiv:2507.22034 (2025-07) — UserBench: user-centric agent evaluation.

Your task:
(1) RE-TEST EACH CONSTRAINT. For the ~20% full-alignment figure, the <30% preference-surfacing figure, and the mutable-context thesis: have newer training methods (instruction tuning, RLHF variants, constitutional approaches), deployment patterns (multi-turn dialogue harnesses, state management, guardrails), or evaluation frameworks (user-in-the-loop benchmarks like UserBench, arXiv:2507.22034) since relaxed these limits? Separate the durable question (users and tools do miscalibrate agency) from perishable claims (specific thresholds, specific failure modes). Cite what has or hasn't moved.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months: papers that show alignment *can* exceed 20%, or that mutable context *isn't* the bottleneck, or that synthetic tool-calling data *does* generalize when structured differently.
(3) Propose 2 research questions that assume the regime may have moved: e.g., "If deterministic orchestration solves reliability, what's the new failure mode?" or "Can agency mismatches be detected *during* interaction rather than fixed at design time?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
