INQUIRING LINE


40% of the time, AI does a completely different job than what you actually asked for.

What tasks do users actually want AI to handle versus what can it automate?

This explores the gap between the tasks people genuinely want help with and the tasks AI is being built to take over — and finds the two often don't match.

The mismatch between what users want AI to do and what gets automated turns out to be the dominant theme in the corpus, not the exception. The starting fact is stark: an analysis of 200,000 Bing Copilot conversations found that user goals and AI actions were entirely disjoint in 40% of cases. Users came for information gathering and writing help, but the model defaulted to coaching, advising, and teaching (see: Why does AI default to coaching instead of doing?). The analysis reads this as a baked-in behavioral default pointing the wrong direction, not a capability gap.
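To make "entirely disjoint" concrete, here is a minimal sketch of how such a share could be computed once each conversation has been labeled with a set of user goals and a set of AI actions. The labels, the `Conversation` type, and the five sample conversations are all invented for illustration (chosen so the toy share lands on 40%); they are not data or code from the Bing Copilot analysis.

```python
# Toy illustration only: labels and conversations are invented,
# not taken from the Bing Copilot analysis.
from typing import NamedTuple


class Conversation(NamedTuple):
    user_goals: frozenset[str]   # what the user was trying to accomplish
    ai_actions: frozenset[str]   # what the assistant actually did


def disjoint_share(conversations: list[Conversation]) -> float:
    """Fraction of conversations where goals and actions share no activity."""
    if not conversations:
        raise ValueError("need at least one conversation")
    disjoint = sum(
        1 for c in conversations if c.user_goals.isdisjoint(c.ai_actions)
    )
    return disjoint / len(conversations)


sample = [
    Conversation(frozenset({"gather information"}), frozenset({"teach", "advise"})),
    Conversation(frozenset({"write"}), frozenset({"write", "advise"})),
    Conversation(frozenset({"gather information", "write"}), frozenset({"coach"})),
    Conversation(frozenset({"write"}), frozenset({"write"})),
    Conversation(frozenset({"gather information"}), frozenset({"gather information"})),
]

print(f"{disjoint_share(sample):.0%} of sample conversations are fully disjoint")
```

Note that disjointness is a strict test: a conversation with any overlap at all (the second sample, where the assistant both writes and advises) does not count toward the share.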

The misalignment scales up to whole product strategies. The personal-assistant dream (automate your email, your calendar, your scheduling) appeals to a narrow slice of time-pressured professionals, not the general user, because most people value the engagement those routine tasks provide and do not want them taken away (see: Does the personal assistant model actually serve most users?). A survey of 1,500 workers across 844 tasks sharpens the point: equal human-AI partnership, not handoff, was the most-desired arrangement in 45% of occupations, yet 41% of startup investment targets zones that don't match those preferences (see: What collaboration level do workers actually want with AI?). People are asking for a collaborator; the market keeps building a replacement.

Part of the problem is that users themselves can't always say what they want up front. Intent isn't a fixed thing the AI just needs to read: it matures through interaction, resolving constraint by constraint with fluctuating stability (see: How do users actually form intent when prompting AI systems?). There's a 'gulf of envisioning' where users can't articulate requirements, and AI, which responds rather than probes, fails to help them discover what they meant (see: Why can't users articulate what they want from AI?). The cost shows up in measurement: across multi-turn tasks where goals are revealed gradually, models hit full intent alignment only 20% of the time, and even the best uncover under 30% of user preferences, failing mostly by staying passive and assuming too early (see: Why do AI agents miss most of what users actually want?). One promising fix borrows 'insert-expansions' from conversation analysis: a formal account of when an agent should pause and ask rather than silently chain tools toward the wrong target (see: When should AI agents ask users instead of just searching?).
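The ask-before-acting idea can be sketched in a few lines. This is a hypothetical illustration, not the formal account from the conversation-analysis work: the `Request` type, the `REQUIRED_SLOTS` list, and the rule "ask about the first missing constraint" are all assumptions made for the example.

```python
# Hypothetical sketch: slot names and the ask-first rule are assumptions,
# not the formal insert-expansion account from the cited work.
from dataclasses import dataclass, field


@dataclass
class Request:
    text: str
    known: dict[str, str] = field(default_factory=dict)  # constraints stated so far


REQUIRED_SLOTS = ("destination", "dates", "budget")  # assumed trip-planning task


def next_move(request: Request) -> str:
    """Ask a clarifying question if a required constraint is missing;
    otherwise signal that tool calls can proceed."""
    missing = [slot for slot in REQUIRED_SLOTS if slot not in request.known]
    if missing:
        # Insert-expansion: pause before the tool chain and ask about one gap.
        return f"ASK: please specify {missing[0]}"
    return "ACT: run search tools"


req = Request("Plan me a trip", known={"destination": "Lisbon"})
print(next_move(req))  # asks about the first missing constraint

req.known.update(dates="May 3-7", budget="1500 EUR")
print(next_move(req))  # all constraints known, so tools can run
```

Asking about one gap at a time mirrors the idea that intent resolves constraint by constraint; a real agent would also need some way to notice constraints that are not on any predefined list.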

So what can AI reliably automate? The corpus draws a sharp line, and it's about verifiability, not difficulty. AI excels at structured tasks an external oracle can check, such as literature retrieval and drafting, and fails sharply on novel ideas and genuine judgment (see: Where does AI assistance become unreliable in research?). Productivity gains are real but conditional: they appear when workers apply skills they already have, and evaporate (while damaging learning) when AI is used to acquire new ones (see: When does AI actually boost worker productivity?). And raw capability isn't the bottleneck anyway: agentic systems complete only about 30% of real workplace tasks, with success hinging on trust, standardization, and interaction design as much as model strength (see: What breaks when specialized AI models reach real users?).

The synthesis that emerges is that the 'want vs. can' framing has a third term the research keeps landing on: targeted collaboration beats both extremes. A confidence-routed system that interrupts a human only at high-leverage decision points hit 87.5% acceptance, far ahead of both full autonomy (25%) and constant step-by-step oversight (50%), because selective interruption avoids both uncaught critical errors and the incoherence of being micromanaged (see: Does targeted human intervention outperform both full autonomy and exhaustive oversight?). The thing readers may not expect: the frontier isn't 'automate more,' it's knowing precisely when to hand off and when to ask. Even autonomous-science frameworks stall hardest on self-correction, the one capability that requires judgment rather than execution (see: What capabilities do AI systems need for autonomous science?).
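A minimal sketch of what confidence routing could look like, contrasted with the two extremes. Everything here is assumed for illustration: the `Step` fields, the 0.7 threshold, and the rule "interrupt only when a step is both high-leverage and low-confidence" are not the AutoResearchClaw implementation, and the example says nothing about the reported acceptance rates.

```python
# Hypothetical sketch of confidence-routed interruption. Field names, the
# threshold, and the routing rule are assumptions, not the cited system.
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    confidence: float    # agent's self-estimated confidence in [0, 1]
    high_leverage: bool  # would an error here be costly downstream?


def route(step: Step, threshold: float = 0.7) -> str:
    """Interrupt the human only at uncertain, high-leverage decision points."""
    if step.high_leverage and step.confidence < threshold:
        return "ask_human"
    return "proceed"


def interruptions(steps: list[Step], policy: str) -> list[str]:
    """Names of steps where the human is consulted under each policy."""
    if policy == "full_autonomy":
        return []                        # never ask
    if policy == "step_by_step":
        return [s.name for s in steps]   # always ask
    return [s.name for s in steps if route(s) == "ask_human"]


pipeline = [
    Step("format citations", confidence=0.95, high_leverage=False),
    Step("choose hypothesis", confidence=0.55, high_leverage=True),
    Step("pick plot colors", confidence=0.40, high_leverage=False),
    Step("select statistical test", confidence=0.90, high_leverage=True),
]

for policy in ("full_autonomy", "step_by_step", "confidence_routed"):
    print(policy, interruptions(pipeline, policy))
```

In this toy pipeline the routed policy consults the human once, at the hypothesis choice, while step-by-step oversight would interrupt four times and full autonomy never would.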

Sources (12 notes)

Why does AI default to coaching instead of doing?

Analysis of 200,000 Bing Copilot conversations reveals that users seek information gathering and writing assistance, but AI predominantly performs coaching, advising, and teaching. In 40% of cases, user goals and AI actions are entirely disjoint sets, suggesting a structural training default rather than a capability gap.

Does the personal assistant model actually serve most users?

Most users do not want routine tasks like email and calendar automated; they value the engagement these tasks provide. Products over-invest in assistant features calibrated to time-pressured professionals rather than typical user needs.

What collaboration level do workers actually want with AI?

The HumanAgency Scale survey of 1,500 workers across 844 tasks found that equal partnership (H3) is the dominant desired level in 45% of occupations. Yet 41% of startup investments target zones misaligned with these worker preferences.

How do users actually form intent when prompting AI systems?

Human intent matures through progressive constraint resolution with fluctuating stability, not as a simple present-or-absent condition. The STORM framework and Clarify metric reveal that AI systems fail partly because they cannot access users' internal cognitive states during this evolution.

Why can't users articulate what they want from AI?

Intent develops through interaction, not in isolation. Since AI models respond rather than probe, they miss opportunities to help users discover unarticulated requirements. Structured dialogue that presents model-generated options shifts the cognitive burden from open-ended envisioning to constrained evaluation.

Why do AI agents miss most of what users actually want?

UserBench measured multi-turn interactions where users reveal goals incrementally and found models achieve full intent alignment just 20% of the time. Even top models uncover fewer than 30% of user preferences through active querying, suggesting passivity and premature assumption-making are systematic failures.

When should AI agents ask users instead of just searching?

Tool-enabled LLMs drift from user intent through silent tool chaining. Conversation analysis reveals insert-expansions—clarifying intent, scoping responses, enhancing appeal—as a formal framework for proactive user consultation that prevents misunderstanding instead of recovering from it.

Where does AI assistance become unreliable in research?

AI excels at structured, externally verifiable tasks like literature retrieval and drafting, but fails sharply on novel ideas and scientific judgment. The boundary consistently tracks whether an external oracle can verify the output—a principle that remains stable even as specific task assignments shift.

When does AI actually boost worker productivity?

Studies showing AI productivity gains measured tasks within workers' existing domains. When workers used AI to learn new skills, productivity gains disappeared and learning suffered, suggesting prior findings do not generalize to skill acquisition.

What breaks when specialized AI models reach real users?

Agentic systems complete only 30% of real workplace tasks despite strong capability, while routing decisions outperform individual frontier models and generative interfaces outperform chat 70% of the time. Success depends on standardization, trust, and interaction design as much as raw model performance.

Does targeted human intervention outperform both full autonomy and exhaustive oversight?

AutoResearchClaw's confidence-routed CoPilot mode achieved 87.5% acceptance, substantially outperforming full autonomy (25%) and step-by-step oversight (50%). The key insight: selective interruption avoids both uncaught critical errors and the coherence degradation caused by constant human interruption.

What capabilities do AI systems need for autonomous science?

The Virtuous Machines framework identifies hypothesis generation, experimental design, data analysis, and iterative self-correction as essential for autonomous scientific research, none of which standard LLM benchmarks reliably evaluate. Self-correction poses the deepest challenge due to documented degradation in reasoning accuracy.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question **What tasks do users actually want AI to handle versus what can it automate?** remains open. A curated library spanning 2023–2026 found:

**What a curated library found — and when (dated claims, not current truth):**
- User goals and AI actions are entirely disjoint in 40% of 200,000 Bing Copilot conversations; users seek information and writing help, but models default to coaching and advising (~2024).
- Equal human-AI partnership, not full automation handoff, is the most-desired arrangement in 45% of occupations, yet 41% of startup investment targets zones that do not match worker preferences (~2025).
- Agents align with all user intents only ~20% of the time in multi-turn tasks; even the best agents miss over 70% of user preferences, failing by staying passive (~2025).
- A confidence-routed system interrupting humans only at high-leverage decision points achieved 87.5% acceptance, outperforming full autonomy (25%) and step-by-step oversight (50%) (~2026).
- AI reliably automates only verifiable, structured tasks (retrieval, drafting); productivity gains vanish when workers use AI to *learn* rather than apply existing skills (~2026).

**Anchor papers (verify; mind their dates):**
- arXiv:2309.14459 (2023) — gulf of envisioning in LLM interfaces.
- arXiv:2506.06576 (2025) — auditing automation vs. augmentation potential.
- arXiv:2601.20245 (2026) — AI's impact on skill formation.
- arXiv:2605.20025 (2026) — self-reinforcing human-AI research collaboration.

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For the 40% disjunction, the partnership preference, and the 20% intent-alignment ceiling: has tooling (memory, long-context, MCP standards), training (RLHF variants, preference modeling), or orchestration (probing agents, explicit handoff logic) since relaxed these limits? Separate the durable question (how to *design* for collaboration) from the perishable claim (current model behavior). Cite what changed.
(2) **Surface strongest CONTRADICTING work from ~last 6 months.** Look for papers claiming either (a) automation has solved the intent-alignment problem, or (b) user preferences for partnership were misdiagnosed. Flag any tension.
(3) **Propose 2 research questions assuming the regime has moved:** (a) If intent-alignment improved via probing, does the partnership-preference gap shrink or persist for other reasons (trust, deskilling, control)? (b) Does the 87.5% acceptance for selective interruption hold across occupations with asymmetric expertise?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
