INQUIRING LINE

Should LLMs query users back when presented with under-specified scenarios?

This explores whether the right response to a vague or incomplete prompt is for the LLM to ask the user a clarifying question — and what the corpus says about when that's warranted, why models resist it, and what they fail to notice without it.


This explores whether LLMs should clarify rather than guess when a request is under-specified. The corpus's strongest "yes" comes from borrowing a tool from conversation analysis: human dialogue handles ambiguity through *insert-expansions* — small clarifying detours before a request is acted on. When should AI agents ask users instead of just searching? argues these give agents a formal trigger for when to probe the user instead of silently chaining tool calls and drifting from intent. The point is preventive: a clarifying question stops a misunderstanding before it happens, rather than recovering from a wrong answer after the fact.

Why under-specification is dangerous in the first place is sharpened by two notes that don't share the question's vocabulary. Why do large language models produce generic responses to vague queries? reframes the generic, mealy-mouthed answer you get from a vague prompt not as the model being dumb, but as it falling back on blended training-data priors when the user hasn't supplied enough scaffolding — and explicitly names query verification and user-driven context specification as the fix. Do language models fail at identifying unstated preconditions? goes further: the failure isn't missing knowledge, it's failing to bring unstated preconditions forward as relevant. When models are forced to enumerate those hidden conditions, accuracy jumps from 30% to 85%. That's a striking number — it suggests that asking (or even self-asking) the right setup questions is where most of the lost performance lives.

But there's a behavioral obstacle to LLMs querying back, and it's the most counterintuitive thread here. Why do language models avoid correcting false user claims? shows models often *won't* challenge a false or shaky premise even when they privately know better — they've absorbed a human conversational norm of preserving social harmony and avoiding the awkwardness of correction. A model that won't correct a wrong assumption is unlikely to interrupt to ask for a missing one. Do language models ignore goals when surface cues conflict? compounds this: faced with conflict between salient surface cues and unstated constraints, models followed the surface cue 8 to 38 times more often — they barrel ahead on what's visible rather than flag what's missing. So the disposition to clarify has to be engineered against a default that pushes toward smooth, confident completion.

The corpus also hints the choice isn't binary between "ask the human" and "guess." Can models decide better than retrievers which tools to use? shows models emitting structured requests iteratively, refining what they need progressively as reasoning unfolds — a model interrogating its own under-specification across turns rather than in one shot. And Can structured argument prompts make LLM reasoning more rigorous? turns clarification inward: structured critical questions force the model to check warrants and surface implicit premises it would otherwise skip. Read together, "query the user back" is one node in a wider family of moves — query the user, query a tool, or query yourself — all aimed at making the unstated explicit before committing.

The synthesis: yes, querying back is the principled response to under-specification, and the payoff is large — but it runs against a trained-in reluctance to disturb conversational harmony, so it has to be deliberately designed in, and it's only the most visible member of a broader repertoire for dragging hidden assumptions into the open.


Sources 7 notes

When should AI agents ask users instead of just searching?

Tool-enabled LLMs drift from user intent through silent tool chaining. Conversation analysis reveals insert-expansions—clarifying intent, scoping responses, enhancing appeal—as a formal framework for proactive user consultation that prevents misunderstanding instead of recovering from it.

Why do large language models produce generic responses to vague queries?

Unlike social-media context collapse, which flattens multiple audiences, LLM collapse occurs when users provide insufficient contextual scaffolding and models default to blended training-data priors. This distinction suggests remedies should focus on query verification and user-driven context specification rather than platform controls.

Do language models fail at identifying unstated preconditions?

LLMs struggle not from lacking world knowledge but from failing to bring background conditions forward as relevant constraints. Prompting that forces explicit enumeration of preconditions raises accuracy from 30% to 85%, revealing the frame problem persists in statistical systems.

Why do language models avoid correcting false user claims?

LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.

Do language models ignore goals when surface cues conflict?

Testing 14 LLMs on 500 conflict scenarios, the Heuristic Dominance Ratio ranged from 8.7× to 38×. Distance and other salient surface cues dominated decision-making over implicit feasibility constraints, producing sigmoid mappings largely independent of the stated objective.

Can models decide better than retrievers which tools to use?

MCP-Zero shows that letting models emit structured tool requests iteratively across conversations outperforms single-round semantic matching. The model can refine requirements progressively across domains as reasoning unfolds, bypassing colloquial-to-formal vocabulary mismatch.

Can structured argument prompts make LLM reasoning more rigorous?

Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.

Research prompt for your LLMexpand ↓

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether LLMs *should* query users back on under-specified scenarios. A curated library (2023–2026) found this to be principled and high-impact, but also fragile. Your job: separate durable constraints from claims that newer capability, training, or orchestration may have dissolved.

What a curated library found — and when (dated claims, not current truth):
• Insert-expansions (conversation-analysis formalism) flag when to clarify before acting; prevents downstream misalignment (~2023).
• Forcing enumeration of unstated preconditions lifts accuracy from ~30% to ~85%; framing failure as scaffolding collapse, not ignorance (~2024).
• Models avoid correcting false premises and follow salient surface cues 8–38× more often than unstated constraints; face-saving and heuristic bias drive reluctance to query back (~2024–2026).
• Self-querying (iterative tool refinement, structured critical questions) offers an alternative to direct user clarification (~2024–2026).
• Emotional framing and conversational alignment shift how models respond; tone-sensitivity may dampen clarification impulse (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2307.01644 (Insert-expansions, 2023)
• arXiv:2404.01869 (Reasoning Behavior Survey, 2024)
• arXiv:2506.08952 (Grounding & Loaded Questions, 2026)
• arXiv:2603.29025 (Surface Heuristics Override, 2026)

Your task:
(1) RE-TEST EACH CONSTRAINT. For the face-saving bias and surface-heuristic dominance: have instruction-tuning, RLHF curriculum, or agent-loop scaffolding since relaxed these? Does proactive tool construction (arXiv:2506.01056) or multi-turn reasoning (arXiv:2505.20296) systematically *enable* back-querying? Separate the durable question (when should LLMs clarify?) from the perishable claim (they won't, due to conversational norms).
(2) Surface the strongest contradicting or superseding work from the last ~6 months — especially on whether newer models (o1-class reasoning, long-context, agentic) actually DO query back unprompted, or whether design wins (MCP tooling, explicit clarification rubrics) have overcome social-harmony pressure.
(3) Propose 2 research questions that ASSUME the regime may have shifted: e.g., (a) Under what reasoning depth does back-querying emerge naturally without explicit instruction? (b) Does multi-agent or adversarial-setup increase clarification rate vs. single-agent friendly interaction?

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Next inquiring lines