INQUIRING LINE

How does ambiguity detection connect to models' ability to ask clarifying questions?

This explores whether a model noticing that something is unclear (ambiguity detection) is the same skill as — or a prerequisite for — actually asking a good clarifying question, and where the two come apart.


This explores whether a model noticing that something is unclear (ambiguity detection) is the same skill as actually asking a good clarifying question, and where the two come apart. The short version from the corpus: detection and asking are separable abilities, and most models are weak at the first and untrained at the second. On detection alone, the picture is grim — GPT-4 correctly disambiguates only 32% of deliberately ambiguous cases against 90% for humans, because it can't hold several interpretations in mind at once Can language models recognize when text is deliberately ambiguous?. So before a model can ask about ambiguity, it often fails to register that ambiguity exists.

But detection turns out to be the easier half. One of the sharpest findings here is that being good at solving a problem does not transfer to noticing what's missing from it: models that ace complete reasoning tasks fall to 40–50% when one variable is withheld and they must figure out what to ask Can models identify what information they actually need?. Information-gathering and problem-execution are different cognitive operations Can models identify what information they actually need?. The same gap shows up as a behavioral failure mode — reasoning models will overthink an ill-posed question for paragraphs rather than say 'this can't be answered,' because training rewards producing reasoning steps but never teaches when to disengage Why do reasoning models overthink ill-posed questions?.

Why don't models just ask? The corpus points to the reward structure, not a missing capability. Standard RLHF optimizes for the immediately helpful next turn, which actively discourages clarifying questions — answering now scores better than asking now even when asking would lead to a better outcome later Why do language models respond passively instead of asking clarifying questions?. Change the objective and the behavior appears: training for proactive critical thinking lifted the rate of correctly flagging flawed problems from near-zero to 74% Can models learn to ask clarifying questions instead of guessing?, and reframing training so conversation is treated as an information source makes clarifying questions emerge even though the model was only ever trained on fully-specified problems Can models learn to ask clarifying questions without explicit training?.

The most interesting twist is that detecting ambiguity and asking a *good* question are themselves separate. It's not enough to fire off any question — quality has structure. The ALFA framework breaks question quality into theory-grounded attributes like clarity, relevance, and specificity, and training on those attributes beats training on a single quality score, especially in clinical reasoning where the right question changes the decision Can models learn to ask genuinely useful clarifying questions?. And small models can be pushed toward better detection through structure rather than scale: a leader-follower debate where one agent proposes interpretations and others challenge them got Mistral-7B to 76.7% on ambiguity detection Can structured debate roles help small models detect ambiguity?.

So the chain is really three links, each a distinct failure point: register that the input is ambiguous, decide it's worth asking instead of guessing, and frame a question that actually closes the gap. A related thread worth pulling is *why* models barrel ahead on vague input — 'context collapse,' where instead of asking, the model fills the silence with blended training-data priors Why do large language models produce generic responses to vague queries?. Calibration sits underneath all of it: models that know when they're uncertain enough to abstain can match models ten times their size Can models learn to abstain when uncertain about predictions? — and abstaining is the quiet cousin of asking.


Sources 10 notes

Can language models recognize when text is deliberately ambiguous?

AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.

Can models identify what information they actually need?

Models achieving high accuracy on complete reasoning tasks drop to 40-50% accuracy identifying what clarifying question to ask when one variable is withheld. Information gathering and problem execution are separable cognitive operations.

Why do reasoning models overthink ill-posed questions?

Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.

Why do language models respond passively instead of asking clarifying questions?

CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.

Can models learn to ask clarifying questions instead of guessing?

Reinforcement learning training increased proactive critical thinking accuracy from 0.15% to 73.98% on deliberately flawed math problems. Notably, inference-time scaling degraded this ability in untrained models but improved it after RL training, suggesting the capability is learnable but fragile without explicit training.

Can models learn to ask clarifying questions without explicit training?

Models trained via SML on complete problems generalize to underspecified tasks by asking for needed information and delaying answers. The training paradigm instills a meta-strategy of using conversation as an information source, addressing the premature-answering failure mode.

Can models learn to ask genuinely useful clarifying questions?

The ALFA framework breaks down question quality into theory-grounded attributes (clarity, relevance, specificity) and trains models on 80K attribute-specific preference pairs. Attribute-specific optimization outperforms single-score training, especially in clinical reasoning where asking the right clarifying question directly impacts decision quality.

Can structured debate roles help small models detect ambiguity?

Mistral-7B achieved 76.7% accuracy in ambiguity detection through a protocol where a leader proposes interpretations and two followers challenge them with rotating roles. Role rotation and consensus forcing prevent persuasive framing failures and create stronger verification than pairwise debate.

Why do large language models produce generic responses to vague queries?

Unlike social-media context collapse, which flattens multiple audiences, LLM collapse occurs when users provide insufficient contextual scaffolding and models default to blended training-data priors. This distinction suggests remedies should focus on query verification and user-driven context specification rather than platform controls.

Can models learn to abstain when uncertain about predictions?

Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.

Research prompt for your LLMexpand ↓

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about ambiguity detection and clarifying-question generation in LLMs. The question remains open: are these two capabilities linked, separable, or trainable as one unit?

What a curated library found — and when (findings span 2023–2026; treat as dated claims, not current truth):
• Detection and asking are separable skills; GPT-4 correctly disambiguates only 32% of ambiguous cases vs. 90% human baseline, failing to hold multiple interpretations (2023).
• Problem-solving ability does NOT transfer to noticing missing information: models achieving 90%+ on complete reasoning tasks drop to 40–50% when asked to identify what's withheld (2024).
• Standard RLHF actively discourages clarifying questions by rewarding immediate answers; reframing training objectives lifted flagging-flawed-problems rate from near-zero to 74% (2025).
• Question quality is itself decomposable: training on theory-grounded attributes (clarity, relevance, specificity) outperforms single-score training; multi-agent debate boosted Mistral-7B to 76.7% ambiguity detection (2025).
• Models "barrel ahead" on vague input via context collapse and miscalibration; abstention (refusing to answer) correlates with question-asking ability (2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2304.14399 — We're Afraid Language Models Aren't Modeling Ambiguity (2023)
• arXiv:2507.12370 — Beyond Single Models: Enhancing LLM Detection of Ambiguity through Debate (2025)
• arXiv:2602.07338 — Intent Mismatch Causes LLMs to Get Lost in Multi-Turn Conversation (2026)

Your task:
(1) RE-TEST EACH CONSTRAINT. Has newer training (DPO, IPO, online RL), larger context windows, or tool-use orchestration since relaxed the 32% / 40–50% detection gaps? Does multimodal grounding or retrieval-augmented reasoning change how models register ambiguity? Separate the durable question (likely: *training objective* is the bottleneck, not raw capability) from perishable limitations (detection ceiling, small-model debate gains). Cite what resolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Have any papers shown that end-to-end training on dialogue where clarifying-question-asking is *rewarded early* dissolves the detection-asking gap? Flag if newer work questions the "separability" claim.
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) If models now reliably detect ambiguity under the right training, what *new* failure mode emerges in the asking phase? (b) Does multi-agent scaffolding (debate, critique) generalize to open-domain conversations, or only synthetic ambiguity tasks?

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Next inquiring lines