Can models identify what information they actually need?
When a reasoning task is missing a key piece of information, can language models recognize what's absent and ask the right clarifying question? QuestBench tests this capability directly.
QuestBench formalizes a capability that real-world deployment requires but most benchmarks leave unmeasured: working out which piece of information an underspecified task is missing, and asking for it.
The benchmark presents reasoning tasks (logic, planning, math) where exactly one piece of information is withheld. The model must select the correct clarification question from multiple options. The key finding: while current models excel on math variants (GSM-Q, GSME-Q), they achieve only 40-50% accuracy on Logic-Q and Planning-Q.
The critical insight is the separability result: models that solve the fully-specified version of a problem still fail to identify the right question when one variable is missing. Problem-solving capability and information-gathering capability are distinct cognitive operations. The ability to execute reasoning when all inputs are present does not transfer to recognizing which input is absent.
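The separability claim can be read as a conditional accuracy: among problems the model solves when fully specified, how often does it pick the right clarifying question once one variable is withheld? The sketch below is one way to compute that number. The `Record` type, its field names, and the toy records are assumptions made for illustration; they are not QuestBench's data format or results.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    """One benchmark problem, scored twice (hypothetical layout)."""
    solved_full: bool  # correct answer on the fully specified version
    asked_right: bool  # correct clarifying question on the underspecified version


def question_accuracy_given_solved(records: List[Record]) -> Optional[float]:
    """Question-selection accuracy restricted to problems the model can solve.

    Returns None when the model solved no fully specified problem,
    because the conditional accuracy is undefined in that case.
    """
    solved = [r for r in records if r.solved_full]
    if not solved:
        return None
    return sum(r.asked_right for r in solved) / len(solved)


# Made-up records, for illustration only.
toy_records = [
    Record(solved_full=True, asked_right=True),
    Record(solved_full=True, asked_right=False),
    Record(solved_full=True, asked_right=False),
    Record(solved_full=False, asked_right=False),
]
toy_score = question_accuracy_given_solved(toy_records)
```

If the two capabilities were the same operation, this value would sit near 1.0; a value well below that on real data is what the separability result describes.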
This extends "Why do reasoning models overthink ill-posed questions?" from a complementary angle. That note documents the behavioral response to missing information (overthinking, redundant self-doubt). This note documents the diagnostic failure: models cannot identify what is missing, let alone respond appropriately. Together they describe a two-part deficit:
- Cannot detect what information is needed (QuestBench)
- Cannot disengage when information is absent (missing premises overthinking)
The connection to "Can language models recognize when text is deliberately ambiguous?" is structural: both involve recognizing that the current input is insufficient for a definitive answer. Ambiguity recognition asks "is this input multiply interpretable?" while information gathering asks "is this input incomplete?" Both require meta-reasoning about the input rather than reasoning within it.
The formalization as a constraint satisfaction problem (CSP) with missing variable assignments is useful: it defines information gathering as identifying the minimal necessary question, which is a well-defined optimization target. This separates the problem from subjective clarification tasks where multiple valid questions exist.
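A minimal sketch of that idea follows, assuming a toy encoding in which each rule says a variable can be computed once all of its inputs are known. The rule format, the function names, and the pricing example are hypothetical and are not QuestBench's actual representation. The "right question" is the unknown leaf variable whose value alone makes the target derivable.

```python
from typing import Dict, FrozenSet, List, Set

# Hypothetical encoding: variable -> alternative input sets that determine it.
Rules = Dict[str, List[FrozenSet[str]]]


def derivable(known: Set[str], rules: Rules) -> Set[str]:
    """Forward-chain from `known` to every variable that can be computed."""
    closure = set(known)
    changed = True
    while changed:
        changed = False
        for var, alternatives in rules.items():
            if var not in closure and any(inputs <= closure for inputs in alternatives):
                closure.add(var)
                changed = True
    return closure


def leaf_variables(rules: Rules) -> Set[str]:
    """Variables that appear only as inputs, so they can only be asked for."""
    inputs = set().union(*(alt for alts in rules.values() for alt in alts))
    return inputs - set(rules)


def sufficient_questions(target: str, known: Set[str], rules: Rules) -> List[str]:
    """Unknown leaf variables whose value alone makes `target` derivable."""
    if target in derivable(known, rules):
        return []  # already well-specified: nothing needs to be asked
    candidates = leaf_variables(rules) - known
    return sorted(v for v in candidates if target in derivable(known | {v}, rules))


# Toy problem: total = f(price, qty), price = g(base, tax); weight is a distractor.
rules: Rules = {
    "total": [frozenset({"price", "qty"})],
    "price": [frozenset({"base", "tax"})],
    "weight": [frozenset({"density", "volume"})],
}
print(sufficient_questions("total", {"qty", "base"}, rules))  # ['tax']
```

Restricting candidates to leaf variables is a design choice in this sketch: it rules out trivially asking for the target or for an intermediate value, so the toy problem has exactly one correct question while `density` and `volume` act as distractors.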
Inquiring lines that use this note as a source (20)
This note is a source for these synthesized inquiries.
- Can AI systems identify important unanswered questions that emerge during reasoning?
- Can models identify what information they are missing in underspecified problems?
- How do humans decide which level of clarification to request?
- Can models identify information gaps without just guessing or refusing to answer?
- What makes some clarifying questions more useful than others?
- What specific information must be exported from the language system?
- How does ambiguity detection connect to models' ability to ask clarifying questions?
- Can models learn to identify what information is missing from questions?
- Do reasoning models overthink ill-posed questions instead of recognizing incompleteness?
- Can models identify what information they are missing in underspecified tasks?
- How does proactive critical thinking enable models to identify missing information?
- Can language models ask clarifying questions when sentences are ambiguous?
- When should a system decide to retrieve versus reason alone?
- What training approach enables models to proactively request clarification?
- Can models distinguish between ambiguous and incomplete information inputs?
- Can models learn to stop thinking when a question lacks necessary information?
- How does proactive critical thinking detect when information is incomplete?
- Can reasoning models reject ill-posed questions or do they overthink?
- How can agents detect missing information before attempting to solve problems?
- Which types of clarifying questions actually help users versus wasting their time?
Related concepts in this collection (11)
- Why do reasoning models overthink ill-posed questions?
  Explores why models trained for extended reasoning produce drastically longer, less useful responses to unanswerable questions, and whether this represents a fixable training deficit or inherent limitation.
  behavioral response to missing info; this is the diagnostic failure
- Can language models recognize when text is deliberately ambiguous?
  Explores whether LLMs can identify and handle multiple valid interpretations in a single phrase, a core human language skill that appears largely absent in current models despite their fluency on standard tasks.
  shared structure: recognizing input insufficiency
- Does reasoning fine-tuning make models worse at declining to answer?
  When models are trained to reason better, do they lose the ability to say 'I don't know'? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.
  reasoning training suppresses both abstention and information gathering
- Why do LLMs struggle to connect unrelated entities speculatively?
  LLMs reliably organize and summarize evidence but fail when asked to speculate about connections between dissimilar entities. Understanding this failure could reveal fundamental limits in how models handle complex analytical reasoning.
  evidence organization (well-specified) vs hypothesis generation (underspecified) is the same split
- Can models learn to ask clarifying questions instead of guessing?
  Exploring whether large language models can be trained to detect incomplete queries and actively request missing information rather than hallucinating answers or refusing to respond. This matters because conversational agents today remain passive, responding only when prompted.
  proactive critical thinking is the trainable solution to the information-gathering deficit: RL training raises missing-information detection from 0.15% to 73.98%, directly addressing the capability gap QuestBench identifies
- How do users actually form intent when prompting AI systems?
  Users face a 'gulf of envisioning': they must simultaneously imagine possibilities and express them to language models. This cognitive gap creates breakdowns not from AI incapability but from users struggling to articulate what they truly need.
  intent maturation requires recognizing what information is missing from underspecified user expressions, which is exactly the capability QuestBench shows models lack
- Why do language models lose performance in longer conversations?
  Does multi-turn degradation stem from fundamental model limitations, or from misalignment between what users mean and what models assume? Understanding the root cause could guide better solutions.
  the Mediator-Assistant architecture addresses the QuestBench deficit by separating intent understanding (where missing-information detection is needed) from task execution (where well-specified reasoning suffices)
- Does training objective determine which direction models fail at abstention?
  Calibration failures might not be universal: different training approaches could push models toward opposite extremes of refusing or overconfidently answering. Understanding whether the training objective, not just model capability, drives these failures could reshape how we think about fixing them.
  under-abstention compounds the underspecification problem: reasoning-trained models are both unable to identify missing information (this note) and trained to force answers regardless (that note), creating a compound failure on underspecified inputs
- Why do language models fail in gradually revealed conversations?
  Explores why LLMs perform 39% worse when instructions arrive incrementally rather than upfront, and whether they can recover from early mistakes in multi-turn dialogue.
  the conversational manifestation of the information-gathering deficit: when instructions arrive gradually (the normal case), models that cannot identify what's missing make premature assumptions instead, producing the 39% multi-turn degradation
- Why do users drift away from their original information need?
  When users know their knowledge is incomplete but cannot articulate what's missing, do they unintentionally shift topics? And can real-time systems detect this drift?
  the user-side complement: QuestBench shows AI cannot identify what information is missing; ASK shows users cannot articulate what knowledge they lack; when both sides of the interaction have information-gathering deficits, neither can help the other resolve underspecification
- Why do AI agents miss most of what users actually want?
  UserBench explores why current models align with user intent only 20% of the time, even when users reveal preferences across multiple turns. The question examines whether agents can learn to actively clarify ambiguous or evolving goals.
  UserBench quantifies the practical cost of the information-gathering deficit: models that cannot identify missing information from underspecified tasks achieve only 20% full intent alignment because three core traits of user communication (underspecification, incrementality, indirectness) demand exactly the capability QuestBench shows models lack
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks?
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions
- Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy
- Divide-or-Conquer? Which Part Should You Distill Your LLM?
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
Original note title
solving well-specified reasoning problems is insufficient for identifying missing information in underspecified tasks