Can models learn to ask clarifying questions instead of guessing?
This note explores whether large language models can be trained to detect incomplete queries and actively request the missing information, rather than hallucinating answers or refusing to respond. This matters because conversational agents today remain passive, responding only when prompted.
Current LLMs face three failure modes when receiving flawed or incomplete queries: they hallucinate an answer, they refuse to respond, or they provide a generic "I need more information" deflection. None of these is productive. The proactive critical thinking paradigm introduces a fourth option: identify specifically what is missing and generate a targeted question to request it.
The GSM-MC benchmark tests this by deliberately removing key variables from math problems. Results are dramatic:
- Vanilla models: 0.15% accuracy on proactive critical thinking tasks
- After RL training: 73.98% accuracy (Qwen3-1.7B)
- SFT alone: effective but RL is generally superior
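To make the four response modes and the scoring idea concrete, here is a minimal Python sketch. It is not the GSM-MC construction or grading procedure: the `IncompleteProblem` class, the keyword-matching grader `is_targeted_question`, and the example problem are all hypothetical, chosen only to illustrate how a response that names the missing quantity differs from a guess, a refusal, or a generic deflection.

```python
from dataclasses import dataclass


@dataclass
class IncompleteProblem:
    """A math word problem with one key quantity deliberately removed."""
    text: str                # problem statement shown to the model
    missing_keywords: tuple  # hypothetical: phrases that name the removed quantity


def is_targeted_question(response: str, problem: IncompleteProblem) -> bool:
    """Return True if the response asks a question that names the missing quantity.

    This is a crude keyword heuristic used only for illustration.
    """
    text = response.lower()
    asks_something = "?" in text
    names_the_gap = any(keyword in text for keyword in problem.missing_keywords)
    return asks_something and names_the_gap


def accuracy(responses, problems) -> float:
    """Fraction of responses that ask a targeted question about the gap."""
    hits = sum(is_targeted_question(r, p) for r, p in zip(responses, problems))
    return hits / len(problems)


# Made-up example: the number of notebooks has been removed from the problem.
problem = IncompleteProblem(
    text="Maria buys some notebooks at $3 each. How much does she spend?",
    missing_keywords=("how many", "number of notebooks"),
)

# One made-up reply per response mode discussed in the text.
responses = {
    "hallucination": "She spends $15.",
    "refusal": "I cannot answer this question.",
    "generic deflection": "I need more information. Can you clarify?",
    "targeted question": "How many notebooks does Maria buy?",
}

for mode, reply in responses.items():
    print(f"{mode}: {is_targeted_question(reply, problem)}")
```

Only the last reply counts as proactive critical thinking under this heuristic. A real grader would need something more robust than keyword matching, since a targeted question can be phrased in many ways.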
The near-zero baseline reveals something important: despite extensive post-training that makes these models excellent at reasoning, they have almost no ability to detect when a problem is ill-posed and actively seek the missing piece. This is a specific capability gap, not a general reasoning limitation.
A striking secondary finding: inference-time scaling (activating "thinking mode") actually degrades proactive critical thinking in vanilla models. The extended thinking induces "counterproductive self-doubt rather than useful analysis, leading to a clear drop in performance." But after RL training, thinking mode becomes beneficial — the same mechanism that hurts untrained models helps trained ones.
This finding matters beyond math: a patient omitting critical symptoms, a user providing incomplete specifications, a student asking an ambiguous question. All of these require the agent to identify what's missing and ask, not just refuse or guess. In terms of the related note "Why can't conversational AI agents take the initiative?", proactive critical thinking is a concrete, trainable instance of the broader proactivity gap that note describes.
ProCoT (Proactive Chain-of-Thought) extends the paradigm from individual queries to multi-turn goal planning: rather than just detecting missing information in a single exchange, models generate explicit reasoning chains about conversation goals and plan proactive interventions across turns. This bridges proactive critical thinking (reactive: "this query is incomplete") with proactive dialogue (strategic: "given the user's goal, I should ask about X before they realize they need it").
The ALFA framework for clinical reasoning extends this by showing that question quality is multidimensional — a question can be clear but irrelevant, or relevant but ambiguous. ALFA decomposes "good question" into theory-grounded attributes (clarity, relevance, specificity) and trains against each via 80K attribute-specific preference pairs. This addresses a gap: proactive critical thinking shows models can learn to ask, but ALFA shows they need attribute-specific training to ask well. Additionally, research on clarifying question design shows that specific-facet questions ("What type of monitor?") consistently outperform need-rephrasing questions ("Can you be more specific?") for user satisfaction — the form of the question matters as much as the decision to ask.
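One way to picture attribute-specific preference data is as records that each vary a question along a single attribute. The sketch below is an assumption about how such pairs could be represented, not ALFA's actual data format: the `PreferencePair` class, the `group_by_attribute` helper, and the example questions are hypothetical, and only the three attribute names come from the text above.

```python
from collections import defaultdict
from dataclasses import dataclass

# Attribute names taken from the note; everything else here is illustrative.
ATTRIBUTES = ("clarity", "relevance", "specificity")


@dataclass(frozen=True)
class PreferencePair:
    """Two candidate questions for one context, differing on one attribute."""
    context: str    # what the user or patient has said so far
    attribute: str  # the single attribute this pair is meant to isolate
    chosen: str     # the question that is better on that attribute
    rejected: str   # the question that is worse on that attribute

    def __post_init__(self):
        if self.attribute not in ATTRIBUTES:
            raise ValueError(f"unknown attribute: {self.attribute!r}")


def group_by_attribute(pairs):
    """Bucket pairs so each attribute can be trained or evaluated separately."""
    grouped = defaultdict(list)
    for pair in pairs:
        grouped[pair.attribute].append(pair)
    return dict(grouped)


# Made-up examples. The specificity pair mirrors the contrast between a
# specific-facet question and a need-rephrasing question.
pairs = [
    PreferencePair(
        context="I want to buy a monitor.",
        attribute="specificity",
        chosen="What type of monitor?",
        rejected="Can you be more specific?",
    ),
    PreferencePair(
        context="I have had chest pain since yesterday.",
        attribute="relevance",
        chosen="Does the pain get worse when you exert yourself?",
        rejected="What did you have for breakfast last week?",
    ),
]

for attribute, items in group_by_attribute(pairs).items():
    print(attribute, len(items))
```

Keeping one attribute per pair is the design point: a question that is clear but irrelevant and one that is relevant but ambiguous end up in different buckets, so neither failure is hidden inside a single overall "good question" label.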
Inquiring lines that use this note as a source 60
This note is a source for these synthesized inquiries.
- Can dialogue systems abstain from responding when uncertainty is too high?
- Can AI systems identify important unanswered questions that emerge during reasoning?
- Why can't language models conduct genuine Socratic questioning in therapy sessions?
- Can proactive critical thinking alone enable models to request clarification effectively?
- Can language systems learn when to ask for clarification instead of choosing one reading?
- Why do conversational queries drift away from what triggered them?
- Why do dialogue systems fail to detect declarative clarification requests?
- How do humans decide which level of clarification to request?
- Why do longer queries benefit less from clarification questions?
- Can models identify information gaps without just guessing or refusing to answer?
- Why does adding more conversational data fail to improve maintenance skills?
- Can models infer maintenance operations from conversational text data alone?
- Can real-time detection identify when users have incomplete or underdeveloped intent?
- Why do large language models fail at taking conversational initiative?
- Can AI learn when to speak in a conversation?
- When should agents use clarification commands instead of assuming intent?
- Can language models understand the implicit emotional intent behind questions?
- Why do pretrained retrievers struggle with ambiguous or implicit queries?
- Can conversation analysis predict when agents should ask users for clarification?
- Can proactive critical thinking train models to request clarification actively?
- How does ambiguity detection connect to models' ability to ask clarifying questions?
- How should systems reject queries outside their trained domain?
- What data would be needed to train proactive conversational systems?
- What structural changes enable agents to ask clarifying questions?
- Can models learn to identify what information is missing from questions?
- Do reasoning models overthink ill-posed questions instead of recognizing incompleteness?
- Can models identify what information they are missing in underspecified tasks?
- How does proactive critical thinking enable models to identify missing information?
- Can attribute-specific preference optimization improve question quality in information-seeking?
- Why do weaker language models fail at multi-turn strategic questioning?
- Can language models ask clarifying questions when sentences are ambiguous?
- Why do chatbots fail to recognize when someone is ambivalent about change?
- Do models trained for reasoning lose their ability to decline questions?
- Can question quality be trained separately from the decision to ask?
- How do conversational agents overcome structural passivity and goal awareness gaps?
- What distinguishes proactive information provision from proactive clarification seeking?
- Can models learn when to think versus answer directly?
- Why do language models prefer certain response styles regardless of what the prompt asks?
- Can language models recognize when to ignore off-topic information in conversations?
- What training approach enables models to proactively request clarification?
- Can models distinguish between ambiguous and incomplete information inputs?
- Can models learn to stop thinking when a question lacks necessary information?
- Can reinforcement learning teach AI when to ask clarifying questions?
- Can models be trained to explain instead of imitate answers?
- Why does selective conversation history outperform including all prior context?
- Can AI take initiative by questioning without being proactive in directive ways?
- What communicative work do fluent conversations perform that AI systems skip?
- Why do conversational agents lack the goal awareness needed to lead rather than just respond?
- Why do specific clarifying questions outperform rephrased versions of user needs?
- How does proactive information-gathering capability differ from passive knowledge retrieval?
- Why do specific clarifying questions outperform generic requests for clarity?
- How does proactive critical thinking detect when information is incomplete?
- Why do models struggle with asking questions in multi-turn conversational reasoning tasks?
- Can Q-priming further strengthen clarifying question behavior beyond social meta-learning alone?
- What training objectives could reduce completion bias in autonomous agents?
- Can models learn to ask clarifying questions instead of making assumptions?
- How can agents detect missing information before attempting to solve problems?
- Why do language models overthink simple questions when given extra time?
- Do models naturally learn to ask clarifying questions without explicit supervision?
- Which types of clarifying questions actually help users versus wasting their time?
Related concepts in this collection 15
- Why can't conversational AI agents take the initiative?
Explores whether current LLMs lack the structural ability to lead conversations, set goals, or anticipate user needs—and what architectural changes might enable proactive dialogue.
proactive critical thinking is a specific trainable form of the general proactivity gap
- Can models identify what information they actually need?
When a reasoning task is missing a key piece of information, can language models recognize what's absent and ask the right clarifying question? QuestBench tests this capability directly.
QuestBench confirms: well-specified reasoning ≠ missing-information detection
- When does explicit reasoning actually help model performance?
Explicit reasoning improves some tasks but hurts others. What determines whether step-by-step reasoning chains are beneficial or harmful for a given problem?
thinking mode degrades proactive questioning in vanilla models (another case of reasoning-type mismatch)
- Can models learn to ask genuinely useful clarifying questions?
Explores whether question-asking quality is teachable through decomposing it into specific attributes like clarity and relevance, rather than treating it as a monolithic skill.
ALFA provides the quality methodology for making clarifications effective
- Which clarifying questions actually improve user satisfaction?
Not all clarification helps equally. This explores whether asking users to rephrase their needs works as well as asking targeted questions about specific information gaps.
question form matters as much as decision to ask
- Can RL agents learn to reason better, not just succeed?
Standard outcome-only RL rewards agents for any successful trajectory, even flawed ones. Can we instead train agents to demonstrate genuine reasoning quality by rewarding the metacognitive process itself?
complementary metacognitive RL: RLVMR trains monitoring/reflection during agentic execution; proactive critical thinking trains missing-information detection before reasoning begins; both operationalize metacognition as trainable RL objectives
- When should AI agents ask users instead of just searching?
Explores whether tool-enabled LLMs should probe users for clarification when uncertain, rather than silently chaining tool calls that drift from intent. Examines conversation analysis patterns as a formal alternative.
CA's insert-expansion framework provides the conversational structure (pre-second, post-first) for deploying proactive questioning in dialogue contexts
- Why do reasoning models overthink ill-posed questions?
Explores why models trained for extended reasoning produce drastically longer, less useful responses to unanswerable questions—and whether this represents a fixable training deficit or inherent limitation.
describes the behavioral failure proactive critical thinking corrects: without training, models ruminate unproductively on missing-premise questions; RL training transforms counterproductive self-doubt into targeted clarification
- How can models select the most informative question to ask?
Explores whether simulating possible futures and scoring questions by information gain can identify which clarifying question would best reduce uncertainty—moving beyond just deciding whether to ask toward deciding what to ask.
complementary capability: proactive critical thinking detects THAT information is missing; UoT determines WHICH question most efficiently recovers it
- What makes strategic question-asking succeed or fail?
Explores whether excellent performance at multi-turn questioning requires one dominant skill or the coordinated interaction of multiple distinct capabilities. Matters because many real-world tasks (diagnosis, troubleshooting, clarification) depend on this ability.
20Q reveals the three-capability synergy needed beyond mere detection: state tracking, planning, and inductive reasoning must work together
- Why do language models lose performance in longer conversations?
Does multi-turn degradation stem from fundamental model limitations, or from misalignment between what users mean and what models assume? Understanding the root cause could guide better solutions.
the trainable capability complement to the Mediator-Assistant architecture: proactive questioning addresses the intent alignment gap from the capability side while the Mediator addresses it architecturally
- Can AI agents learn when they have something worth saying?
What if AI proactivity came from modeling intrinsic motivation to participate rather than predicting who speaks next? This explores whether a framework based on human cognitive patterns—internal thought generation parallel to conversation—can make agents genuinely responsive rather than passively reactive.
complementary proactivity approaches: proactive critical thinking trains the capability to detect missing information; Inner Thoughts provides the motivational architecture for deciding when to deploy it in social conversation contexts
- Can conversations themselves personalize without user profiles?
Can a conversational AI learn about user traits and adapt in real time by rewarding itself for asking insightful questions, rather than relying on pre-collected profiles or historical data?
complementary uncertainty reduction: proactive critical thinking detects missing task information; curiosity reward reduces uncertainty about who the user is; both reward active information-seeking over passive response generation, but targeting different knowledge gaps (task-level vs user-level)
- Why do users drift away from their original information need?
When users know their knowledge is incomplete but cannot articulate what's missing, do they unintentionally shift topics? And can real-time systems detect this drift?
the user-side complement: proactive critical thinking trains the AI to detect missing information, but ASK shows users themselves cannot articulate what they lack; combining ASK detection (84% precision) with proactive questioning could intervene before topic drift compounds the underspecification
- Why do AI agents miss most of what users actually want?
UserBench explores why current models align with user intent only 20% of the time, even when users reveal preferences across multiple turns. The question examines whether agents can learn to actively clarify ambiguous or evolving goals.
UserBench quantifies the cost of absent proactive questioning: the <30% preference discovery rate confirms that current models lack the proactive critical thinking needed to surface underspecified user intents
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration
- Can Large Language Models Reason and Optimize Under Constraints?
- From Passive to Active Reasoning: Can Large Language Models Ask the Right Questions under Incomplete Information?
- Aligning LLMs to Ask Good Questions: A Case Study in Clinical Reasoning
- LLMs can implicitly learn from mistakes in-context
- Proactive Conversational Agents in the Post-ChatGPT World
- Learning to Learn from Language Feedback with Social Meta-Learning
- Missing Premise exacerbates Overthinking: Are Reasoning Models losing Critical Thinking Skill?
Original note title
proactive critical thinking enables models to identify missing information and actively request clarification rather than passively refusing or hallucinating answers