SYNTHESIS NOTE

Can models learn to ask clarifying questions instead of guessing?

Exploring whether large language models can be trained to detect incomplete queries and actively request missing information rather than hallucinating answers or refusing to respond. This matters because conversational agents today remain passive, responding only when prompted.

Synthesis note · 2026-02-22 · sourced from Conversation Agents

Current LLMs face three failure modes when receiving flawed or incomplete queries: they hallucinate an answer, they refuse to respond, or they provide a generic "I need more information" deflection. None of these is productive. The proactive critical thinking paradigm introduces a fourth option: identify specifically what is missing and generate a targeted question to request it.

The GSM-MC benchmark tests this by deliberately removing key variables from math problems. Results are dramatic:

Vanilla models: 0.15% accuracy on proactive critical thinking tasks
After RL training: 73.98% accuracy (Qwen3-1.7B)
SFT alone: effective, though RL is generally superior

The near-zero baseline reveals something important: despite extensive post-training that makes these models excellent at reasoning, they have almost no ability to detect when a problem is ill-posed and actively seek the missing piece. This is a specific capability gap, not a general reasoning limitation.
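To make the evaluation setup concrete, here is a minimal sketch of how a missing-condition item might be built and how a response might be scored. It is not the benchmark's actual construction or metric: the `IncompleteItem` type, the helper names, the keyword-matching check, and the toy problem are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class IncompleteItem:
    question: str                    # problem text with one condition removed
    missing_keywords: Sequence[str]  # hypothetical: phrases naming the removed condition


def remove_condition(sentences, drop_index, keywords):
    """Build an incomplete item by dropping one sentence from a problem."""
    kept = [s for i, s in enumerate(sentences) if i != drop_index]
    return IncompleteItem(" ".join(kept), tuple(keywords))


def is_targeted_question(response, item):
    """Crude heuristic (an assumption, not the paper's metric):
    the response asks a question AND names the missing condition."""
    text = response.lower()
    asks = "?" in text
    names_gap = any(k.lower() in text for k in item.missing_keywords)
    return asks and names_gap


def accuracy(responses, items):
    """Fraction of responses that ask for the specific missing piece."""
    hits = sum(is_targeted_question(r, it) for r, it in zip(responses, items))
    return hits / len(items)


# Toy problem (made up for this sketch); the second sentence is the key variable.
sentences = [
    "Tom buys 3 boxes of pencils.",
    "Each box holds 12 pencils.",
    "How many pencils does Tom have in total?",
]
item = remove_condition(sentences, 1, ["each box", "per box"])

responses = [
    "How many pencils are in each box?",  # targeted question
    "Tom has 36 pencils.",                # hallucinated answer
    "I cannot answer this question.",     # refusal
    "Can you be more specific?",          # generic deflection
]
print(accuracy(responses, [item] * len(responses)))  # 0.25
```

Only the first response counts as a hit, which mirrors the distinction drawn above: hallucinating, refusing, and generic deflection all fail, while naming the missing variable succeeds. A real evaluation would need a more robust judge than keyword matching.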

A striking secondary finding: inference-time scaling (activating "thinking mode") actually degrades proactive critical thinking in vanilla models. The extended thinking induces "counterproductive self-doubt rather than useful analysis, leading to a clear drop in performance." But after RL training, thinking mode becomes beneficial — the same mechanism that hurts untrained models helps trained ones.

This finding matters beyond math. A patient omitting critical symptoms, a user providing incomplete specifications, a student asking an ambiguous question: each requires the agent to identify what is missing and ask, not just refuse or guess. In terms of the related note "Why can't conversational AI agents take the initiative?", proactive critical thinking is a concrete, trainable instantiation of the broader proactivity gap.

ProCoT (Proactive Chain-of-Thought) extends the paradigm from individual queries to multi-turn goal planning: rather than just detecting missing information in a single exchange, models generate explicit reasoning chains about conversation goals and plan proactive interventions across turns. This bridges proactive critical thinking (reactive: "this query is incomplete") with proactive dialogue (strategic: "given the user's goal, I should ask about X before they realize they need it").

The ALFA framework for clinical reasoning extends this by showing that question quality is multidimensional — a question can be clear but irrelevant, or relevant but ambiguous. ALFA decomposes "good question" into theory-grounded attributes (clarity, relevance, specificity) and trains against each via 80K attribute-specific preference pairs. This addresses a gap: proactive critical thinking shows models can learn to ask, but ALFA shows they need attribute-specific training to ask well. Additionally, research on clarifying question design shows that specific-facet questions ("What type of monitor?") consistently outperform need-rephrasing questions ("Can you be more specific?") for user satisfaction — the form of the question matters as much as the decision to ask.
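One way to picture attribute-specific preference data is as pairs of questions labelled with the single attribute they differ on. The sketch below is an assumption about how such records could be organised, not ALFA's actual data format; `PreferencePair`, `group_by_attribute`, and the example records are hypothetical (the monitor questions reuse the examples quoted above, the clinical pair is made up).

```python
from collections import defaultdict
from dataclasses import dataclass

# Attributes named in the note; the real framework may define more.
ATTRIBUTES = ("clarity", "relevance", "specificity")


@dataclass(frozen=True)
class PreferencePair:
    attribute: str  # the one attribute on which the two questions differ
    context: str    # the user turn that prompted the question
    chosen: str     # question that is better on that attribute
    rejected: str   # question that is worse on that attribute


def group_by_attribute(pairs):
    """Bucket pairs so each attribute can be trained or evaluated separately."""
    grouped = defaultdict(list)
    for pair in pairs:
        if pair.attribute not in ATTRIBUTES:
            raise ValueError(f"unknown attribute: {pair.attribute}")
        grouped[pair.attribute].append(pair)
    return dict(grouped)


pairs = [
    PreferencePair(
        attribute="specificity",
        context="I need a new monitor.",
        chosen="What type of monitor?",         # specific-facet question
        rejected="Can you be more specific?",   # need-rephrasing question
    ),
    PreferencePair(
        attribute="relevance",
        context="Patient reports chest pain.",  # made-up example
        chosen="When did the pain start?",
        rejected="What is your favorite food?",
    ),
]
by_attribute = group_by_attribute(pairs)
print(sorted(by_attribute))  # ['relevance', 'specificity']
```

Keeping the attribute on each record is what lets a question be judged as clear but irrelevant, or relevant but ambiguous, instead of collapsing quality into a single score.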

Inquiring lines that read this note 61

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How should dialogue systems represent uncertainty from noisy speech input?

Can dialogue systems abstain from responding when uncertainty is too high?

How can models identify insufficient information and respond appropriately without guessing?

Why do LLM chatbots fail as independent therapeutic agents?

Why can't language models conduct genuine Socratic questioning in therapy sessions?

Why do language models reinforce false assumptions instead of correcting them?

How can language models sustain linguistic synchrony and intersubjectivity during dialogue?

How should conversational agents balance goal-driven initiative with user control?

What makes dialogue-based explanation more successful than monologue?

How do humans decide which level of clarification to request?

What makes specific clarifying questions more effective than generic ones?

Why do multi-turn conversations degrade AI intent and coherence?

How do formal dialogue structures reveal conversation coherence mechanisms?

Can models infer maintenance operations from conversational text data alone?

How can emotions function as reliable information in reasoning and cognitive systems?

Can language models understand the implicit emotional intent behind questions?

How should retrieval systems optimize for multi-step reasoning during inference?

How do chatbots affect human self-disclosure and emotional engagement?

Why do chatbots fail to recognize when someone is ambivalent about change?

What capability tradeoffs emerge when scaling model reasoning abilities?

Can models learn when to think versus answer directly?

What prevents language models from reliably adopting diverse personas?

Why do language models prefer certain response styles regardless of what the prompt asks?

Does reinforcement learning teach reasoning or just when to reason?

Can reinforcement learning teach AI when to ask clarifying questions?

How do training data properties shape reasoning capability development?

Can models be trained to explain instead of imitate answers?

How should dialogue systems best leverage conversation history for retrieval?

How should iterative research systems allocate reasoning per search step?

How does proactive information-gathering capability differ from passive knowledge retrieval?

How do training priors constrain what context information can override?

Can Q-priming further strengthen clarifying question behavior beyond social meta-learning alone?

Why do agents confidently report success despite actually failing tasks?

What training objectives could reduce completion bias in autonomous agents?

When do additional thinking tokens stop improving reasoning performance?

Why do language models overthink simple questions when given extra time?

Related concepts in this collection 15

Why can't conversational AI agents take the initiative? Explores whether current LLMs lack the structural ability to lead conversations, set goals, or anticipate user needs—and what architectural changes might enable proactive dialogue.
proactive critical thinking is a specific trainable form of the general proactivity gap
Can models identify what information they actually need? When a reasoning task is missing a key piece of information, can language models recognize what's absent and ask the right clarifying question? QuestBench tests this capability directly.
QuestBench confirms: well-specified reasoning ≠ missing-information detection
When does explicit reasoning actually help model performance? Explicit reasoning improves some tasks but hurts others. What determines whether step-by-step reasoning chains are beneficial or harmful for a given problem?
thinking mode degrades proactive questioning in vanilla models (another case of reasoning-type mismatch)
Can models learn to ask genuinely useful clarifying questions? Explores whether question-asking quality is teachable through decomposing it into specific attributes like clarity and relevance, rather than treating it as a monolithic skill.
ALFA provides the quality methodology for making clarifications effective
Which clarifying questions actually improve user satisfaction? Not all clarification helps equally. This explores whether asking users to rephrase their needs works as well as asking targeted questions about specific information gaps.
question form matters as much as decision to ask
Can RL agents learn to reason better, not just succeed? Standard outcome-only RL rewards agents for any successful trajectory, even flawed ones. Can we instead train agents to demonstrate genuine reasoning quality by rewarding the metacognitive process itself?
complementary metacognitive RL: RLVMR trains monitoring/reflection during agentic execution; proactive critical thinking trains missing-information detection before reasoning begins; both operationalize metacognition as trainable RL objectives
When should AI agents ask users instead of just searching? Explores whether tool-enabled LLMs should probe users for clarification when uncertain, rather than silently chaining tool calls that drift from intent. Examines conversation analysis patterns as a formal alternative.
CA's insert-expansion framework provides the conversational structure (pre-second, post-first) for deploying proactive questioning in dialogue contexts
Why do reasoning models overthink ill-posed questions? Explores why models trained for extended reasoning produce drastically longer, less useful responses to unanswerable questions—and whether this represents a fixable training deficit or inherent limitation.
describes the behavioral failure proactive critical thinking corrects: without training, models ruminate unproductively on missing-premise questions; RL training transforms counterproductive self-doubt into targeted clarification
How can models select the most informative question to ask? Explores whether simulating possible futures and scoring questions by information gain can identify which clarifying question would best reduce uncertainty—moving beyond just deciding whether to ask toward deciding what to ask.
complementary capability: proactive critical thinking detects THAT information is missing; UoT determines WHICH question most efficiently recovers it
What makes strategic question-asking succeed or fail? Explores whether excellent performance at multi-turn questioning requires one dominant skill or the coordinated interaction of multiple distinct capabilities. Matters because many real-world tasks (diagnosis, troubleshooting, clarification) depend on this ability.
20Q reveals the three-capability synergy needed beyond mere detection: state tracking, planning, and inductive reasoning must work together
Why do language models lose performance in longer conversations? Does multi-turn degradation stem from fundamental model limitations, or from misalignment between what users mean and what models assume? Understanding the root cause could guide better solutions.
the trainable capability complement to the Mediator-Assistant architecture: proactive questioning addresses the intent alignment gap from the capability side while the Mediator addresses it architecturally
Can AI agents learn when they have something worth saying? What if AI proactivity came from modeling intrinsic motivation to participate rather than predicting who speaks next? This explores whether a framework based on human cognitive patterns—internal thought generation parallel to conversation—can make agents genuinely responsive rather than passively reactive.
complementary proactivity approaches: proactive critical thinking trains the capability to detect missing information; Inner Thoughts provides the motivational architecture for deciding when to deploy it in social conversation contexts
Can conversations themselves personalize without user profiles? Can a conversational AI learn about user traits and adapt in real time by rewarding itself for asking insightful questions, rather than relying on pre-collected profiles or historical data?
complementary uncertainty reduction: proactive critical thinking detects missing task information; curiosity reward reduces uncertainty about who the user is; both reward active information-seeking over passive response generation, but targeting different knowledge gaps (task-level vs user-level)
Why do users drift away from their original information need? When users know their knowledge is incomplete but cannot articulate what's missing, do they unintentionally shift topics? And can real-time systems detect this drift?
the user-side complement: proactive critical thinking trains the AI to detect missing information, but ASK shows users themselves cannot articulate what they lack; combining ASK detection (84% precision) with proactive questioning could intervene before topic drift compounds the underspecification
Why do AI agents miss most of what users actually want? UserBench explores why current models align with user intent only 20% of the time, even when users reveal preferences across multiple turns. The question examines whether agents can learn to actively clarify ambiguous or evolving goals.
UserBench quantifies the cost of absent proactive questioning: the <30% preference discovery rate confirms that current models lack the proactive critical thinking needed to surface underspecified user intents

Original note title

proactive critical thinking enables models to identify missing information and actively request clarification rather than passively refusing or hallucinating answers