SYNTHESIS NOTE


Can models learn to ask genuinely useful clarifying questions?

Explores whether question-asking quality is teachable through decomposing it into specific attributes like clarity and relevance, rather than treating it as a monolithic skill.

Synthesis note · 2026-02-22 · sourced from Conversation Topics Dialog

The ALFA (Aligning LLMs to Ask) framework addresses a specific capability gap: LLMs fail to ask effective questions under uncertainty, making them unreliable in domains where proactive information-gathering is essential for decision-making.

The framework has three components:

Decompose: break "good question" down into theory-grounded attributes (e.g., clarity, relevance, specificity).
Synthesize: controllably generate attribute-specific question variations (80K preference pairs).
Align: apply preference-based optimization so the model learns to ask better questions along each fine-grained attribute.
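As a rough illustration of the three components above, the sketch below shows one way attribute-specific preference pairs might be represented and scored. It is an assumption-laden sketch, not ALFA's implementation: the names PreferencePair, make_pairs, and pairwise_loss are hypothetical, the attribute list is only the three examples named in this note, and the Bradley-Terry style logistic loss stands in for whatever preference objective the framework actually uses.

```python
import math
from dataclasses import dataclass

# Attribute names taken from this note's examples; the full attribute set may differ.
ATTRIBUTES = ("clarity", "relevance", "specificity")


@dataclass(frozen=True)
class PreferencePair:
    """Hypothetical record: two questions that differ along one attribute."""
    context: str    # conversation so far
    attribute: str  # the single attribute this pair varies
    chosen: str     # variation that is better on that attribute
    rejected: str   # variation that is worse on that attribute


def make_pairs(context, variations):
    """Build one pair per attribute from a dict of (better, worse) rewrites."""
    pairs = []
    for attribute in ATTRIBUTES:
        if attribute in variations:
            better, worse = variations[attribute]
            pairs.append(PreferencePair(context, attribute, better, worse))
    return pairs


def pairwise_loss(score_chosen, score_rejected):
    """Assumed objective: -log(sigmoid(score_chosen - score_rejected))."""
    margin = score_chosen - score_rejected
    return math.log1p(math.exp(-margin))
```

The loss shrinks as the model scores the chosen question further above the rejected one. Keeping the attribute on each pair is what lets training target one dimension at a time.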

Applied to clinical reasoning using the MediQ-AskDocs dataset (17K real-world clinical interactions), ALFA demonstrates that question quality is not unitary — a question can be clear but irrelevant, or relevant but ambiguous. Decomposing quality into attributes and training against each one produces better overall question-asking than optimizing for a single "question quality" score.
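A small sketch can make the "not unitary" point concrete. The scores below are invented for illustration and are not taken from ALFA or MediQ-AskDocs; the function names are hypothetical. Two questions with opposite strengths collapse to the same single score, while a per-attribute comparison still yields a usable preference signal.

```python
from statistics import mean

# Invented scores in [0, 1], for illustration only.
clear_but_irrelevant = {"clarity": 0.9, "relevance": 0.2, "specificity": 0.7}
relevant_but_ambiguous = {"clarity": 0.2, "relevance": 0.9, "specificity": 0.7}


def overall(scores):
    """Collapse attribute scores into one 'question quality' number."""
    return mean(scores.values())


def per_attribute_preference(a, b):
    """For each attribute, report which question scores higher: 'a', 'b', or 'tie'."""
    prefs = {}
    for attribute in a:
        if a[attribute] > b[attribute]:
            prefs[attribute] = "a"
        elif a[attribute] < b[attribute]:
            prefs[attribute] = "b"
        else:
            prefs[attribute] = "tie"
    return prefs
```

With these numbers the single score cannot tell the two questions apart, whereas the per-attribute view prefers the first on clarity and the second on relevance.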

The clinical domain makes the stakes concrete: a doctor who asks the wrong clarifying question may miss a critical symptom. Models that excel at static medical QA benchmarks still fail at the interactive task of gathering missing information through conversation. The linked note "Can models learn to ask clarifying questions instead of guessing?" covers whether models can be trained to ask at all; ALFA provides the methodology for making those clarifying questions actually good, not just present.

This connects to the broader clarification design finding in the linked note "Which clarifying questions actually improve user satisfaction?": specific questions outperform requests to rephrase. The attribute decomposition explains why: a question high on specificity and relevance but low on verbosity will outperform one that merely paraphrases the user's need. Attribute-specific training can target exactly the dimensions that matter.

PerQs provides practical validation of attribute-based question quality at scale. The Active Listening system populates prompt templates with 400+ real user interests (aggregated from ~39K anonymous user models) and generates personalized Q&A pairs (~19K total) via LLM. Deployed in the Alexa Prize, personalized questions showed significant positive effects on perceived conversation quality. The interest-personalization dimension demonstrates that "good questions" are not just structurally well-formed (ALFA's clarity, relevance, and specificity attributes) but also content-aligned with user interests, a dimension that attribute-specific training could incorporate as an additional quality axis.
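The template-population step described above can be sketched as follows. This is a minimal sketch under stated assumptions: the template wording, the function names build_prompts and generate_questions, and the llm callable are all hypothetical, since the note does not reproduce the real PerQs prompts or model interface.

```python
from itertools import product

# Hypothetical templates; the real PerQs prompt text is not given in this note.
TEMPLATES = (
    "Write a friendly question for someone who enjoys {interest}.",
    "Write a follow-up question that invites a story about {interest}.",
)


def build_prompts(interests, templates=TEMPLATES):
    """Fill every template with every interest, skipping blank interests."""
    cleaned = [i.strip() for i in interests if i.strip()]
    return [t.format(interest=i) for i, t in product(cleaned, templates)]


def generate_questions(interests, llm):
    """Map each filled prompt to the text returned by a caller-supplied LLM function."""
    return {prompt: llm(prompt) for prompt in build_prompts(interests)}
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM client; in practice llm would wrap whichever API generates the Q&A pairs.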

Inquiring lines that read this note 69

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How can recommendation systems balance personalization with stability and coverage?

How do attribute-asking strategies depend on current confidence in candidate items?

What makes specific clarifying questions more effective than generic ones?

Why do benchmark improvements fail to reflect actual reasoning quality?

Could AI assessment quality differ across subjects or question formats?

Why do LLM chatbots fail as independent therapeutic agents?

How can models identify insufficient information and respond appropriately without guessing?

Why do language models reinforce false assumptions instead of correcting them?

What dimensions of recommendation quality do standard metrics miss?

What measurement artifacts emerge when annotators interpret the same question differently?

How should personalization be implemented to improve AI assistant effectiveness?

Can personalized questions improve conversation quality in open-domain chat?

How do training data properties shape reasoning capability development?

How do training priors constrain what context information can override?

What makes dialogue-based explanation more successful than monologue?

How can language models sustain linguistic synchrony and intersubjectivity during dialogue?

How can emotions function as reliable information in reasoning and cognitive systems?

Can language models understand the implicit emotional intent behind questions?

Can AI systems balance emotional competence with factual reliability?

Does current empathetic AI misalign with how humans actually ask questions?

How should conversational agents balance goal-driven initiative with user control?

Can ensemble evaluation methods reduce bias more than single judges?

Why do correct reasoning traces tend to be shorter than incorrect ones?

How does random walk length control reasoning complexity in question generation?

How do language models establish social grounding in human dialogue?

Can static word-sharing create genuine communicative grounding between humans and models?

What properties determine whether reward signals teach genuine reasoning?

Why do multi-turn conversations degrade AI intent and coherence?

Why do weaker language models fail at multi-turn strategic questioning?

How do social dynamics and selection effects compound in rating aggregates?

Why do more detailed rating systems sometimes improve learning from reviews?

Does reinforcement learning teach reasoning or just when to reason?

What makes weaker teacher models effective for stronger student training?

What filtering criteria best identify student-compatible refinements from teacher models?

Does RLHF training sacrifice accuracy and grounding for user agreement?

Why do reward structures fail to shape long-term agent learning?

Can AI learn intrinsic motivation to assess its own relevance?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

Do scheme critical questions work better than direct scheme classification prompts?

How can AI alignment serve diverse human preferences at scale?

How much does forcing single-choice answers damage alignment with complex intent?

How does example difficulty affect learning efficiency in language models?

Why do explicit quality criteria outperform learning quality from examples alone?

Can model confidence signals reliably improve reasoning quality and calibration?

Can thought quality alone be trusted to guide model training?

Can prompting inject entirely new knowledge into language models?

Can structured questioning prompts improve reasoning beyond standard conversational training?

How should models express uncertainty rather than forced confident answers?

How can models select the optimal question to ask given multiple uncertainties?

Related concepts in this collection 4

Concept map

14 direct connections · 104 in 2-hop network · medium cluster


Can models learn to ask clarifying questions instead of guessing? Exploring whether large language models can be trained to detect incomplete queries and actively request missing information rather than hallucinating answers or refusing to respond. This matters because conversational agents today remain passive, responding only when prompted.
ALFA provides the quality methodology for the proactive questioning capability
Which clarifying questions actually improve user satisfaction? Not all clarification helps equally. This explores whether asking users to rephrase their needs works as well as asking targeted questions about specific information gaps.
attribute decomposition explains why specific questions outperform rephrasing
Can models identify what information they actually need? When a reasoning task is missing a key piece of information, can language models recognize what's absent and ask the right clarifying question? QuestBench tests this capability directly.
ALFA directly trains the missing-information identification + question-asking capability
What makes strategic question-asking succeed or fail? Explores whether excellent performance at multi-turn questioning requires one dominant skill or the coordinated interaction of multiple distinct capabilities. Matters because many real-world tasks (diagnosis, troubleshooting, clarification) depend on this ability.
20Q reveals the three capabilities strategic questioning requires; ALFA's attribute-specific training directly shapes the planning component (question efficiency, specificity)

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

Original note title

training models to ask good questions requires decomposing quality into theory-grounded attributes and aligning via attribute-specific preference optimization
