SYNTHESIS NOTE
Conversational AI and Personalization · Reasoning, Retrieval, and Evaluation · Psychology, Society, and Alignment

Can models learn to ask genuinely useful clarifying questions?

Explores whether question-asking quality is teachable by decomposing it into specific attributes such as clarity and relevance, rather than treating it as a monolithic skill.

Synthesis note · 2026-02-22 · sourced from Conversation Topics Dialog

The ALFA (Aligning LLMs to Ask) framework addresses a specific capability gap: LLMs fail to ask effective questions under uncertainty, making them unreliable in domains where proactive information-gathering is essential for decision-making.

The framework has three components:

  1. Decompose: break "good question" down into theory-grounded attributes (e.g., clarity, relevance, specificity)
  2. Synthesize: controllably generate attribute-specific question variations (80K preference pairs)
  3. Align: apply preference-based optimization so the model learns to ask better questions along each fine-grained attribute
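
The Decompose and Synthesize steps above can be sketched as a small data-construction routine. This is a minimal illustration, not ALFA's actual pipeline: the `PreferencePair` type, the `build_pairs` helper, and the example questions are all hypothetical, and in the real framework the variants are generated controllably rather than written by hand. What it shows is that each preference pair isolates exactly one attribute.

```python
from dataclasses import dataclass

# Attribute names are the note's examples; the full framework may use others.
ATTRIBUTES = ("clarity", "relevance", "specificity")


@dataclass(frozen=True)
class PreferencePair:
    context: str    # conversation so far
    attribute: str  # the single attribute this pair isolates
    chosen: str     # variant that is better on `attribute`
    rejected: str   # variant that is worse on `attribute`


def build_pairs(context: str, variants: dict[str, tuple[str, str]]) -> list[PreferencePair]:
    """`variants` maps attribute -> (improved_question, degraded_question)."""
    pairs = []
    for attribute in ATTRIBUTES:
        if attribute not in variants:
            continue  # no variants were synthesized for this attribute
        improved, degraded = variants[attribute]
        pairs.append(PreferencePair(context, attribute, improved, degraded))
    return pairs


# Hypothetical hand-written example standing in for synthesized variants.
context = "Patient: I've had a headache for three days."
variants = {
    "clarity": ("Where exactly is the pain located?",
                "Is it, like, the head thing or somewhere?"),
    "specificity": ("Does the pain get worse when you bend forward?",
                    "Can you tell me more?"),
}
pairs = build_pairs(context, variants)
for pair in pairs:
    print(pair.attribute, "|", pair.chosen, ">", pair.rejected)
```

Because every pair carries its attribute label, a later alignment step can treat clarity pairs and specificity pairs separately instead of pooling them under one quality label.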

Applied to clinical reasoning using the MediQ-AskDocs dataset (17K real-world clinical interactions), ALFA demonstrates that question quality is not unitary — a question can be clear but irrelevant, or relevant but ambiguous. Decomposing quality into attributes and training against each one produces better overall question-asking than optimizing for a single "question quality" score.
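
A toy example helps show why a single score hides the distinction drawn above. The attribute scores below are invented for illustration and do not come from the ALFA results; `overall_score` and `weakest_attribute` are hypothetical helpers.

```python
from statistics import mean

# Hypothetical attribute scores in [0, 1]; not taken from any reported results.
profiles = {
    "clear_but_irrelevant": {"clarity": 0.9, "relevance": 0.1},
    "relevant_but_ambiguous": {"clarity": 0.1, "relevance": 0.9},
}


def overall_score(profile: dict[str, float]) -> float:
    """Collapse a profile into one number, as a monolithic quality score would."""
    return mean(profile.values())


def weakest_attribute(profile: dict[str, float]) -> str:
    """Name the attribute a per-attribute training signal would target."""
    return min(profile, key=profile.get)


for name, profile in profiles.items():
    print(name, round(overall_score(profile), 2), weakest_attribute(profile))
```

Both questions collapse to the same overall score, yet they fail in opposite ways. Only the per-attribute view says which dimension needs to improve.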

The clinical domain makes the stakes concrete: a doctor who asks the wrong clarifying question may miss a critical symptom. Models that excel at static medical QA benchmarks still fail at the interactive task of gathering missing information through conversation. Where the related note "Can models learn to ask clarifying questions instead of guessing?" concerns whether models ask at all, ALFA provides the methodology for making those clarifying questions actually good, not just present.

This connects to the broader clarification design finding in the related note "Which clarifying questions actually improve user satisfaction?". The attribute decomposition offers an explanation for that finding: a question high on specificity and relevance but low on verbosity will outperform one that merely paraphrases the user's need. Attribute-specific training can target exactly the dimensions that matter.
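
One way to picture "targeting the dimensions that matter" is to weight a pairwise preference loss by attribute. The sketch below uses the Direct Preference Optimization (DPO) loss as a stand-in. The note only says "preference-based optimization", so the choice of DPO, the `attribute_weights` values, and the log-probabilities in `batch` are all assumptions made for illustration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    x = beta * margin
    # Numerically stable form of softplus(-x), which equals -log(sigmoid(x)).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Hypothetical weights: emphasize the attributes judged to matter most.
attribute_weights = {"relevance": 1.0, "specificity": 1.0, "clarity": 0.5}


def weighted_batch_loss(batch):
    """batch: list of (attribute, logp_c, logp_r, ref_logp_c, ref_logp_r)."""
    total, weight_sum = 0.0, 0.0
    for attribute, lc, lr, rc, rr in batch:
        w = attribute_weights.get(attribute, 0.0)  # unknown attributes are ignored
        total += w * dpo_loss(lc, lr, rc, rr)
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0


# Made-up sequence log-probabilities under the policy and a frozen reference model.
batch = [
    ("relevance", -12.0, -15.0, -13.0, -14.0),  # policy already prefers the chosen question
    ("clarity", -10.0, -10.0, -10.0, -10.0),    # policy is indifferent
]
print(weighted_batch_loss(batch))
```

Raising the weight on an attribute makes its pairs contribute more to the gradient, which is one simple reading of attribute-specific training. Training a separate model or reward head per attribute is another.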

PerQs provides practical validation of attribute-based question quality at scale. The Active Listening system populates prompt templates with 400+ real user interests (aggregated from ~39K anonymous user models) and uses an LLM to generate personalized Q&A pairs (~19K total). Deployed in the Alexa Prize, personalized questions showed significant positive effects on perceived conversation quality. The interest-personalization dimension demonstrates that "good questions" are not just structurally well-formed (ALFA's clarity, relevance, and specificity attributes) but also content-aligned with user interests, a dimension that attribute-specific training could incorporate as an additional quality axis.
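
The template-population step can be sketched as follows. The note does not show the actual PerQs templates or interest list, so `TEMPLATES`, `interests`, and `build_prompts` are hypothetical, and the LLM call that would turn each prompt into a Q&A pair is left out.

```python
from itertools import product
from string import Template

# Hypothetical templates; the real system's wording is not given in the note.
TEMPLATES = [
    Template("Write a friendly question asking the user about their interest in $interest."),
    Template("Write a follow-up question about what the user enjoys most about $interest."),
]

# Hypothetical sample; the note reports 400+ interests aggregated from user models.
interests = ["hiking", "jazz", "baking"]


def build_prompts(templates, interests):
    """Fill every template with every interest, keeping the interest as metadata."""
    return [
        {"interest": interest, "prompt": template.substitute(interest=interest)}
        for interest, template in product(interests, templates)
    ]


prompts = build_prompts(TEMPLATES, interests)
for item in prompts[:2]:
    print(item["interest"], "->", item["prompt"])
```

Keeping the interest alongside each prompt lets the generated question be matched to a user's profile at conversation time.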

Original note title

training models to ask good questions requires decomposing quality into theory-grounded attributes and aligning via attribute-specific preference optimization