Aligning LLMs to Ask Good Questions: A Case Study in Clinical Reasoning
Large language models (LLMs) often fail to ask effective questions under uncertainty, making them unreliable in domains where proactive information-gathering is essential for decision-making. We present ALFA, a framework that improves LLM question-asking by (i) decomposing the notion of a “good” question into a set of theory-grounded attributes (e.g., clarity, relevance), (ii) controllably synthesizing attribute-specific question variations, and (iii) aligning models via preference-based optimization to explicitly learn to ask better questions along these fine-grained attributes. Focusing on clinical reasoning as a case study, we introduce the MediQ-AskDocs dataset, composed of 17k real-world clinical interactions augmented with 80k attribute-specific preference pairs of follow-up questions, as well as a novel expert-annotated interactive healthcare QA task to evaluate question-asking abilities. Models aligned with ALFA reduce diagnostic errors by 56.6% on MediQ-AskDocs compared to SOTA instruction-tuned LLMs, with a question-level win-rate of 64.4% and strong generalizability. Our findings suggest that explicitly guiding question-asking with structured, fine-grained attributes offers a scalable path to improve LLMs, especially in expert application domains.
Introduction. Interactive language models have demonstrated remarkable capabilities across numerous domains (OpenAI et al., 2024), yet their proactive interaction abilities in high-stakes scenarios such as clinical reasoning, legal analysis, and investigative journalism remain a challenge (Fung et al., 2024). A key obstacle is these models' limited ability to recognize and anticipate missing or ambiguous information and to proactively seek clarification (Li et al., 2024; Deng et al., 2024). In clinical practice, for instance, physicians systematically ask patients questions to rule out or confirm relevant diagnoses (Richardson et al., 1995; Proffit, 2013). This iterative, information-seeking behavior is essential for accurate and safe decision-making. Similarly, for large language models (LLMs) to serve as reliable decision-support tools for clinicians, they must learn not only to provide answers, but also to identify when additional information is needed, and to ask follow-up questions that effectively reduce uncertainty (Figure 1).
Discussion / Conclusion. Effective question-asking is a fundamental yet underdeveloped capability in large language models, particularly in high-stakes domains like clinical reasoning. We proposed ALFA, a framework that explicitly teaches models to ask better questions by decomposing question quality into theory-grounded, fine-grained attributes and aligning models along these attributes through preference-based optimization, rather than treating such a nuanced and complex goal as a monolithic objective. We introduced MediQ-AskDocs, a comprehensive dataset of training data, preference data, and a healthcare QA task, showing that models trained with ALFA substantially outperform baselines. While focused on medicine as a case study, ALFA is a general recipe adaptable to any field where clear, targeted questioning is essential, paving the way for interactive and reliable systems.
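To make the recipe summarized above more concrete, the following is a minimal sketch of what an attribute-specific preference pair and a preference-based objective could look like. It rests on several assumptions that are not stated in the text: the `PreferencePair` record and its field names are hypothetical, the example questions are invented for illustration, and the loss shown is the standard Direct Preference Optimization (DPO) objective, used here only as one common instance of preference-based optimization rather than as the method ALFA necessarily uses.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """Hypothetical record: two follow-up questions that differ along one attribute."""
    attribute: str  # e.g. "clarity" or "relevance" (attributes named in the abstract)
    context: str    # the patient's message so far (invented example text below)
    chosen: str     # question variation that is better along `attribute`
    rejected: str   # question variation that is worse along `attribute`


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair (an assumed choice of preference objective).

    Inputs are sequence log-probabilities under the policy being trained
    and under a frozen reference model. `beta` is an illustrative default.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Invented example pair for the "clarity" attribute.
pair = PreferencePair(
    attribute="clarity",
    context="I've had chest pain since yesterday.",
    chosen="Does the pain get worse when you take a deep breath?",
    rejected="Is there anything about the pain that changes in any way?",
)

# Loss is lower when the policy favors the chosen question more than the reference does.
loss_good = dpo_loss(-4.0, -6.0, -5.0, -5.0)
loss_bad = dpo_loss(-6.0, -4.0, -5.0, -5.0)
```

Keeping the attribute as an explicit field on each pair is one way to let a training pipeline filter, weight, or mix pairs per attribute, which matches the fine-grained framing of the approach; how ALFA actually combines attributes is not specified in this section.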