Do models naturally learn to ask clarifying questions without explicit supervision?
This explores whether asking clarifying questions emerges on its own from ordinary training, or whether it has to be deliberately taught — and what kinds of training make it appear.
The short version the corpus points to: left to standard training, models don't ask. They actively learn *not* to. The interesting part is that the capability can emerge without anyone explicitly labeling "good questions", but only under training setups built to reward it.
The default pulls the wrong way. Conventional RLHF optimizes for being helpful *right now*, on the current turn, which quietly punishes the model for pausing to ask anything: answering immediately scores better than admitting it needs more information (see "Why do language models respond passively instead of asking clarifying questions?"). The same passivity shows up in reasoning models, which will grind out long answers to questions that are missing a premise instead of flagging them as unanswerable; training taught them to *produce reasoning steps* but never taught them *when to disengage* (see "Why do reasoning models overthink ill-posed questions?"). So the absence of clarifying behavior isn't a neutral gap; it's something standard objectives select against.
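A toy comparison can make the incentive problem concrete. The sketch below is an illustration only: the per-turn scores are invented, and the two reward functions are simplified stand-ins, not the reward models used in RLHF or in CollabLLM. It shows how a reward that sees only the first response prefers guessing, while a reward summed over the whole exchange prefers asking.

```python
# Toy illustration only: every score below is invented for the example.
from typing import Callable, List


def next_turn_reward(turn_scores: List[float]) -> float:
    """Reward that only sees the first response in the exchange."""
    return turn_scores[0]


def multi_turn_reward(turn_scores: List[float], discount: float = 1.0) -> float:
    """Discounted sum of per-turn scores over the whole exchange."""
    return sum(score * discount ** i for i, score in enumerate(turn_scores))


# Hypothetical helpfulness ratings for two policies on an underspecified request.
GUESS_NOW = [0.6, 0.2]  # plausible-sounding guess, then a correction turn
ASK_FIRST = [0.3, 0.9]  # clarifying question, then a well-targeted answer


def preferred(reward_fn: Callable[[List[float]], float]) -> str:
    """Name the policy that the given reward function scores higher."""
    return "ask_first" if reward_fn(ASK_FIRST) > reward_fn(GUESS_NOW) else "guess_now"


print(preferred(next_turn_reward))
print(preferred(multi_turn_reward))
```

Under these made-up numbers the question costs something on the first turn and pays off on the second, which is exactly the trade a turn-level objective cannot see.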
But "without explicit supervision" turns out to have a surprising answer: yes, the behavior can emerge, if you change *what* the model is trained on rather than hand-labeling questions. Social meta-learning (SML) trains models only on fully-specified problems, yet they generalize to underspecified ones by asking for what's missing and delaying their answer. Nobody supervised the questions; the model learned a meta-strategy of treating conversation itself as a source of information (see "Can models learn to ask clarifying questions without explicit training?" and "Can LLMs learn to ask for feedback during problem solving?"). STaR-GATE goes further through self-play: a model finetunes on its own questions that happen to improve its answers, and is preferred over the base model 72% of the time after two iterations, so preference elicitation turns out to be trainable without any human writing the questions (see "Can models learn to ask better clarifying questions through self-improvement?"). This is the same family as broader unsupervised self-improvement loops, where a proposer or challenger sets the problems and a majority vote or judge supplies the reward, manufacturing a training signal with no human labels (see "Can language models improve themselves without any external training data?" and "Can language models learn skills without human supervision?").
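The self-play idea reduces to a keep-what-helps filter, sketched below. This is a simplified illustration of that idea, not STaR-GATE's actual procedure: `select_improving_questions`, `demo_propose`, `demo_score`, and the scores in `DEMO_SCORES` are all hypothetical names and values introduced here. The point is that the selection signal comes from answer quality, so no human ever labels a question as good.

```python
from typing import Callable, List, Optional, Tuple


def select_improving_questions(
    tasks: List[str],
    propose: Callable[[str], List[str]],
    score: Callable[[str, Optional[str]], float],
) -> List[Tuple[str, str]]:
    """Keep, per task, the self-generated question with the largest positive gain."""
    kept: List[Tuple[str, str]] = []
    for task in tasks:
        baseline = score(task, None)  # answer quality with no question asked
        best_question, best_gain = None, 0.0
        for question in propose(task):
            gain = score(task, question) - baseline
            if gain > best_gain:
                best_question, best_gain = question, gain
        if best_question is not None:
            kept.append((task, best_question))
    return kept


# Hypothetical stand-ins for the model's proposals and the answer-quality scorer.
DEMO_SCORES = {None: 0.4, "What is your budget?": 0.7, "Do you like colors?": 0.35}


def demo_propose(task: str) -> List[str]:
    return ["What is your budget?", "Do you like colors?"]


def demo_score(task: str, question: Optional[str]) -> float:
    return DEMO_SCORES[question]


# The kept (task, question) pairs would form the next round's finetuning data.
finetune_set = select_improving_questions(["Recommend a laptop"], demo_propose, demo_score)
print(finetune_set)
```

Questions that lower answer quality, or leave it unchanged, never enter the finetuning set, which is how the loop can improve questioning without question-level supervision.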
The catch, and this is the thing worth knowing, is that the capability is *learnable but fragile*. One study used reinforcement learning to push proactive identification of missing information on deliberately flawed math problems from 0.15% to nearly 74%, but found that inference-time scaling (letting the model think longer) actually *degraded* the behavior in untrained models and only helped after the RL training was in place (see "Can models learn to ask clarifying questions instead of guessing?"). So thinking harder doesn't make a model ask better questions; it makes an untrained model rationalize a guess more elaborately. And when explicit teaching *is* used, decomposing "a good question" into named attributes (clarity, relevance, specificity) beats training on a single quality score, especially in high-stakes settings like clinical reasoning (see "Can models learn to ask genuinely useful clarifying questions?").
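One way to see why attribute-level training can carry more signal than a single score is to look at what each scheme does with the same pair of questions. The sketch below is a hedged illustration, not the ALFA pipeline: the 1 to 5 ratings, the example questions, and both helper functions are assumptions made up for the example. When two questions trade off clarity against relevance, their totals tie and a single-score scheme produces no preference pair, while an attribute-level scheme produces one pair per attribute on which they differ.

```python
from typing import Dict, List, Tuple

Ratings = Dict[str, int]       # attribute name -> hypothetical 1-5 rating
Rated = Tuple[str, Ratings]    # (question text, its ratings)


def single_score_pairs(a: Rated, b: Rated) -> List[Tuple[str, str]]:
    """Return [(preferred, rejected)] from summed ratings; empty on a tie."""
    total_a, total_b = sum(a[1].values()), sum(b[1].values())
    if total_a == total_b:
        return []
    return [(a[0], b[0])] if total_a > total_b else [(b[0], a[0])]


def attribute_pairs(a: Rated, b: Rated) -> List[Tuple[str, str, str]]:
    """Return (attribute, preferred, rejected) for each attribute where ratings differ."""
    pairs = []
    for attr in sorted(a[1]):
        if a[1][attr] == b[1][attr]:
            continue
        winner, loser = (a, b) if a[1][attr] > b[1][attr] else (b, a)
        pairs.append((attr, winner[0], loser[0]))
    return pairs


# Invented ratings: Q_A is clearer, Q_B is more relevant, totals are equal.
Q_A: Rated = ("How long have you had the pain?", {"clarity": 5, "relevance": 2, "specificity": 3})
Q_B: Rated = ("Any recent changes, like with meds or such?", {"clarity": 3, "relevance": 4, "specificity": 3})

print(single_score_pairs(Q_A, Q_B))
print(attribute_pairs(Q_A, Q_B))
```

The tie is contrived, but it makes the general point: collapsing attributes into one number discards the information about *why* one question is better than another.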
The deeper reason this can't be left to chance connects to a hard ceiling elsewhere in the corpus: prompting and prompt optimization can only reorganize what a model already has from training, never supply what training left out (see "Can prompt optimization teach models knowledge they lack?"). Asking clarifying questions is a *behavioral disposition* rather than a stored fact, and by the same logic you can't prompt your way to it if training drove it out. The honest synthesis: models do not naturally learn to ask. But you don't need to supervise the questions themselves to get the behavior. You need to supervise the *incentive*, by training on underspecification or rewarding long-horizon interaction rather than next-turn helpfulness.
Sources (10 notes)
CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.
Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.
Models trained via SML on complete problems generalize to underspecified tasks by asking for needed information and delaying answers. The training paradigm instills a meta-strategy of using conversation as an information source, addressing the premature-answering failure mode.
Research shows that reformulating static tasks as pedagogical dialogues—where a teacher has privileged information and the student must learn to extract it—trains models to actively engage conversation as a problem-solving tool, not just imitate dialogue patterns.
STaR-GATE iteratively finetunes a model on questions that increase response quality, achieving 72% preference over the base model after two iterations. The research shows preference elicitation is trainable through self-play without human question supervision.
SQLM uses a proposer-solver framework where the proposer generates calibrated problems and the solver learns via majority-vote verification. Both agents improve through RL alone, creating an automatic curriculum that scales without human labels or ground-truth answers.
Ctx2Skill's three-role self-play loop manufactures missing feedback through internal signals: the Challenger escalates difficulty as curriculum, the Judge gives binary verdicts as reward, and both sides evolve via natural-language skill edits. Success requires balancing adversarial pressure against a generalization safeguard to prevent collapse.
Reinforcement learning training increased proactive critical thinking accuracy from 0.15% to 73.98% on deliberately flawed math problems. Notably, inference-time scaling degraded this ability in untrained models but improved it after RL training, suggesting the capability is learnable but fragile without explicit training.
The ALFA framework breaks down question quality into theory-grounded attributes (clarity, relevance, specificity) and trains models on 80K attribute-specific preference pairs. Attribute-specific optimization outperforms single-score training, especially in clinical reasoning where asking the right clarifying question directly impacts decision quality.
Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.