INQUIRING LINE

When an AI faces multiple unknowns, which single question should it ask to learn the most from your answer?

How can models select the optimal question to ask given multiple uncertainties?

This explores how a model decides which single question is worth asking when many things are unknown at once: picking the highest-value question, not just any clarifying prompt. The cleanest answer in the corpus is to make the model simulate the future: for each candidate question, imagine the possible answers a user might give, score how much each answer would shrink the model's uncertainty, and ask the question whose answers reduce uncertainty the most. That information-gain approach [How can models select the most informative question to ask?] turns 'ask a clarifying question' into an optimization problem with a principled objective, rather than a generic 'can you tell me more?' A close cousin appears in personalization, where active learning picks the questions that most sharpen an uncertain estimate of a user's preferences; remarkably, about ten well-chosen questions are enough to pin down someone's reward coefficients [Can user preferences be learned from just ten questions?]. Both treat question selection as choosing the query that most collapses what you don't yet know.
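The simulate-and-score loop described above can be sketched as a one-step expected information gain calculation. This is a minimal illustration, not the method from any cited paper: it assumes a small discrete set of hypotheses about what the user wants, hand-written answer likelihoods, and no multi-step lookahead or reward propagation. All names and the example data are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def expected_information_gain(prior, answer_likelihoods):
    """Expected drop in entropy over hypotheses from asking one question.

    prior: {hypothesis: P(hypothesis)}
    answer_likelihoods: {answer: {hypothesis: P(answer | hypothesis)}}
    """
    expected_posterior_entropy = 0.0
    for likelihood in answer_likelihoods.values():
        # Marginal probability that the user gives this answer.
        p_answer = sum(prior[h] * likelihood[h] for h in prior)
        if p_answer == 0:
            continue  # an answer that can never occur contributes nothing
        # Bayes update: belief over hypotheses after hearing this answer.
        posterior = [prior[h] * likelihood[h] / p_answer for h in prior]
        expected_posterior_entropy += p_answer * entropy(posterior)
    return entropy(prior.values()) - expected_posterior_entropy

def select_question(prior, candidate_questions):
    """Return the candidate with the highest expected information gain."""
    return max(
        candidate_questions,
        key=lambda q: expected_information_gain(prior, candidate_questions[q]),
    )

# Hypothetical example: four equally likely user intents.
prior = {"refund": 0.25, "exchange": 0.25, "repair": 0.25, "tracking": 0.25}
candidate_questions = {
    "Has the item arrived?": {
        "yes": {"refund": 1.0, "exchange": 1.0, "repair": 1.0, "tracking": 0.0},
        "no": {"refund": 0.0, "exchange": 0.0, "repair": 0.0, "tracking": 1.0},
    },
    "Do you want to keep the item?": {
        "yes": {"refund": 0.0, "exchange": 0.0, "repair": 1.0, "tracking": 1.0},
        "no": {"refund": 1.0, "exchange": 1.0, "repair": 0.0, "tracking": 0.0},
    },
}
best = select_question(prior, candidate_questions)
```

In this toy setup the second question splits the four intents evenly and is worth a full bit, while the first only separates one intent from the other three, so `select_question` prefers the even split. In a real system the prior and the answer likelihoods would presumably come from the model's own simulated answers rather than a hand-written table.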

But 'most informative' isn't the same as 'best.' A question can maximize information gain and still be vague, off-topic, or impossible to answer. One line of work decomposes question quality into separate attributes (clarity, relevance, specificity) and trains on each rather than on a single blended score, which matters most in high-stakes settings like clinical reasoning where the right clarifying question directly changes the decision [Can models learn to ask genuinely useful clarifying questions?]. So the full recipe is two-layered: information gain tells you *what to be uncertain about*, and attribute-level quality tells you *how to phrase the probe* so the answer is actually usable.
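One way to picture the two-layer recipe is as a gate followed by a ranking. The sketch below is an inference-time illustration only; it is not the attribute-specific training procedure the cited note describes. The scores, the 0.5 threshold, and the example questions are all made up for illustration.

```python
from dataclasses import dataclass

ATTRIBUTES = ("clarity", "relevance", "specificity")

@dataclass
class Candidate:
    text: str
    info_gain: float  # e.g. expected information gain, in bits
    scores: dict      # per-attribute quality scores in [0, 1]

def passes_quality_gate(candidate, minimum=0.5):
    """True only if every attribute clears the bar on its own."""
    return all(candidate.scores[a] >= minimum for a in ATTRIBUTES)

def pick_question(candidates, minimum=0.5):
    """Most informative candidate among those that pass every attribute check."""
    usable = [c for c in candidates if passes_quality_gate(c, minimum)]
    return max(usable, key=lambda c: c.info_gain) if usable else None

# Hypothetical candidates with made-up scores.
candidates = [
    Candidate("Anything else going on?", 1.4,
              {"clarity": 0.9, "relevance": 0.9, "specificity": 0.2}),
    Candidate("Did the pain start before or after eating?", 1.1,
              {"clarity": 0.8, "relevance": 0.8, "specificity": 0.9}),
]
chosen = pick_question(candidates)
```

The first candidate has the higher information gain and a blended (mean) quality score of about 0.67, so a single averaged score would let it through. Checking each attribute separately catches its low specificity, and the second question is chosen instead.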

There's a prior question the corpus insists on: should the model ask at all? Standard RLHF quietly teaches models *not* to ask, because next-turn reward optimization rewards looking helpful right now over discovering what the user actually wants. Rewarding long-term interaction value instead flips this, letting models actively probe for intent [Why do language models respond passively instead of asking clarifying questions?]. The mirror-image failure is asking, or reasoning, when you shouldn't: models often grind out elaborate answers to questions with missing premises instead of flagging them as unanswerable, because training rewards producing reasoning steps but never teaches when to disengage [Why do reasoning models overthink ill-posed questions?]. Optimal question selection therefore sits between two cliffs: passively answering when it should clarify, and over-engaging when it should stop.

Underneath all of this is the model's sense of its own uncertainty, and the corpus is split on how to measure it. For a related decision, when to retrieve external information, calibrated token-probability uncertainty often beats elaborate multi-call heuristics at a fraction of the cost, suggesting a model's self-knowledge is a reliable signal [Can simple uncertainty estimates beat complex adaptive retrieval?]. Yet cheap *external* features of the question alone can rival uncertainty estimation, especially on hard questions [Can question features alone predict when to retrieve?], and confidence itself is a usable signal: high confidence predicts robustness, low confidence predicts wild output swings [Does model confidence predict robustness to prompt changes?]. The catch: this whole machinery assumes models can represent uncertainty in the first place. Calibration ability exists but is undertrained: small models taught uncertainty-aware objectives and given the option to abstain can match models ten times larger [Can models learn to abstain when uncertain about predictions?], and making abstention an explicitly learnable, rewarded action rather than a failure substantially cuts confident-but-wrong answers [Can three-way rewards fix the accuracy versus abstention problem?].
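As a rough picture of what a token-probability signal looks like, the sketch below turns the log-probabilities of a draft answer into a single confidence number and compares it with a threshold. It is an assumption-laden illustration: the geometric mean is only one possible aggregate, the 0.7 threshold is arbitrary and would need calibrating on held-out data, and the function names are hypothetical.

```python
import math

def sequence_confidence(token_logprobs):
    """Geometric-mean token probability of a generated answer."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def should_seek_information(token_logprobs, threshold=0.7):
    """True when the draft answer is uncertain enough to retrieve or ask first."""
    return sequence_confidence(token_logprobs) < threshold

# Hypothetical per-token probabilities for two draft answers.
confident_draft = [math.log(0.95), math.log(0.9), math.log(0.92)]
shaky_draft = [math.log(0.9), math.log(0.3), math.log(0.4)]
```

With these made-up numbers, `should_seek_information(confident_draft)` is false and `should_seek_information(shaky_draft)` is true. The appeal of this kind of check is that it reuses probabilities the model already produced, so it costs no extra model calls.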

The through-line you might not have expected: selecting the optimal question is really the same skill as deciding *whether to ask, retrieve, think harder, or abstain*: all of them are routing decisions driven by calibrated uncertainty. The corpus shows models can be trained to route between extended thinking and quick answers without difficulty labels [Can models learn when to think versus respond quickly?], and even to hold several candidate solutions open at once by making their internal reasoning stochastic rather than committing early [Can stochastic latent reasoning let models explore multiple solutions?]. Asking the best question, in other words, is one face of a more general competence: knowing precisely what you don't know, and acting on it.

Sources 12 notes

How can models select the most informative question to ask?

UoT (Uncertainty of Thoughts) combines uncertainty-aware scenario simulation with information-gain scoring and reward propagation to identify questions whose possible answers maximally reduce diagnostic uncertainty, providing a principled mechanism for specific, high-value clarification rather than generic prompts.

Can user preferences be learned from just ten questions?

PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce uncertainty about a user's coefficients. Responses can then be personalized to that user via inference-time reward alignment, without modifying model weights.

Can models learn to ask genuinely useful clarifying questions?

The ALFA framework breaks down question quality into theory-grounded attributes (clarity, relevance, specificity) and trains models on 80K attribute-specific preference pairs. Attribute-specific optimization outperforms single-score training, especially in clinical reasoning where asking the right clarifying question directly impacts decision quality.

Why do language models respond passively instead of asking clarifying questions?

CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.

Why do reasoning models overthink ill-posed questions?

Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.

Can simple uncertainty estimates beat complex adaptive retrieval?

Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.

Can question features alone predict when to retrieve?

Learned predictors using 27 lightweight external question features match complex uncertainty-based methods on overall performance while costing far less, and outperform them on complex questions across 6 QA datasets.

Does model confidence predict robustness to prompt changes?

ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.

Can models learn to abstain when uncertain about predictions?

Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.

Can three-way rewards fix the accuracy versus abstention problem?

TruthRL uses three distinct rewards (correct +1, hallucination -1, abstention intermediate) to make abstention learnable. Across four benchmarks, this reduced hallucinations by 28.9% and improved truthfulness by 21.1% compared to binary reward RL.
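The three-way scheme can be written down as a small reward function. This is a sketch under stated assumptions rather than TruthRL's implementation: the note only says the abstention reward is intermediate, so the value 0.0 is an assumption, as are the exact-match correctness check, the abstention marker string, and the form of the binary baseline.

```python
ABSTAIN = "I don't know"  # hypothetical abstention marker

def ternary_reward(prediction, gold, abstain_reward=0.0):
    """+1 for correct, -1 for a wrong answer, abstain_reward for abstaining."""
    if prediction == ABSTAIN:
        return abstain_reward
    return 1.0 if prediction == gold else -1.0

def binary_reward(prediction, gold):
    """Assumed baseline: +1 for correct, -1 otherwise, abstentions included."""
    return 1.0 if prediction == gold else -1.0

def expected_answer_reward(p_correct):
    """Expected ternary reward of committing to an answer."""
    return p_correct * 1.0 + (1.0 - p_correct) * -1.0
```

Under these assumed values, committing to an answer has expected reward 2p - 1 when the model is correct with probability p, so abstaining (reward 0) is the better action whenever p falls below 0.5. The assumed binary baseline scores an abstention the same as a wrong answer, which gives the model no reason to prefer saying it does not know.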

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Can stochastic latent reasoning let models explore multiple solutions?

GRAM replaces deterministic latent updates with stochastic sampling, enabling models to represent probability distributions over solutions rather than single points. This lets recursive reasoners maintain uncertainty, explore alternatives, and handle ambiguous or multi-solution problems that deterministic single-path designs cannot.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions (2.49 match, arXiv)
Adaptive Retrieval Without Self-Knowledge? Bringing Uncertainty Back Home (1.76 match, arXiv)
LLM-Independent Adaptive RAG: Let the Question Speak for Itself (1.74 match, arXiv)
Intent Mismatch Causes LLMs to Get Lost in Multi-Turn Conversation (1.71 match, arXiv)
TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning (1.68 match, arXiv)
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity (1.68 match, arXiv)
Deep Research: A Systematic Survey (1.68 match, arXiv)
Reported Confidence in LLMs Tracks Commitment More Than Correctness (1.67 match, arXiv)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains: **How can models select the optimal question to ask given multiple uncertainties?** This is still open.

What a curated library found — and when (dated claims, not current truth): These findings span Feb 2024–May 2026.
• Information-gain simulation picks high-value clarifying questions by scoring possible answers' uncertainty reduction (2024-02, arXiv:2402.03271).
• ~10 well-chosen active-learning questions suffice to pin down user preference coefficients via reward factorization (2025-03, arXiv:2503.06358).
• Standard RLHF trains models NOT to ask; long-term interaction reward flips this (2026-02, arXiv:2602.07338).
• Decomposing question quality into clarity, relevance, specificity outperforms single blended scores in high-stakes domains (2025-02, arXiv:2502.14860).
• Calibrated token-probability uncertainty and external question features rival elaborate retrieval heuristics; abstention as explicit learnable action cuts confident-but-wrong outputs (2025-01, 2025-06, arXiv:2501.12835 & arXiv:2506.09038).

Anchor papers (verify; mind their dates):
• arXiv:2402.03271 (2024-02): Uncertainty-Aware Planning for Information Seeking
• arXiv:2502.14860 (2025-02): Aligning LLMs to Ask Good Questions in Clinical Reasoning
• arXiv:2602.07338 (2026-02): Intent Mismatch in Multi-Turn Conversation
• arXiv:2506.09038 (2025-06): AbstentionBench on Unanswerable Questions

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For information-gain selection: do newer training methods (e.g., DPO, preference-learning frameworks beyond RLHF) still require explicit uncertainty supervision, or do they infer it implicitly? Has the finding that ~10 questions suffice held as model scale or task complexity grew, or does it degrade? Does the RLHF penalty against asking persist in instruction-tuned or constitutional-AI models, or have newer alignment methods relaxed it?
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last 6 months. Does arXiv:2605.19376 (Generative Recursive Reasoning, 2026-05) or arXiv:2505.13379 (Thinkless, 2025-05) suggest that question selection itself is subsumed by learned "when-to-think" routing that bypasses explicit uncertainty modeling?
(3) **Propose 2 durable research questions** assuming the regime may have shifted:
   - Can models learn to ask the right question *without* simulating future answers — e.g., via direct gradient feedback on conversation outcomes rather than information gain?
   - Does the decomposition of question quality (clarity, relevance, specificity) remain necessary once models are trained on long-horizon reward that penalizes unanswerable or irrelevant clarifications?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
