INQUIRING LINE

Why do models struggle with asking questions in multi-turn conversational reasoning tasks?

This explores why LLMs are bad at the active move of asking questions mid-conversation — not just answering, but pausing to find out what they don't yet know — and whether that's a training artifact or something deeper.


The corpus points to a clear culprit before any deeper mystery: how the models were trained. Standard RLHF rewards immediate helpfulness, so models learn to answer now rather than discover what you actually want. Why do language models respond passively instead of asking clarifying questions? frames this directly: next-turn rewards train passive responding, and only rewards that estimate the long-term value of an interaction push a model to actively probe for intent. The flip side appears in Why do language models lose performance in longer conversations?, which argues that the much-discussed drop in multi-turn performance isn't lost capability at all; it's an intent-alignment gap created by training that rewards premature answers over asking.
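To make the reward argument concrete, here is a minimal toy sketch, not the method of any cited paper. It compares a reward that scores only the immediate reply with one that estimates the value of the whole interaction. The probability, the discount, and every function name are illustrative assumptions.

```python
# Toy model: all numbers are illustrative assumptions, not results from the cited work.
P_GUESS_RIGHT = 0.4   # assumed chance that an immediate answer matches the user's intent
DISCOUNT = 0.9        # assumed value retained after spending one extra turn on a question


def next_turn_reward(action: str) -> float:
    """Score only the immediate reply."""
    if action == "answer":
        return P_GUESS_RIGHT   # expected helpfulness of a guess
    return 0.0                 # a question delivers no answer this turn


def multi_turn_value(action: str) -> float:
    """Estimate the value of the whole interaction."""
    if action == "answer":
        return P_GUESS_RIGHT
    return DISCOUNT * 1.0      # ask first, then answer correctly one turn later


def best_action(reward_fn) -> str:
    """Pick the action a reward-maximizing policy would prefer."""
    return max(["answer", "ask"], key=reward_fn)


print(best_action(next_turn_reward))   # answer
print(best_action(multi_turn_value))   # ask
```

Under these assumed numbers, the same policy prefers answering when only the next turn is scored and prefers asking once the later payoff counts. Asking stays the better choice in this toy for as long as the discounted payoff exceeds the chance of guessing right.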

What makes this expensive is that the early guess is sticky. Why do language models fail in gradually revealed conversations? shows, across 200,000+ conversations, a 39% average performance drop in multi-turn settings because models lock onto an incorrect assumption when information is revealed gradually — and then can't recover from it. So the failure to ask isn't a small politeness gap; it's the moment the whole conversation goes wrong.

The encouraging news is that asking is learnable. Can models learn to ask clarifying questions instead of guessing? reports RL training lifting proactive question-asking on flawed problems from near-zero to ~74% — but also that the skill is fragile, and that inference-time scaling actually degraded it in untrained models. Can models learn to ask clarifying questions without explicit training? gets there a different way: train on complete problems and the behavior of asking for missing pieces emerges on its own. And asking *well* is its own subproblem — Can models learn to ask genuinely useful clarifying questions? breaks question quality into attributes like clarity, relevance, and specificity, because a vague clarifying question is barely better than none.

Here's the part you might not expect: even with all that training, something structural may remain. Why do models fail at asking good questions during interaction? tests models on interactive number-guessing and finds GPT-4o scoring only 35%, with information gain collapsing as rounds progress — and SFT, DPO, and Tree-of-Thought all barely move the needle. That suggests reasoning *through* interaction (deciding what to ask to maximize what you learn) is a genuinely different and harder capability than reasoning over information you're handed. A related thread, Why do reasoning models overthink ill-posed questions?, shows reasoning models will churn out long answers to unanswerable questions instead of stopping to say a premise is missing — they were optimized to produce reasoning steps, never to disengage or interrogate the prompt.
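As a rough illustration of what "deciding what to ask to maximize what you learn" means, the sketch below computes the expected information gain of a yes/no question in a number-guessing game. It assumes a secret drawn uniformly from the remaining candidates; the benchmark's actual task format and its information-gain metric may be defined differently, and the function name is hypothetical.

```python
import math


def expected_information_gain(n_candidates: int, n_yes: int) -> float:
    """Expected bits learned from a yes/no question that is true for
    n_yes of n_candidates equally likely secrets (binary entropy of the split)."""
    p = n_yes / n_candidates
    if p in (0.0, 1.0):
        return 0.0  # the answer is already known, so nothing is learned
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


# Secret is one of 100 numbers (an assumed setup).
halving = expected_information_gain(100, 50)     # "Is it greater than 50?"
point_guess = expected_information_gain(100, 1)  # "Is it 37?"

print(round(halving, 2))      # 1.0
print(round(point_guess, 2))  # 0.08
```

A question that splits the candidates evenly yields a full bit, while a direct guess at one number yields under a tenth of a bit on average. A model that keeps making point guesses would show the kind of low per-round information gain the note describes.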

So the answer is layered: training teaches models to answer rather than ask, the first wrong guess compounds, and underneath sits a deeper gap in reasoning-by-interaction that fine-tuning only partly closes. The adjacent lesson worth carrying away — from Why do language models engage with conversational distractors? — is that models reliably learn what-to-do instructions but not what-to-ignore or what-to-question instructions; the missing skill is almost always an absent training signal, not absent capacity.


Sources (9 notes)

Why do language models respond passively instead of asking clarifying questions?

CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.

Why do language models lose performance in longer conversations?

LLMs degrade in multi-turn settings because RLHF training rewards premature answers over clarification-seeking, creating pragmatic mismatch with individual user behaviors. A Mediator-Assistant architecture that explicitly parses user intent before execution recovers lost performance without retraining.

Why do language models fail in gradually revealed conversations?

Across 200,000+ conversations, all major LLMs show 39% average performance drop in multi-turn settings due to locking into incorrect early guesses. Agent mitigations recover only 15-20% of this loss.

Can models learn to ask clarifying questions instead of guessing?

Reinforcement learning training increased proactive critical thinking accuracy from 0.15% to 73.98% on deliberately flawed math problems. Notably, inference-time scaling degraded this ability in untrained models but improved it after RL training, suggesting the capability is learnable but fragile without explicit training.

Can models learn to ask clarifying questions without explicit training?

Models trained via SML on complete problems generalize to underspecified tasks by asking for needed information and delaying answers. The training paradigm instills a meta-strategy of using conversation as an information source, addressing the premature-answering failure mode.

Can models learn to ask genuinely useful clarifying questions?

The ALFA framework breaks down question quality into theory-grounded attributes (clarity, relevance, specificity) and trains models on 80K attribute-specific preference pairs. Attribute-specific optimization outperforms single-score training, especially in clinical reasoning where asking the right clarifying question directly impacts decision quality.

Why do models fail at asking good questions during interaction?

GPT-4o achieves only 35% on interactive number guessing, with information gains collapsing from 7.7% to 2.5% as rounds progress. SFT, DPO, and Tree-of-Thought interventions provide minimal improvement, suggesting the deficit is structural rather than a prompting or fine-tuning problem.

Why do reasoning models overthink ill-posed questions?

Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.

Why do language models engage with conversational distractors?

Fine-tuning on just 1,080 synthetic dialogues with distractor turns significantly improves topic resilience, revealing that the gap is not model capacity but absent training signal. Models learn to follow what-to-do instructions but not what-to-ignore instructions.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: why do LLMs struggle to ask clarifying questions during multi-turn reasoning, and what training or inference changes might unlock this capability?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. Key constraints documented:
- Standard RLHF rewards immediate answers over intent-probing; next-turn rewards alone don't shift this (2025–2026).
- Models lock onto premature assumptions early in conversations; 39% average multi-turn performance drop across 200k+ conversations, with recovery rarely happening (2025).
- Proactive question-asking lifted from ~0% to ~74% via RL on flawed problems, but the skill is fragile; inference-time scaling degraded it in untrained models (2025).
- Interactive reasoning (deciding what to ask to learn most) yields only ~35% accuracy in GPT-4o on number-guessing tasks; SFT, DPO, Tree-of-Thought barely move it (2025).
- Models optimized for reasoning steps produce long answers to unanswerable questions instead of flagging missing premises; interrogating the prompt was never a training signal (2026).

Anchor papers (verify; mind their dates):
- 2505.06120 (LLMs Get Lost In Multi-Turn Conversation)
- 2507.23407 (Beyond Passive Critical Thinking: Fostering Proactive Questioning)
- 2602.07338 (Intent Mismatch Causes LLMs to Get Lost in Multi-Turn Conversation)
- 2602.06176 (Large Language Model Reasoning Failures)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (o3, Claude 4, specialized dialogue agents), training methods (DPO variants, process reward models, multi-agent orchestration), or evaluation harnesses have since RELAXED or OVERTURNED it. Separate the durable question (e.g., "Do models naturally learn to ask?") from the perishable limitation (e.g., "RLHF on next-turn reward alone won't do it"). Where a constraint still holds, say so plainly; where it's been relaxed, cite what resolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — especially studies showing models *do* ask well under conditions these papers didn't test.
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., "Does process-level reward for information gain (not answer quality) flip question-asking behavior?" or "Can questioning emerge as an equilibrium in multi-agent dialogue (human + model + reasoner)?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
