Why do models fail at asking good questions during interaction?
When models must actively seek information through questions rather than receive it passively, they struggle dramatically. This note explores why GPT-4o plateaus at 35% accuracy and whether training or prompting can fix the underlying deficit.
AR-Bench introduces a critical distinction: passive reasoning (all information given, solve the problem) versus active reasoning (information must be sought through interaction). This distinction exposes a capability gap that standard benchmarks completely miss.
The results are stark. On number guessing, a task with well-defined information-theoretic structure, GPT-4o achieves only 35% accuracy. The information gain curve reveals why: models extract 7.7% information gain in rounds 5-10, but this drops to just 2.5% in rounds 20-25. More interaction does not proportionally reduce uncertainty. The models plateau because they cannot formulate increasingly precise questions: they ask vague, repetitive queries that fail to efficiently partition the remaining hypothesis space.
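To make "efficiently partition the remaining hypothesis space" concrete, here is a minimal sketch of expected information gain for a yes/no question. It is a toy illustration and not AR-Bench's actual task protocol or metric: it assumes a hidden integer between 1 and 100, a uniform prior over the remaining candidates, and truthful yes/no answers. The function name `expected_information_gain` and the example questions are hypothetical.

```python
import math


def expected_information_gain(candidates, predicate):
    """Expected entropy reduction, in bits, from asking a yes/no question.

    Assumes a uniform prior over `candidates` (an illustrative assumption).
    Under that prior the expected gain equals the binary entropy of the
    fraction of candidates for which the answer is "yes".
    """
    n = len(candidates)
    yes = sum(1 for c in candidates if predicate(c))
    # A question whose answer is already determined teaches nothing.
    if n == 0 or yes == 0 or yes == n:
        return 0.0
    p = yes / n
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


# Toy hypothesis space: the hidden number is one of 1..100.
candidates = list(range(1, 101))

# A question that splits the space in half is maximally informative.
halving_gain = expected_information_gain(candidates, lambda x: x <= 50)

# A question that singles out one candidate is almost always answered "no".
narrow_gain = expected_information_gain(candidates, lambda x: x == 7)

# Repeating a question after its answer is known yields nothing new.
remaining = [x for x in candidates if x <= 50]
repeated_gain = expected_information_gain(remaining, lambda x: x <= 50)

print(f"halving question:  {halving_gain:.3f} bits")
print(f"narrow question:   {narrow_gain:.3f} bits")
print(f"repeated question: {repeated_gain:.3f} bits")
```

In this toy setting the halving question yields 1 bit, the single-candidate question roughly 0.08 bits, and the repeated question 0 bits. A questioner that drifts toward the latter two kinds of query would show the sort of flattening information gain curve described above, even as the number of rounds grows.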
What makes this finding particularly damaging is the intervention analysis. SFT, DPO, Tree-of-Thought, human-written instructions, Proactive CoT, and Uncertainty-of-Thought (UoT) all provide minimal benefit. The active reasoning deficit is not a prompting problem or a fine-tuning problem — it appears to be a structural limitation in how current models represent and reduce uncertainty through sequential interaction.
This connects directly to Can models identify what information they actually need?, which showed models cannot identify what information is missing even when they can solve the fully-specified version. AR-Bench extends this from identification to acquisition: even when the model has the opportunity to ask questions, it cannot formulate effective ones. The deficit spans the full pipeline — detection, formulation, and iterative refinement of information needs.
The connection to Why do RL agents stop asking informative questions? is structural: both describe systems that fail to escape low-information states. Self-locking describes the mechanism (weak belief tracking creates a trap); AR-Bench measures the behavioral consequence (plateau in information gain despite continued interaction).
The early plateau pattern also resonates with Does more thinking time always improve reasoning accuracy? — both reveal non-monotonic returns to continued processing, whether through more thinking tokens or more interaction rounds. The mechanism differs (overthinking vs. question quality degradation) but the failure mode is analogous: more compute/interaction without better strategy yields diminishing or negative returns.
Set against Can models learn to ask clarifying questions instead of guessing?, the AR-Bench results suggest that even proactive critical thinking may be insufficient: the bottleneck is not willingness to ask but ability to ask well.
Inquiring lines that use this note as a source 5
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- When is GPT model interpretation most likely to diverge from user intent?
- How can a model explain something correctly yet fail to apply it?
- Why do models struggle with asking questions in multi-turn conversational reasoning tasks?
- Why do strong models struggle more with instruction following than mid-tier ones?
- What makes a model fail to activate relevant skills from its own harness?
Related concepts in this collection 5
- Can models identify what information they actually need?
  When a reasoning task is missing a key piece of information, can language models recognize what's absent and ask the right clarifying question? QuestBench tests this capability directly.
  AR-Bench extends from identifying missing info to acquiring it through interaction; both capabilities are deficient
- Why do RL agents stop asking informative questions?
  RL-trained agents often fail to seek information effectively, despite being trained to do so. Understanding whether this reflects a capability gap or a training dynamics problem could reveal how to unlock better information-seeking behavior.
  structural parallel: both describe failure to escape low-information states
- Does more thinking time always improve reasoning accuracy?
  Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
  analogous plateau: more interaction rounds, like more thinking tokens, yield diminishing returns without better strategy
- Can models learn to ask clarifying questions instead of guessing?
  Exploring whether large language models can be trained to detect incomplete queries and actively request missing information rather than hallucinating answers or refusing to respond. This matters because conversational agents today remain passive, responding only when prompted.
  AR-Bench challenges whether proactive asking is sufficient; question quality, not willingness, is the bottleneck
- Why do language models fail in gradually revealed conversations?
  Explores why LLMs perform 39% worse when instructions arrive incrementally rather than upfront, and whether they can recover from early mistakes in multi-turn dialogue.
  related multi-turn failure: premature assumptions prevent effective information gathering
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Evaluating Large Language Models in Theory of Mind Tasks
- Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration
- Can LLMs Ground when they (Don't) Know: A Study on Direct and Loaded Political Questions
- Large Language Models Think Too Fast To Explore Effectively
- From Passive to Active Reasoning: Can Large Language Models Ask the Right Questions under Incomplete Information?
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions
- The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning
Original note title
active reasoning through interaction is dramatically harder than passive reasoning — models plateau early and ask vague repetitive questions