Why do models fail at asking good questions during interaction?

When models must actively seek information through questions rather than receive it passively, they struggle dramatically. This note explores why GPT-4o plateaus at 35% accuracy and whether training or prompting can fix the underlying deficit.

Synthesis note · 2026-04-18 · sourced from Reasoning Methods CoT ToT

AR-Bench introduces a critical distinction: passive reasoning (all information given, solve the problem) versus active reasoning (information must be sought through interaction). This distinction exposes a capability gap that standard benchmarks completely miss.

The results are stark. On number guessing, a task with well-defined information-theoretic structure, GPT-4o achieves only 35% accuracy. The information gain curve reveals why: models extract 7.7% information gain in rounds 5-10, but only 2.5% in rounds 20-25. More interaction does not proportionally reduce uncertainty. The models plateau because they cannot formulate increasingly precise questions: they ask vague, repetitive queries that fail to partition the remaining hypothesis space efficiently.
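To make "partitioning the hypothesis space" concrete, here is a minimal sketch of expected information gain for a yes/no question. It assumes a simplified threshold-guessing game over the integers 1 to 100 with a uniform prior. This is an illustration only, not the AR-Bench task or its metric, and the names `expected_information_gain`, `balanced`, `lopsided`, and `repeated` are hypothetical.

```python
import math


def entropy_bits(n_candidates: int) -> float:
    """Entropy, in bits, of a uniform belief over n_candidates hypotheses."""
    return math.log2(n_candidates) if n_candidates > 0 else 0.0


def expected_information_gain(candidates, question) -> float:
    """Expected entropy reduction from asking one yes/no question.

    `question` is a predicate over candidates. A uniform prior is assumed,
    so each answer's probability is the fraction of candidates giving it.
    """
    yes = [c for c in candidates if question(c)]
    no = [c for c in candidates if not question(c)]
    total = len(candidates)
    # Expected entropy left after hearing the answer.
    remaining = sum(
        (len(part) / total) * entropy_bits(len(part))
        for part in (yes, no)
        if part
    )
    return entropy_bits(total) - remaining


candidates = list(range(1, 101))  # assumed hypothesis space: 1..100


def balanced(x):
    return x <= 50  # splits the space in half


def lopsided(x):
    return x <= 5  # rarely true, so usually uninformative


def repeated(x):
    return x <= 100  # answer already known: always "yes"


for name, q in [("balanced", balanced), ("lopsided", lopsided), ("repeated", repeated)]:
    print(f"{name}: {expected_information_gain(candidates, q):.3f} bits")
```

Under these assumptions, the balanced question yields one full bit, the lopsided one yields a fraction of a bit, and the repeated one yields nothing. A questioner that drifts toward the last two kinds shows the flattening information gain curve described above, no matter how many rounds it is given.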

What makes this finding particularly damaging is the intervention analysis. Supervised fine-tuning (SFT), direct preference optimization (DPO), Tree-of-Thought, human-written instructions, Proactive CoT, and Uncertainty-of-Thought (UoT) all provide minimal benefit. The active reasoning deficit is not a prompting problem or a fine-tuning problem; it appears to be a structural limitation in how current models represent and reduce uncertainty through sequential interaction.

This connects directly to Can models identify what information they actually need?, which showed models cannot identify what information is missing even when they can solve the fully-specified version. AR-Bench extends this from identification to acquisition: even when the model has the opportunity to ask questions, it cannot formulate effective ones. The deficit spans the full pipeline — detection, formulation, and iterative refinement of information needs.

The connection to Why do RL agents stop asking informative questions? is structural: both describe systems that fail to escape low-information states. Self-locking describes the mechanism (weak belief tracking creates a trap); AR-Bench measures the behavioral consequence (plateau in information gain despite continued interaction).

The early plateau pattern also resonates with Does more thinking time always improve reasoning accuracy? — both reveal non-monotonic returns to continued processing, whether through more thinking tokens or more interaction rounds. The mechanism differs (overthinking vs. question quality degradation) but the failure mode is analogous: more compute/interaction without better strategy yields diminishing or negative returns.

Set against Can models learn to ask clarifying questions instead of guessing?, the AR-Bench results suggest that even proactive critical thinking may be insufficient: the bottleneck is not willingness to ask but ability to ask well.

Inquiring lines that read this note 5

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How can AI systems learn from failures without cascading errors?

When is GPT model interpretation most likely to diverge from user intent?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

How can a model explain something correctly yet fail to apply it?

How can models identify insufficient information and respond appropriately without guessing?

Why do models struggle with asking questions in multi-turn conversational reasoning tasks?

What capability tradeoffs emerge when scaling model reasoning abilities?

Why do strong models struggle more with instruction following than mid-tier ones?

Do base models contain latent reasoning that training can unlock?

What makes a model fail to activate relevant skills from its own harness?

Related concepts in this collection 5

Can models identify what information they actually need?
When a reasoning task is missing a key piece of information, can language models recognize what's absent and ask the right clarifying question? QuestBench tests this capability directly.
AR-Bench extends from identifying missing info to acquiring it through interaction; both capabilities are deficient

Why do RL agents stop asking informative questions?
RL-trained agents often fail to seek information effectively, despite being trained to do so. Understanding whether this reflects a capability gap or a training dynamics problem could reveal how to unlock better information-seeking behavior.
structural parallel: both describe failure to escape low-information states

Does more thinking time always improve reasoning accuracy?
Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
analogous plateau: more interaction rounds, like more thinking tokens, yield diminishing returns without better strategy

Can models learn to ask clarifying questions instead of guessing?
Exploring whether large language models can be trained to detect incomplete queries and actively request missing information rather than hallucinating answers or refusing to respond. This matters because conversational agents today remain passive, responding only when prompted.
AR-Bench challenges whether proactive asking is sufficient; question quality, not willingness, is the bottleneck

Why do language models fail in gradually revealed conversations?
Explores why LLMs perform 39% worse when instructions arrive incrementally rather than upfront, and whether they can recover from early mistakes in multi-turn dialogue.
related multi-turn failure: premature assumptions prevent effective information gathering

Original note title

active reasoning through interaction is dramatically harder than passive reasoning — models plateau early and ask vague repetitive questions