INQUIRING LINE


An AI can ace hard problems yet be completely blind to the fact that a key piece of information is missing.

How can agents detect missing information before attempting to solve problems?

This explores whether AI agents can recognize that a problem is under-specified — that they're missing a key fact — *before* they charge ahead and solve it, rather than confidently answering the wrong question.

The most striking finding in the corpus is that solving a problem and noticing what is missing from it are two different skills that don't come bundled. A model can ace a fully specified reasoning problem and still fall apart when one variable is quietly withheld: accuracy on identifying the right clarifying question to ask drops to 40-50%, because gathering information and executing a solution are separable cognitive operations (see "Can models identify what information they actually need?"). So 'good at problems' does not imply 'good at noticing what's absent from the problem.'
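To make the separation concrete, here is a minimal sketch of gap detection as a step that runs before any solving. It assumes a toy setting in which the problem is a set of derivation rules over named variables; the rule format, the function names, and the example problem are all hypothetical and are not taken from the benchmark the note describes.

```python
from typing import FrozenSet, List, Set, Tuple

# Hypothetical rule format: if every variable in `inputs` is known,
# `output` can be derived from them.
Rule = Tuple[FrozenSet[str], str]


def derivable(known: Set[str], rules: List[Rule]) -> Set[str]:
    """Forward-chain over the rules until no new variable can be derived."""
    reached = set(known)
    changed = True
    while changed:
        changed = False
        for inputs, output in rules:
            if output not in reached and inputs <= reached:
                reached.add(output)
                changed = True
    return reached


def clarifying_candidates(
    known: Set[str], rules: List[Rule], target: str, askable: Set[str]
) -> List[str]:
    """Return [] if the target is already derivable. Otherwise return the
    single variables that, if the user supplied them, would make it derivable."""
    if target in derivable(known, rules):
        return []
    return sorted(
        v for v in askable - known if target in derivable(known | {v}, rules)
    )


# Toy problem (an assumption for illustration):
# distance follows from speed and time, cost follows from distance and rate.
RULES: List[Rule] = [
    (frozenset({"speed", "time"}), "distance"),
    (frozenset({"distance", "rate"}), "cost"),
]
ASKABLE = {"speed", "time", "rate", "distance"}

# The time is quietly withheld, so the agent should ask rather than solve.
print(clarifying_candidates({"speed", "rate"}, RULES, "cost", ASKABLE))
```

The point of the sketch is only that `clarifying_candidates` never computes a cost. It answers a different question from the solver, which is why strength at one need not transfer to the other.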

The encouraging news is that the missing-information radar can be trained. With reinforcement learning, models went from catching almost nothing (0.15%) to flagging deliberately flawed math problems 73.98% of the time, proactively identifying gaps and requesting clarification instead of guessing (see "Can models learn to ask clarifying questions instead of guessing?"). But the capability is fragile: in untrained models, more inference-time 'thinking' actually made them worse at spotting flaws, and only after RL did extra compute help. The radar has to be explicitly built, not assumed.

Why is it missing by default? Because agents are passive by design. Optimizing for the next-turn reward structurally strips out initiative: the model is rewarded for producing an answer now, not for pausing to say 'wait, I don't have enough to go on' (see "Why do AI agents fail to take initiative?"). Detecting a gap is itself an act of initiative, which is why the harder engineering question becomes *when* to interrupt without being a nuisance. Conversation analysis offers a surprisingly concrete framework here: 'insert-expansions,' the little clarifying sub-dialogues humans use mid-conversation, formalize the moments an agent should probe the user rather than silently chain tools toward a misread goal (see "When should AI agents ask users instead of just searching?").
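One way to picture the ask-before-acting moment is a pre-flight check that runs before every tool call. The sketch below is a deliberate simplification: it covers only the case where a required argument is absent, whereas insert-expansions as described above also cover scoping and other clarifications. `ToolSpec`, `Decision`, `preflight`, and the flight-booking tool are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToolSpec:
    """Hypothetical description of a tool and the arguments it cannot run without."""
    name: str
    required: List[str]


@dataclass
class Decision:
    action: str  # "ask" (probe the user) or "call" (run the tool)
    question: Optional[str] = None
    missing: List[str] = field(default_factory=list)


def preflight(tool: ToolSpec, slots: Dict[str, Optional[str]]) -> Decision:
    """Decide whether to call the tool or open a clarifying sub-dialogue first."""
    # A slot counts as missing if it is absent, None, or an empty string.
    missing = [name for name in tool.required if not slots.get(name)]
    if missing:
        question = f"Before I run {tool.name}, could you tell me the {', '.join(missing)}?"
        return Decision("ask", question, missing)
    return Decision("call")


# Assumed example tool: the user never said when they want to travel.
book_flight = ToolSpec("book_flight", ["origin", "destination", "date"])
decision = preflight(book_flight, {"origin": "SFO", "destination": "JFK"})
print(decision.action, decision.question)
```

Because the check returns a question instead of guessing a default date, the agent interrupts only when a call would otherwise run on an invented value, which keeps the interruptions rare.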

There's a deeper reframing worth noticing. Detecting missing information isn't only a front-door problem ('do I have enough to start?'); it's a continuous one that runs through the whole task. Checking intermediate reasoning states rather than just final answers raised task success from 32% to 87%, because most failures are process violations that surface mid-trace, not wrong final answers (see "Where do reasoning agents actually fail during long traces?"). In other words, 'what am I missing?' is a question worth asking at every step, not just step zero. And when the agent genuinely can't resolve a gap alone, the human-agent collaboration literature is blunt that there's no clean rule for *when* to defer: instead of solving the timing problem, systems distribute the judgment across multiple touchpoints like co-planning, action guards, and verification (see "When should human-agent systems ask for human help?").
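The difference between checking only the final answer and checking every intermediate state can be shown with a small sketch. It assumes a trace is a list of state-transforming steps and that a policy is a function returning a violation message or `None`; the account-balance scenario and all names are hypothetical, chosen so that the final state looks fine even though a rule was broken along the way.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, float]
Step = Callable[[State], State]
Check = Callable[[State], Optional[str]]  # violation message, or None if the state is fine


def run_with_checks(
    state: State, steps: List[Step], checks: List[Check]
) -> Tuple[State, Optional[Tuple[int, str]]]:
    """Run steps in order, verifying every intermediate state.
    Stops at the first violation and reports (step index, message)."""
    for index, step in enumerate(steps):
        state = step(dict(state))  # copy so a step cannot mutate the caller's state
        for check in checks:
            problem = check(state)
            if problem is not None:
                return state, (index, problem)
    return state, None


def spend(amount: float) -> Step:
    """Build a step that subtracts `amount` from the balance (negative = refund)."""
    def step(state: State) -> State:
        state["balance"] -= amount
        return state
    return step


def no_overdraft(state: State) -> Optional[str]:
    return "balance below zero" if state["balance"] < 0 else None


STEPS = [spend(40), spend(70), spend(-30)]

# Final-answer-only view: no checks during the trace, inspect the end state.
final_state, _ = run_with_checks({"balance": 100.0}, STEPS, [])
print(no_overdraft(final_state))

# Process view: the same trace is stopped at the step that broke the rule.
_, violation = run_with_checks({"balance": 100.0}, STEPS, [no_overdraft])
print(violation)
```

The final balance satisfies the policy, so a final-output scorer would pass this trace. Only the per-step check sees that the second step overdrew the account.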

The thread that ties these together: gap-detection isn't a property of a bigger model, it's something you have to architect in, through training, through externalized verification and memory structures, and through interaction designs that make asking cheap (see "Where does agent reliability actually come from?"). The interesting corner the corpus leaves you with is that an agent's biggest blind spot may be the things it was never shown: trained only on tidy expert demonstrations, it inherits the curator's imagination and never learns to recognize the messy, under-specified situations that fall outside it (see "Can agents learn beyond what their training data shows?").

Sources (8 notes)

Can models identify what information they actually need?

Models achieving high accuracy on complete reasoning tasks drop to 40-50% accuracy identifying what clarifying question to ask when one variable is withheld. Information gathering and problem execution are separable cognitive operations.

Can models learn to ask clarifying questions instead of guessing?

Reinforcement learning training increased proactive critical thinking accuracy from 0.15% to 73.98% on deliberately flawed math problems. Notably, inference-time scaling degraded this ability in untrained models but improved it after RL training, suggesting the capability is learnable but fragile without explicit training.

Why do AI agents fail to take initiative?

Research shows next-turn reward optimization structurally removes initiative from models, but proactive behaviors like critical thinking and clarification-seeking are trainable (0.15% to 73.98% with RL). The core challenge is balancing proactivity with civility to avoid intrusion.

When should AI agents ask users instead of just searching?

Tool-enabled LLMs drift from user intent through silent tool chaining. Conversation analysis reveals insert-expansions—clarifying intent, scoping responses, enhancing appeal—as a formal framework for proactive user consultation that prevents misunderstanding instead of recovering from it.

Where do reasoning agents actually fail during long traces?

Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.


When should human-agent systems ask for human help?

Magentic-UI identifies co-planning, co-tasking, action guards, verification, memory, and multitasking as mechanisms that work around the lack of ground truth for optimal deferral timing. Rather than solving the timing problem directly, these mechanisms distribute decision-making across multiple touchpoints.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Proactive Conversational Agents in the Post-ChatGPT World (2.52 match)
Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration (1.70 match)
DiscussLLM: Teaching Large Language Models When to Speak (1.69 match)
Can Large Language Models Reason and Optimize Under Constraints? (1.69 match)
A Comment On "The Illusion of Thinking": Reframing the Reasoning Cliff as an Agentic Gap (1.65 match)
LIMI: Less is More for Agency (1.61 match)
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering (0.88 match)
Insert-expansions For Tool-enabled Conversational Agents (0.87 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI researcher tasked with re-evaluating whether agents can proactively detect missing information before solving problems. This remains an open question despite recent progress.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat these as perishable snapshots:
• Solving well-specified problems ≠ spotting missing information: accuracy on identifying clarifying questions drops to 40–50%, even for models strong on reasoning (2025).
• RL training moves gap-detection from ~0.15% to 73.98% on flawed math problems, but only if explicitly trained; untrained models worsen with extra inference time (2025).
• 'Insert-expansions' from conversation analysis formalize when agents should probe users mid-task rather than silently chain tools (2023).
• Continuous verification of intermediate reasoning states (not just final answers) raises task success from 32% to 87%, revealing gaps are process violations, not output errors (2025–2026).
• Externalized memory, skills, and interaction protocols distribute gap-detection across multiple touchpoints; no single rule governs *when* to defer (2026).

Anchor papers (verify; mind their dates):
• arXiv:2307.01644 (2023): Insert-expansions for conversational agents.
• arXiv:2503.22674 (2025): QuestBench on asking the right clarifying questions.
• arXiv:2507.23407 (2025): Proactive questioning in human-AI collaboration.
• arXiv:2604.08224 (2026): Externalization in LLM agents (memory, skills, protocols).

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding, judge whether newer models (o3, o4 variants), RL methods (DPO, process reward models), orchestration (multi-agent verifiers, code harnesses as agent scaffolds), or evaluation benchmarks have since relaxed or overturned it. Separate the durable question ('can agents proactively detect gaps?') from perishable limitations ('only RL can do it'). Cite what resolved each.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from late 2025–2026 that challenges the 'gap-detection must be trained' thesis or proposes alternative architectures (e.g., emergent questioning at scale, compositional verification).
(3) Propose 2 research questions that ASSUME the regime may have shifted: e.g., 'Do process reward models learn to surface missing information without RL on clarification tasks?' or 'Can agent code harnesses expose gaps through type checking before execution?'

Cite arXiv IDs; flag anything you cannot ground in a real paper.
