INQUIRING LINE


An AI can ace complex problems yet fail to notice when key information is missing — they turn out to be completely separate skills.

How does proactive critical thinking detect when information is incomplete?

This explores what's actually happening inside a model when it notices a problem is missing information — and why spotting that gap turns out to be a separate skill from solving the problem itself.

The corpus's most useful surprise is that detecting incomplete information is a *distinct cognitive operation* from solving problems, not a byproduct of being good at them. The clearest evidence: models that ace fully specified reasoning tasks drop to 40–50% accuracy once one variable is withheld and they have to identify *which* clarifying question to ask (see "Can models identify what information they actually need?"). Knowing how to execute a solution and knowing what you're missing are separable abilities, which is why "detecting incompleteness" needs its own training rather than riding along on raw capability.
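To make the separation concrete, here is a minimal Python sketch of gap-finding as its own computation, distinct from solving. It is a toy illustration, not the benchmark's method: the rule format, the `missing_inputs` helper, and the variable names are all assumptions introduced here.

```python
from typing import Dict, Set, Tuple

# Hypothetical toy format: each derived variable maps to the inputs its formula needs.
Rules = Dict[str, Tuple[str, ...]]


def missing_inputs(rules: Rules, known: Set[str], target: str) -> Set[str]:
    """Return the base variables that must be asked for before `target` is computable."""
    missing: Set[str] = set()
    seen: Set[str] = set()
    stack = [target]
    while stack:
        var = stack.pop()
        if var in seen or var in known:
            continue  # already visited, or a value was supplied
        seen.add(var)
        if var in rules:
            stack.extend(rules[var])  # derived variable: inspect its inputs
        else:
            missing.add(var)  # base variable with no value: this is the gap
    return missing


rules: Rules = {
    "distance": ("speed", "time"),
    "speed": ("wheel_rpm", "wheel_radius"),
}
gap = missing_inputs(rules, known={"time", "wheel_rpm"}, target="distance")
print(gap)  # {'wheel_radius'}
```

Note that the function never evaluates a formula. It only walks the dependency structure, which is the sense in which asking the right clarifying question is a different operation from executing the solution.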

When that detection is trained directly, the gains are dramatic but fragile. Reinforcement learning lifted proactive critical-thinking accuracy on deliberately flawed math problems from a near-zero 0.15% to 73.98% (see "Can models learn to ask clarifying questions instead of guessing?"). Tellingly, giving an untrained model more inference-time compute *hurt* its ability to notice flaws, while the same scaling helped after RL training: the capability is learnable but doesn't emerge for free from scale or deliberation. This connects to a darker finding: training models to reason step by step can *narrow* this skill, teaching them to grind out an answer to an ill-posed question instead of disengaging and flagging that it can't be solved (see "What critical thinking skills do reasoning models actually lose?"). More reasoning isn't the same as better gap detection.

The more interesting question is *how* a model knows a gap exists at all, and here the corpus points to several different internal signals. One is confidence: confidence variance and overconfidence can be read as diagnostic cues that a model is either cutting exploration short (underthinking) or spinning in place redundantly (overthinking) (see "Can confidence patterns reveal overthinking versus underthinking?"), and calibrated uncertainty can be trained into a model so it learns to *abstain* when it doesn't know enough rather than guess (see "Can models learn to abstain when uncertain about predictions?"). Another, less obvious signal is the model's own partial answer: ITER-RETGEN found that a generated response surfaces information needs the original query missed, so the half-formed answer reveals the hole better than the question did (see "Can a model's partial response guide what to retrieve next?").
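The partial-answer signal can be sketched as a short loop in which each draft is folded back into the next retrieval query. This is a minimal sketch of the general idea, not ITER-RETGEN's actual implementation: `iterative_retrieve_generate`, the substring-matching `toy_retrieve`, the concatenating `toy_generate`, and the two-entry corpus are all hypothetical stand-ins for a real retriever and language model.

```python
from typing import Callable, List

Retriever = Callable[[str], List[str]]
Generator = Callable[[str, List[str]], str]


def iterative_retrieve_generate(
    question: str, retrieve: Retriever, generate: Generator, rounds: int = 2
) -> str:
    """Alternate retrieval and generation, reusing each draft as part of the next query."""
    draft = ""
    for _ in range(rounds):
        # Assumption: the query is simply the question plus the previous draft.
        query = f"{question} {draft}".strip()
        passages = retrieve(query)
        draft = generate(question, passages)
    return draft


# Toy corpus keyed by the entity each passage is about (illustrative data only).
CORPUS = {
    "Film X": "Film X was directed by Ada Quill.",
    "Ada Quill": "Ada Quill was born in Lyon.",
}


def toy_retrieve(query: str) -> List[str]:
    """Return passages whose key entity is mentioned in the query."""
    q = query.lower()
    return [text for entity, text in CORPUS.items() if entity.lower() in q]


def toy_generate(question: str, passages: List[str]) -> str:
    """Stand-in for a model call: the 'draft' just restates the retrieved passages."""
    return " ".join(passages)


question = "Where was the director of Film X born?"
one_round = iterative_retrieve_generate(question, toy_retrieve, toy_generate, rounds=1)
two_rounds = iterative_retrieve_generate(question, toy_retrieve, toy_generate, rounds=2)
print(one_round)
print(two_rounds)
```

In this toy setup the question alone never mentions the director's name, so the first round cannot reach the birthplace passage. Only the first draft, which names the director, exposes that second information need.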

There's also a temporal dimension to detection. Checking *intermediate* states during a long reasoning trace catches failures that scoring the final answer misses entirely. Adding this verification raised task success from 32% to 87%, because most failures are process violations, not wrong conclusions (see "Where do reasoning agents actually fail during long traces?"). And sampling completions from the model's intermediate reasoning points, rather than trusting its final commitment, yields mode answers up to 13% more accurate, because it surfaces alternatives before premature closure hides them (see "Can intermediate reasoning points yield better answers than final ones?"). The thread running through both: incompleteness is most detectable mid-stream, before the model has talked itself into a tidy but unsupported answer.
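The difference between scoring the final output and checking intermediate states can be shown with a small sketch. Everything here is an assumption made for illustration: the trace format, the `first_violation` helper, and the refund policy are hypothetical and do not come from the cited work.

```python
from typing import Callable, Dict, List, Optional, Tuple

Step = Dict[str, str]
# A check inspects the trace so far and returns a violation message, or None if fine.
Check = Callable[[List[Step]], Optional[str]]


def first_violation(trace: List[Step], checks: List[Check]) -> Optional[Tuple[int, str]]:
    """Run every check on every prefix of the trace; report the earliest violation."""
    for end in range(1, len(trace) + 1):
        prefix = trace[:end]
        for check in checks:
            message = check(prefix)
            if message is not None:
                return end - 1, message
    return None


def refund_requires_verification(prefix: List[Step]) -> Optional[str]:
    """Hypothetical policy: identity must be verified before any refund is issued."""
    if prefix[-1]["action"] == "issue_refund":
        if not any(step["action"] == "verify_identity" for step in prefix[:-1]):
            return "refund issued before identity was verified"
    return None


trace: List[Step] = [
    {"action": "lookup_order"},
    {"action": "issue_refund"},
    {"action": "verify_identity"},
    {"action": "done"},
]

final_only_ok = trace[-1]["action"] == "done"  # final-output scoring sees success
violation = first_violation(trace, [refund_requires_verification])
print(final_only_ok, violation)
```

The final state looks clean, so an outcome-only score passes the trace, while the prefix scan flags the out-of-order step at index 1. That is the kind of process violation the paragraph above describes.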

The takeaway: "detecting missing information" isn't one mechanism but a family of them. It is separable from problem-solving, sometimes narrowed by reasoning training, and best read from confidence signals, abstention behavior, partial answers, and intermediate states rather than from the polished final output. The corpus is light on a unified theory of *how* these signals combine, but it's emphatic that the skill has to be cultivated deliberately and that fluent reasoning can actively mask it.

Sources 8 notes

Can models identify what information they actually need?

Models achieving high accuracy on complete reasoning tasks drop to 40-50% accuracy identifying what clarifying question to ask when one variable is withheld. Information gathering and problem execution are separable cognitive operations.

Can models learn to ask clarifying questions instead of guessing?

Reinforcement learning training increased proactive critical thinking accuracy from 0.15% to 73.98% on deliberately flawed math problems. Notably, inference-time scaling degraded this ability in untrained models but improved it after RL training, suggesting the capability is learnable but fragile without explicit training.

What critical thinking skills do reasoning models actually lose?

Models trained for step-by-step reasoning excel at in-distribution logical tasks but lose critical abilities: they overthink ill-posed questions instead of disengaging, and reason their way to wrong rules on inductive tasks. This cognitive narrowing is partly reversible through targeted RL training.

Can confidence patterns reveal overthinking versus underthinking?

ReBalance uses confidence variance and overconfidence as diagnostic signals to apply training-free steering vectors that reduce overthinking redundancy while promoting exploration during underthinking, improving accuracy across models from 0.5B to 32B parameters.

Can models learn to abstain when uncertain about predictions?

Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.


Can a model's partial response guide what to retrieve next?

ITER-RETGEN shows that iteratively using generated responses as retrieval queries substantially improves performance on multi-hop reasoning and fact verification. Generation acts as both answer producer and information-need clarifier, surfacing implicit gaps that the original query missed.

Where do reasoning agents actually fail during long traces?

Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.

Can intermediate reasoning points yield better answers than final ones?

Segmenting reasoning traces into subthoughts and prompting completions from each intermediate point yields mode answers up to 13% more accurate than final answers. This works because it mines alternative paths before early commitment narrows the solution space.
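A minimal sketch of the mode-over-intermediate-points idea follows. It is not the paper's pipeline: `mode_answer_from_prefixes` is a name introduced here, and `toy_complete` is a hypothetical stand-in for prompting a model to finish the reasoning from a given prefix, with hard-coded answers for illustration.

```python
from collections import Counter
from typing import Callable, List, Tuple


def mode_answer_from_prefixes(
    subthoughts: List[str], complete: Callable[[List[str]], str]
) -> Tuple[str, List[str]]:
    """Ask for a completion from each intermediate point and return the most common answer."""
    answers = [complete(subthoughts[:end]) for end in range(1, len(subthoughts) + 1)]
    mode, _count = Counter(answers).most_common(1)[0]
    return mode, answers


subthoughts = ["set up equation", "isolate x", "substitute", "second-guess and revise"]


def toy_complete(prefix: List[str]) -> str:
    """Illustrative stub: the trace drifts to a different answer only at the last step."""
    return "15" if len(prefix) == len(subthoughts) else "12"


mode, answers = mode_answer_from_prefixes(subthoughts, toy_complete)
print(answers)      # ['12', '12', '12', '15']
print(mode)         # 12
print(answers[-1])  # 15, the answer a final-only reading would report
```

The last element of `answers` corresponds to the trace's own final answer, so comparing it with `mode` shows when the final commitment disagrees with what most intermediate points would have produced.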

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI researcher tasked with re-evaluating whether a curated library's claims about proactive critical thinking and incompleteness detection remain valid or have been superseded.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; key constraints reported:

• Detecting incomplete information is a *distinct cognitive operation* separable from problem-solving ability; models strong at fully-specified reasoning crash to 40–50% accuracy when one variable is withheld and they must identify what to ask (2025).
• Reinforcement learning on proactive critical thinking lifted accuracy from 0.15% to 74% on deliberately broken math problems; the skill is learnable but does *not* emerge for free from scale or test-time computation (2025–2026).
• Step-by-step reasoning training can *narrow* incompleteness detection, teaching models to grind out answers to ill-posed questions instead of flagging gaps; more reasoning time without training can actively hurt detection (2025).
• Four internal signals support detection: calibrated confidence/uncertainty patterns, abstention behavior, partial-answer generation revealing unstated information needs, and intermediate reasoning states captured before premature closure (2024–2025).
• Monitoring intermediate reasoning states (not final answers) raises task success from 32% to 87% because most failures are process violations; mining alternatives before closure yields higher accuracy (2025–2026).

Anchor papers (verify; mind their dates):
- 2025-03 arXiv:2503.22674 QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks?
- 2025-04 arXiv:2504.20708 Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think
- 2026-03 arXiv:2603.12372 Efficient Reasoning with Balanced Thinking
- 2026-05 arXiv:2605.19376 Generative Recursive Reasoning

Your task:

(1) RE-TEST EACH CONSTRAINT. For the 40–50% gap-detection ceiling and the 0.15%→74% RL uplift: have scaling (post-2026), new inference harnesses (e.g., multi-agent debate, dynamic routing, retrieval-grounded backtracking), or fresh architectures (e.g., implicit uncertainty modeling) since relaxed these limits? Separately: does the finding that step-by-step training *harms* detection still hold, or have newer RL curricula (e.g., outcome-weighted RL, process reward models) decoupled reasoning depth from detection brittleness? For each, state plainly whether the constraint remains binding.

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Look especially for papers claiming that detection emerges *implicitly* from reasoning scaling, or that unified (not separable) training schemes collapse the distinction between problem-solving and gap-spotting.

(3) Propose 2 research questions that ASSUME the regime may have moved:
   - Q1: If intermediate-state monitoring now achieves >90% detection, what is the *failure mode* on edge cases (adversarial incompleteness, nested gaps, domain shift)?
   - Q2: Can a single unified objective (e.g., expected information gain) train both problem-solving and incompleteness detection without the trade-off observed in step-by-step regimes?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
