INQUIRING LINE

Training AI to ask better questions works — but only if you spell out what 'better' actually means.

Can attribute-specific preference optimization improve question quality in information-seeking?

This explores whether breaking 'question quality' into named sub-attributes — and optimizing each one separately — produces better questions in information-seeking, versus training against one lumped quality score. The corpus has a direct, encouraging answer and several flanking results that explain *why* it works.

The centerpiece is the ALFA framework (Can models learn to ask genuinely useful clarifying questions?), which decomposes question quality into theory-grounded attributes such as clarity, relevance, and specificity, then trains on 80K attribute-specific preference pairs. The headline result: attribute-specific optimization beats single-score training, and the gap is widest in clinical reasoning, where asking the *right* clarifying question changes the downstream decision. So the answer is yes, but the interesting part is the mechanism. A single quality score is a blurry target: a good question and a mediocre one can land on similar scalars for different reasons. Factoring quality into attributes gives the optimizer cleaner, less-entangled gradients to follow.
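To make the "blurry target" point concrete, here is a minimal sketch of how per-attribute comparisons yield training pairs that a lumped score hides. It is an illustration only, not ALFA's actual data pipeline: the candidate names, the scores, the `margin` threshold, and the averaging rule in `lumped_score` are all assumptions made up for this example.

```python
from itertools import combinations

# Hypothetical per-attribute scores in [0, 1] for three candidate questions.
# Attribute names follow the text; the numbers are invented for illustration.
candidates = {
    "q_vague":    {"clarity": 0.9, "relevance": 0.6, "specificity": 0.3},
    "q_offtopic": {"clarity": 0.6, "relevance": 0.3, "specificity": 0.9},
    "q_good":     {"clarity": 0.8, "relevance": 0.9, "specificity": 0.7},
}

def lumped_score(scores):
    """Collapse all attributes into one scalar (assumed here: a plain mean)."""
    return sum(scores.values()) / len(scores)

def single_score_pairs(cands, margin=0.05):
    """Build (chosen, rejected) pairs from the lumped scalar only."""
    pairs = []
    for a, b in combinations(cands, 2):
        diff = lumped_score(cands[a]) - lumped_score(cands[b])
        if abs(diff) > margin:  # near-ties carry no usable preference
            pairs.append((a, b) if diff > 0 else (b, a))
    return pairs

def attribute_pairs(cands, margin=0.05):
    """Build (attribute, chosen, rejected) pairs, one comparison per attribute."""
    pairs = []
    for a, b in combinations(cands, 2):
        for attr in cands[a]:
            diff = cands[a][attr] - cands[b][attr]
            if abs(diff) > margin:
                chosen, rejected = (a, b) if diff > 0 else (b, a)
                pairs.append((attr, chosen, rejected))
    return pairs

lumped = single_score_pairs(candidates)
factored = attribute_pairs(candidates)
```

In this toy data, `q_vague` and `q_offtopic` average to the same scalar, so the lumped view produces no pair between them, while the per-attribute view still records that one is clearer and more relevant and the other more specific.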

The broader idea, that how you structure the training signal determines what the optimizer can learn, is not unique to questions; the corpus keeps rediscovering it. Conversational recommenders show the flip side: fusing what-to-ask, what-to-recommend, and when into one policy beats optimizing them in isolation, precisely because separated components can't pass gradient signal to each other (Can unified policy learning improve conversational recommender systems?). Personalization research shows the factored shape from the reward side: representing a user's preferences as a linear combination of base reward functions lets ten adaptive questions pin down that user's coefficients (Can user preferences be learned from just ten questions?). The recurring lesson is that the *structure and granularity* of your training signal matter as much as its volume.
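The reward-factorization idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not the PReF implementation: the three base reward dimensions, the response names, and both users' coefficients are hypothetical.

```python
def user_reward(coeffs, base_rewards):
    """A user's reward as a linear combination of base reward values."""
    return sum(c * r for c, r in zip(coeffs, base_rewards))

# Hypothetical base reward dimensions: [conciseness, technical depth, warmth].
# Each response is scored once per base reward function.
responses = {
    "short_answer": [0.9, 0.2, 0.4],
    "deep_dive":    [0.2, 0.9, 0.3],
}

# Hypothetical per-user coefficients; in the factorized setting only these
# few numbers differ between users, while the base rewards are shared.
user_a = [0.7, 0.1, 0.2]
user_b = [0.1, 0.8, 0.1]

def preferred(coeffs):
    """Return the response with the highest personalized reward."""
    return max(responses, key=lambda name: user_reward(coeffs, responses[name]))

def consistent(coeffs, chosen, rejected):
    """Check whether a coefficient vector agrees with one pairwise answer.

    An answer 'chosen over rejected' constrains coeffs to score chosen higher,
    which is how each question narrows down the coefficients.
    """
    return user_reward(coeffs, responses[chosen]) > user_reward(coeffs, responses[rejected])

choice_a = preferred(user_a)
choice_b = preferred(user_b)
```

Because only the coefficient vector is user-specific, each pairwise answer rules out a region of coefficient space, which is why a small number of well-chosen questions can be enough.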

There's also a prior question worth surfacing: can a model learn to ask at all, rather than guess? It can. Reinforcement learning lifted proactive 'I'm missing information, let me ask' behavior from near-zero (0.15%) to ~74% on deliberately under-specified problems, and the skill is fragile without explicit training (Can models learn to ask clarifying questions instead of guessing?). Attribute optimization is the natural next layer on top: first teach the model *to* ask, then teach it to ask *well* along specific dimensions. Whether attributes are even the right axes depends on the task: non-factoid work shows that different question types (comparison, debate, experience) demand genuinely different handling, so 'quality' isn't one thing across the board (Does question type determine the right retrieval strategy?).

One caution the corpus hands you for free: be careful what you measure. Supervised fine-tuning can raise final-answer accuracy while quietly degrading the quality of the reasoning steps that produced it, because standard metrics only score the endpoint (Does supervised fine-tuning improve reasoning or just answers?). The deeper argument for attribute-specific optimization is that it avoids that trap: by scoring clarity, specificity, and relevance directly, it measures the qualities you actually care about instead of trusting a single number to stand in for all of them.

Sources (6 notes)

Can models learn to ask genuinely useful clarifying questions?

The ALFA framework breaks down question quality into theory-grounded attributes (clarity, relevance, specificity) and trains models on 80K attribute-specific preference pairs. Attribute-specific optimization outperforms single-score training, especially in clinical reasoning where asking the right clarifying question directly impacts decision quality.

Can unified policy learning improve conversational recommender systems?

Research shows that formulating attribute-asking, item-recommending, and timing decisions as a single graph-based RL policy achieves better joint optimization than isolated components. Separation prevents gradient signals from informing one another and fails to optimize conversation trajectory holistically.

Can user preferences be learned from just ten questions?

PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. Users can be personalized via inference-time reward alignment without weight modification.

Can models learn to ask clarifying questions instead of guessing?

Reinforcement learning training increased proactive critical thinking accuracy from 0.15% to 73.98% on deliberately flawed math problems. Notably, inference-time scaling degraded this ability in untrained models but improved it after RL training, suggesting the capability is learnable but fragile without explicit training.

Does question type determine the right retrieval strategy?

Research shows non-factoid questions split into five types, each requiring different retrieval and aggregation methods. Evidence-based questions suit standard RAG, while debate and comparison need aspect-specific retrieval, and experience/reason questions need decomposition or filtering strategies.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Aligning LLMs to Ask Good Questions: A Case Study in Clinical Reasoning (1.75 match, arXiv)
Learning to Learn from Language Feedback with Social Meta-Learning (1.67 match, arXiv)
Researchy Questions: A Dataset of Multi-Perspective, Decompositional Questions for LLM Web Agents (1.59 match, arXiv)
Language Model Personalization via Reward Factorization (0.89 match, arXiv)
Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration (0.89 match, arXiv)
Unified Conversational Recommendation Policy Learning via Graph-based Reinforcement Learning (0.89 match, arXiv)
Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces! (0.88 match, arXiv)
Learning to Reason for Factuality (0.87 match, arXiv)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether attribute-specific preference optimization for question quality remains a live frontier or has been subsumed by newer LLM capabilities, training methods, or evaluation paradigms.

What a curated library found — and when (dated claims, not current truth):
Findings span 2021–2025. A library centered on ALFA (arXiv:2502.14860, 2025-02) discovered:

• Decomposing question quality into named attributes (clarity, relevance, specificity) and optimizing each separately beats single-score training; the gap is widest in clinical reasoning where question choice affects downstream decisions (2025-02).
• The proposed mechanism: factorized training signals reduce gradient entanglement. Conversational recommenders show the complementary pattern: unifying 'what-to-ask', 'what-to-recommend', and 'when' into a single policy beats optimizing them as isolated components (2021-05, 2025-02).
• Reward factorization can represent user preferences as linear combinations of base functions, enabling adaptive personalization (2025-03, arXiv:2503.06358).
• Proactive question-asking behavior (identifying missing information) can be raised from near-zero to ~74% on under-specified tasks via RL, but skill is fragile without explicit training (2025-07, arXiv:2507.23407).
• Non-factoid question types (comparison, debate, experience) require type-specific handling; a single 'quality' axis does not transfer across task domains (2025-03, arXiv:2503.15879).

Anchor papers (verify; mind their dates):
• arXiv:2502.14860 (2025-02) — ALFA: Aligning LLMs to Ask Good Questions via attribute decomposition in clinical reasoning.
• arXiv:2507.23407 (2025-07) — Proactive Critical Thinking: fostering human-AI collaboration through guided questioning.
• arXiv:2503.06358 (2025-03) — Reward Factorization: personalization via linear combinations of base reward functions.
• arXiv:2503.15879 (2025-03) — Typed-RAG: type-aware decomposition for non-factoid QA.

Your task:
(1) RE-TEST EACH CONSTRAINT. For attribute decomposition (esp. ALFA), does scaling, in-context learning, or agentic orchestration now obviate the need for explicit multi-attribute training? Test whether newer models (o1, Claude 3.5+) achieve high question quality under a single reward signal or if factorization still provides measurable gains. Probe whether the 'fragile proactive asking' result (74%) still holds or whether chain-of-thought or retrieval-augmented prompting has relaxed that ceiling. Distinguish the durable question — *what makes a question good in context?* — from the perishable claim that single-score training is insufficient.

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Look for results showing: (a) multi-attribute decomposition is redundant given modern LLM priors; (b) agentic search or reasoning agents auto-discover good question-asking without explicit supervision; (c) end-to-end finetuning on curated question datasets outpaces attribute-specific methods; or (d) unified reward models trained on aggregated preference data match or beat ALFA.

(3) Propose two research questions that ASSUME the regime may have moved: (a) Does attribute decomposition remain necessary when questions are generated within a multi-step agentic loop (search, reasoning, synthesis)? (b) Can a single, learned meta-reward that *internally* decomposes attributes on the fly match explicit factorization without hand-labeling attribute pairs?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
