INQUIRING LINE

What multi-turn reward structures would encourage active intent discovery?

This explores what kinds of reward signals — spread across a whole conversation rather than scored turn-by-turn — would push an AI to actively dig for what a user actually wants, instead of passively answering whatever was literally asked.


The corpus has a clear diagnosis before it has a fix: today's models are trained to be helpful *right now*. Standard RLHF optimizes the next reply, which quietly punishes the model for asking a clarifying question or holding back, because a clarifying question scores worse than a confident answer in the moment, even when the answer is wrong. CollabLLM's response is to score the *long-term* value of an interaction: reward the model for moves that pay off several turns later, and active intent discovery becomes the rewarded behavior rather than the penalized one (see "Why do language models respond passively instead of asking clarifying questions?"). This isn't a niche quirk; the passivity is baked deep. Conversational agents are described as *structurally* reactive: they can't initiate topics or steer toward a goal because their objective only ever rewards responding to a query, never originating one (see "Why can't conversational AI agents take the initiative?").
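To make the contrast between next-reply scoring and long-term scoring concrete, here is a minimal toy sketch. It is not CollabLLM's implementation: the move names, the single-turn scores, the 50% guess rate, and the per-turn cost are all assumed numbers chosen for illustration, and `simulate_rollout` is a hypothetical stand-in for a real user simulator.

```python
import random
from statistics import mean

# Hypothetical per-turn scores: in isolation, a confident answer looks better
# than a clarifying question.
SINGLE_TURN_SCORE = {"answer_now": 0.6, "ask_clarifying": 0.3}

def simulate_rollout(first_move, rng, p_guess_right=0.5):
    """Toy user simulator. Returns (task_success, assistant_turns_used)."""
    if first_move == "ask_clarifying":
        # The simulated user states the hidden intent, so the follow-up succeeds.
        return 1.0, 2
    # Answering immediately succeeds only when the guess matches the hidden intent.
    return (1.0 if rng.random() < p_guess_right else 0.0), 1

def multiturn_reward(first_move, n_rollouts=500, turn_cost=0.1, seed=0):
    """Average conversation-level return over sampled continuations."""
    rng = random.Random(seed)
    returns = []
    for _ in range(n_rollouts):
        success, turns = simulate_rollout(first_move, rng)
        # Charge a small cost for every extra turn so asking is not free.
        returns.append(success - turn_cost * (turns - 1))
    return mean(returns)

for move in SINGLE_TURN_SCORE:
    print(move, SINGLE_TURN_SCORE[move], round(multiturn_reward(move), 2))
```

Under these assumed numbers the ranking flips: the clarifying question loses on the single-turn score but wins once the return is averaged over whole simulated conversations. The turn cost matters as a design choice, because without it the model would be rewarded for asking indefinitely.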

The most striking lateral evidence for *why* this matters comes from a paper that isn't about rewards at all: in simulations, proactively volunteering relevant information cuts the number of dialogue turns by up to 60% in medium-complexity domains, yet that behavior is almost entirely missing from the datasets and benchmarks we train on (see "Could proactive dialogue make conversations dramatically more efficient?"). So the prize for getting multi-turn rewards right is concrete: shorter, less frustrating conversations. The problem is that 'did this turn discover intent?' is exactly the kind of fuzzy, subjective signal that holistic reward models score badly. Two threads in the corpus suggest how to make that signal trainable. One is checklist decomposition: break a vague goal ('follow the instruction well') into verifiable sub-criteria, which both improves the signal and stops the model overfitting to superficial cues (see "Can breaking down instructions into checklists improve AI reward signals?"). Applied to dialogue, 'did the model surface the unstated constraint, confirm the ambiguous term, check the user's real objective?' becomes a scorable checklist rather than a vibe. The other is letting the reward model *reason before it scores*: chain-of-thought evaluation raises the ceiling on what a reward model can judge, which is precisely what you need to assess something as subtle as whether a question advanced understanding (see "Can reward models benefit from reasoning before scoring?").
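The checklist idea can be sketched as a weighted fraction of satisfied criteria. This is an illustrative sketch only, not the RLCF or RaR method: the `ChecklistItem` type, the weights, and especially the keyword-based checks are assumptions. In practice each check would more plausibly be a judge model or a verifier than a string match.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ChecklistItem:
    description: str
    weight: float
    check: Callable[[List[str]], bool]  # receives the assistant's turns

def checklist_reward(assistant_turns: List[str], items: List[ChecklistItem]) -> float:
    """Weighted fraction of checklist items the conversation satisfies."""
    total = sum(item.weight for item in items)
    earned = sum(item.weight for item in items if item.check(assistant_turns))
    return earned / total if total else 0.0

# Hypothetical intent-discovery checklist with placeholder keyword checks.
INTENT_CHECKLIST = [
    ChecklistItem("asked at least one question", 1.0,
                  lambda turns: any("?" in t for t in turns)),
    ChecklistItem("surfaced the unstated budget constraint", 2.0,
                  lambda turns: any("budget" in t.lower() for t in turns)),
    ChecklistItem("confirmed the user's real objective", 1.0,
                  lambda turns: any("to confirm" in t.lower() for t in turns)),
]

active = ["What is your budget for the laptop?",
          "To confirm, you mainly need it for video editing?"]
passive = ["Here is a popular laptop."]

print(checklist_reward(active, INTENT_CHECKLIST))
print(checklist_reward(passive, INTENT_CHECKLIST))
```

Because each item is scored separately, a low total can be traced back to the specific criterion that was missed, which a single holistic score does not allow.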

There's also a credit-assignment problem hiding here. If discovering intent only pays off at the end of a long conversation, how do you reward the individual clarifying move that made it possible? The process-supervision literature has been wrestling with exactly this for reasoning chains, and its methods transfer cleanly. You can derive dense, per-step rewards from the *structure* of a trajectory rather than hand-labeling each turn (see "Can trajectory structure replace hand-annotated process rewards?"), or compute each step's contribution information-theoretically, measuring how much a given turn actually moved the model toward the right outcome (see "Can we reward reasoning steps without human annotation?"). Swap 'reasoning step' for 'conversational turn' and you have a recipe for rewarding the specific question that unlocked the user's real intent, not just the conversation that happened to end well.
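One simple way to picture per-turn credit is to track how much each turn raises the model's belief in the user's true intent. The sketch below is a simplified proxy, not L2T's PAC-Bayes and Fisher-information method and not any of the trajectory-structure methods. It assumes the true intent is known at training time (for example from a simulated user) and that a belief distribution over intents is available after every turn. The function name and the example beliefs are hypothetical.

```python
import math
from typing import Dict, List

def per_turn_credit(belief_trajectory: List[Dict[str, float]], true_intent: str) -> List[float]:
    """Credit each turn with the change in log-probability of the true intent.

    belief_trajectory[0] is the belief before the conversation starts and
    belief_trajectory[k] is the belief after turn k. Probabilities assigned
    to the true intent must be strictly positive.
    """
    log_probs = [math.log(belief[true_intent]) for belief in belief_trajectory]
    return [after - before for before, after in zip(log_probs, log_probs[1:])]

# Hypothetical beliefs over the user's intent across a three-turn conversation.
beliefs = [
    {"vegetarian": 0.25, "other": 0.75},  # before any turn
    {"vegetarian": 0.25, "other": 0.75},  # turn 1: small talk, nothing learned
    {"vegetarian": 0.80, "other": 0.20},  # turn 2: the clarifying question
    {"vegetarian": 0.90, "other": 0.10},  # turn 3: minor confirmation
]

credits = per_turn_credit(beliefs, "vegetarian")
print([round(c, 3) for c in credits])
```

The per-turn credits sum to the total log-probability gain over the conversation, so the dense signal stays consistent with the outcome-level one while concentrating reward on the turn that actually resolved the ambiguity.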

The corpus also gestures at a deeper reframe: maybe the best multi-turn reward is one that learns intent so efficiently it barely needs to ask. PReF shows a user's preferences can be pinned down with about ten *adaptively chosen* questions, each one selected to reduce uncertainty the most, turning intent discovery into an active-learning loop where the reward is information gained per question (see "Can user preferences be learned from just ten questions?"). Push further and the asking disappears entirely: agents can infer preferences by *watching*, either reading cognitive state from hesitation, typing, and gaze (see "Can AI systems read cognitive state from interaction patterns alone?") or accumulating an entity-centric memory of a person over time and acting on it without a single explicit question (see "Can agents learn preferences by watching rather than asking?"). That's the thing you didn't know you wanted to know: the frontier here may not be 'reward the model for asking better questions' but 'reward it for needing to ask fewer'. The same research also notes that the machinery for reading intent silently is also the machinery for manipulative profiling, so the reward you choose is quietly an ethical choice too.
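The "information gained per question" framing can be illustrated with a small expected-information-gain calculation. This is a discrete stand-in, not PReF's actual procedure, which reduces uncertainty over reward-function coefficients. The sketch assumes a finite set of candidate user profiles, a prior over them, and questions whose answers are fully determined by the profile. All names and profiles are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, Iterable

def entropy(probs: Iterable[float]) -> float:
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def expected_info_gain(prior: Dict[str, float], answers: Dict[str, Hashable]) -> float:
    """Expected entropy reduction from asking a question.

    prior maps each hypothesis to its probability; answers maps each
    hypothesis to the answer a user of that type would give.
    """
    by_answer = defaultdict(dict)
    for hypothesis, p in prior.items():
        by_answer[answers[hypothesis]][hypothesis] = p
    expected_posterior_entropy = 0.0
    for group in by_answer.values():
        mass = sum(group.values())  # probability of observing this answer
        expected_posterior_entropy += mass * entropy(p / mass for p in group.values())
    return entropy(prior.values()) - expected_posterior_entropy

def pick_question(prior, questions):
    """Choose the question with the highest expected information gain."""
    return max(questions, key=lambda q: expected_info_gain(prior, questions[q]))

# Four hypothetical user profiles, equally likely a priori.
prior = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
questions = {
    "splits_evenly":   {"A": "yes", "B": "yes", "C": "no",  "D": "no"},
    "splits_unevenly": {"A": "yes", "B": "no",  "C": "no",  "D": "no"},
    "uninformative":   {"A": "yes", "B": "yes", "C": "yes", "D": "yes"},
}

print(pick_question(prior, questions))
```

Using the gain itself as the per-question reward penalizes questions whose answer is already predictable, which is one way to encode 'needing to ask fewer' rather than simply 'asking more'.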


Sources (10 notes)

Why do language models respond passively instead of asking clarifying questions?

CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.

Why can't conversational AI agents take the initiative?

Research shows LLMs including ChatGPT cannot initiate topics, plan strategically, or lead conversations because their training optimizes for responding to queries, not creating dialogue from agent goals. This passivity is reinforced by alignment objectives and masked by fluent-sounding outputs.

Could proactive dialogue make conversations dramatically more efficient?

Simulations show proactivity—providing relevant information without being asked—cuts dialogue turns by 60% in medium-complexity domains. This behavior mirrors human conversation and Grice's maxims but is almost entirely absent from AI datasets and research benchmarks.

Can breaking down instructions into checklists improve AI reward signals?

RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.

Can trajectory structure replace hand-annotated process rewards?

Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.

Can we reward reasoning steps without human annotation?

L2T uses PAC-Bayes bounds and Fisher information to compute per-episode rewards measuring each step's contribution to correctness. This annotation-free approach matches dense feedback quality while eliminating the cost of outcome-only methods that produce 2x excess tokens.

Can user preferences be learned from just ten questions?

PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. Users can be personalized via inference-time reward alignment without weight modification.

Can AI systems read cognitive state from interaction patterns alone?

Research shows AI systems can instrument multimodal behavioral signals (gaze, hesitation, speed) to read cognitive state during interaction, preserving flow by avoiding disruptive explicit probes. However, the same substrate enables both helpful timing and manipulative profiling.

Can agents learn preferences by watching rather than asking?

M3-Agent demonstrates that separating episodic events from semantic knowledge in an entity-centric graph, combined with parallel memorization and control processes, allows agents to infer and act on user preferences without asking. This architecture mirrors human cognitive systems that bind disparate information about individuals across sensory modalities.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reinforcement learning researcher evaluating whether multi-turn reward structures that encourage active intent discovery remain constrained or have been relaxed by newer models, training methods, or evaluation machinery.

What a curated library found — and when (dated claims, not current truth):
Findings span 2024-12 to 2026-02. The library identified four core constraints:
• Today's RLHF optimizes single-turn replies, structurally punishing clarifying questions because confident (wrong) answers score higher in the moment than admissions of ambiguity (~2024-12).
• Conversational agents are fundamentally reactive: reward signals only ever reinforce responding to queries, never originating proactive moves, even though proactive dialogue cuts turns by ~60% (~2024-12, ~2025-03).
• Intent discovery is too fuzzy for standard reward models to score reliably; checklist decomposition and chain-of-thought evaluation are proposed as signal-improvement paths (~2025-07).
• Credit assignment within long conversations remains unsolved: clarifying moves only pay off at dialogue's end, making per-turn rewards hard to derive without dense process-level labels (~2025-07, ~2026-02).

Anchor papers (verify; mind their dates):
• arXiv:2501.00383 (2024-12): Proactive Conversational Agents with Inner Thoughts
• arXiv:2503.06358 (2025-03): Language Model Personalization via Reward Factorization
• arXiv:2507.18624 (2025-07): Checklists Are Better Than Reward Models For Aligning Language Models
• arXiv:2602.07338 (2026-02): Intent Mismatch Causes LLMs to Get Lost in Multi-Turn Conversation

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, assess whether newer post-training methods (especially multi-round RL and self-feedback loops), longer context windows, or agentic orchestration (memory caches, tool-use hierarchies) have since relaxed the single-turn optimization bias or the reactivity wall. Where a constraint still holds (e.g., reward-signal fuzziness persists), cite what attempted to solve it and why it appears insufficient.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months, especially any paper showing multi-turn rewards *do* scale, or where proactivity emerges without explicit intent-discovery rewards.
(3) Propose 2 research questions that assume the frontier has moved: one on whether intent discovery can emerge without a dedicated reward (learned as a side effect of long-horizon optimization), another on whether the ethical tension between efficient intent-reading and manipulative profiling has shifted with new steering techniques.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
