Why are task-oriented dialogue datasets systematically underrepresenting human proactive behavior?
This explores why the datasets we use to train and benchmark task-oriented dialogue systems rarely contain examples of a human volunteering useful information before being asked — and what in the field's framing, not just the data collection, produces that gap.
The corpus suggests the absence is baked in at three levels: how the field defined the task, how the data is collected, and how the resulting models are trained. The starting point is stark. Simulations show that proactivity (giving relevant information without being asked) can cut conversation turns by up to 60% in medium-complexity domains, mirroring how humans actually talk and what Grice's maxims predict, yet it is almost entirely missing from AI datasets and benchmarks (see "Could proactive dialogue make conversations dramatically more efficient?"). So the gap does not exist because proactivity is unimportant; it exists because the data-generating process never asked for it.
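To make the turn-count argument concrete, here is a toy sketch of a slot-filling dialogue. It is not the simulation the note refers to and does not reproduce the 60% figure. The function name `count_turns`, the one-question-per-exchange policy, and the idea that a proactive user volunteers a fixed number of extra slots per answer are all assumptions made for illustration.

```python
def count_turns(num_slots: int, extra_per_answer: int) -> int:
    """Count dialogue turns needed to fill every slot (hypothetical toy model).

    Each exchange costs two turns: one system question and one user answer.
    The user always answers the asked slot and additionally volunteers
    `extra_per_answer` unasked slots (0 means a purely reactive user).
    """
    if num_slots < 0 or extra_per_answer < 0:
        raise ValueError("num_slots and extra_per_answer must be non-negative")
    remaining = num_slots
    turns = 0
    while remaining > 0:
        turns += 2                        # system question + user answer
        remaining -= 1 + extra_per_answer  # asked slot plus volunteered slots
    return turns


# Assumed scenario: a task with six required slots.
reactive_turns = count_turns(num_slots=6, extra_per_answer=0)
proactive_turns = count_turns(num_slots=6, extra_per_answer=1)
reduction = 1 - proactive_turns / reactive_turns
print(reactive_turns, proactive_turns, reduction)
```

Under these assumptions, a user who volunteers one extra slot per answer halves the length of the dialogue. The real effect would depend on the domain and on how often volunteered information is actually relevant.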
Part of the answer is definitional. For years the dominant frame treated understanding a turn as classifying the user's intent, slotting each utterance into a predefined label (see "Can command generation replace intent classification in dialogue systems?"). An intent-classification schema has a natural shape: the user requests, the system responds. There is no slot for a system (or an annotated human agent) that decides, on its own initiative, to surface something nobody requested. The annotation scheme has nowhere to put proactive moves, so they drop out of the data at collection time. Reframing understanding as pragmatics, meaning what a speaker is trying to do in context, is what makes proactive behavior visible as something to capture.
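The schema point can be sketched as a data structure. The example below is a hypothetical illustration, not the annotation format of any particular dataset: `Intent`, `Initiative`, `ReactiveTurn`, `PragmaticTurn`, and `proactive_share` are invented names. It contrasts a turn record that stores only an intent label with one that also records who took the initiative.

```python
from dataclasses import dataclass
from enum import Enum


class Intent(Enum):
    """A small, made-up intent inventory of the request/respond kind."""
    REQUEST = "request"
    INFORM = "inform"
    CONFIRM = "confirm"


class Initiative(Enum):
    """Hypothetical extra dimension: was the content asked for or volunteered?"""
    RESPONSIVE = "responsive"
    PROACTIVE = "proactive"


@dataclass
class ReactiveTurn:
    """Intent-only record: a volunteered fact can only be labeled INFORM,
    which makes it indistinguishable from an answer to a question."""
    speaker: str
    text: str
    intent: Intent


@dataclass
class PragmaticTurn:
    """Record that keeps the intent label but also marks initiative."""
    speaker: str
    text: str
    intent: Intent
    initiative: Initiative
    volunteered_slots: tuple[str, ...] = ()


def proactive_share(turns: list[PragmaticTurn]) -> float:
    """Fraction of turns marked proactive; 0.0 for an empty dialogue."""
    if not turns:
        return 0.0
    proactive = sum(1 for t in turns if t.initiative is Initiative.PROACTIVE)
    return proactive / len(turns)


dialogue = [
    PragmaticTurn("user", "I need a flight to Oslo on Friday.",
                  Intent.REQUEST, Initiative.RESPONSIVE),
    PragmaticTurn("agent", "There is one at 9:40.",
                  Intent.INFORM, Initiative.RESPONSIVE),
    PragmaticTurn("agent", "The connection is tight; a later flight has a longer layover.",
                  Intent.INFORM, Initiative.PROACTIVE, ("layover",)),
    PragmaticTurn("user", "The later one, please.",
                  Intent.CONFIRM, Initiative.RESPONSIVE),
]
print(proactive_share(dialogue))
```

With only the `ReactiveTurn` fields, the second and third agent turns would carry the same label, and a statistic like `proactive_share` could not be computed at all.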
The deeper reason, though, is that the systems built on these datasets are structurally passive by design, and that passivity feeds back into what we collect and reward. LLM-based agents can't initiate topics or lead a conversation because training optimizes them to respond to queries, not to act on goals of their own (see "Why can't conversational AI agents take the initiative?"). Standard RLHF makes this worse: rewarding immediate, single-turn helpfulness teaches models to answer passively rather than ask clarifying questions or offer multi-turn insight (see "Why do language models respond passively instead of asking clarifying questions?"). The same preference optimization measurably erodes the grounding acts (checks, clarifications, confirmations) that good dialogue depends on, dropping them 77.5% below human levels (see "Does preference optimization harm conversational understanding?"). When the reward signal punishes anything that isn't a direct answer, proactivity looks like noise, so neither the benchmarks nor the trained behavior preserves it.
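The reward argument can be illustrated with a toy comparison. This is a sketch under invented numbers, not the reward used by CollabLLM or by any RLHF pipeline: the `Candidate` fields, the scores, and the additive `multi_turn_reward` with its `weight` parameter are assumptions chosen only to show how the ranking can flip.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A candidate system response with made-up value estimates."""
    name: str
    immediate: float        # assumed single-turn helpfulness score
    expected_future: float  # assumed value of the rest of the conversation


def single_turn_reward(c: Candidate) -> float:
    """Scores only the current turn, as a single-turn preference signal would."""
    return c.immediate


def multi_turn_reward(c: Candidate, weight: float = 1.0) -> float:
    """Adds a weighted estimate of downstream value (hypothetical form)."""
    return c.immediate + weight * c.expected_future


candidates = [
    # Answers immediately but may have guessed the user's need wrong.
    Candidate("direct_answer", immediate=0.8, expected_future=0.2),
    # Less satisfying right now, but resolves the ambiguity first.
    Candidate("clarifying_question", immediate=0.3, expected_future=0.9),
]

best_single = max(candidates, key=single_turn_reward)
best_multi = max(candidates, key=multi_turn_reward)
print(best_single.name, best_multi.name)
```

With these illustrative scores, the single-turn signal prefers the direct answer, while the signal that accounts for later turns prefers the clarifying question. The same logic would apply to a proactive offer of information whose payoff only shows up later in the conversation.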
What's quietly interesting here is that fixing the dataset gap may require inverting the usual setup. Instead of collecting more human-human transcripts (expensive, and still shaped by the reactive frame), some work trains user simulators to generate richer, controllable conversations: conditioning on user profiles and intents to produce realistic synthetic dialogue (see "Can controlled latent variables make LLM user simulators realistic?"), or using multi-turn RL to keep a simulated persona consistent across a conversation (see "Can training user simulators reduce persona drift in dialogue?"). Pair that with frameworks built to track both speakers' evolving beliefs across turns (see "Can dialogue systems track both speakers' beliefs across turns?"), and you get a path to data where proactive, goal-driven moves are first-class rather than artifacts the collection pipeline was never designed to see. The underrepresentation, in other words, is less a data-scarcity problem than a design choice we can reverse.
Sources (8 notes)
Simulations show proactivity—providing relevant information without being asked—cuts dialogue turns by 60% in medium-complexity domains. This behavior mirrors human conversation and Grice's maxims but is almost entirely absent from AI datasets and research benchmarks.
Rasa's dialogue understanding architecture generates domain-specific commands instead of classifying intents, eliminating annotation requirements, handling context naturally, and scaling without degradation—treating understanding as pragmatics rather than semantics.
Research shows LLMs including ChatGPT cannot initiate topics, plan strategically, or lead conversations because their training optimizes for responding to queries, not creating dialogue from agent goals. This passivity is reinforced by alignment objectives and masked by fluent-sounding outputs.
CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.
RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically reduces grounding acts by 77.5% below human levels, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.
RecLLM demonstrates that conditioning an LLM simulator on session-level (user profile) and turn-level (user intent) latent variables produces synthetic conversations measurable as realistic via crowdsource discrimination, discriminator models, and classifier-ensemble distribution matching.
By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.
CRSA integrates rate-distortion theory with RSA to enable bidirectional belief tracking across dialogue turns. Demonstrated on referential games and doctor-patient dialogues, it captures progression from partial to shared understanding, providing the information-theoretic framework that token-level LLM systems lack.