SYNTHESIS NOTE

Why can't conversational AI agents take the initiative?

Explores whether current LLMs lack the structural ability to lead conversations, set goals, or anticipate user needs—and what architectural changes might enable proactive dialogue.

Synthesis note · 2026-02-22 · sourced from Conversation Agents

Three independent research programs converge on the same diagnosis: current LLM-based conversational agents, including ChatGPT and GPT-4, are fundamentally reactive. They respond to user queries but cannot initiate conversations, shift topics strategically, plan with subgoals, or offer recommendations that account for context beyond the current exchange.

The definition of proactivity comes from organizational behavior: "the capability to create or control the conversation by taking the initiative and anticipating impacts on themselves or human users." This is a well-defined property, not a vague aspiration — and it is systematically absent.

The gap matters most in situations requiring active engagement from both sides: exploratory search, complex decision-making, creative problem-solving. In these contexts, a purely reactive agent forces the user to carry the entire strategic burden of the conversation. The user must know what to ask, when to redirect, and how to structure the exchange, and must do so in precisely the situations where they most need help.

The structural cause is training: LLMs are trained to follow user instructions and generate next-turn responses. This produces impressive reactive capability but no mechanism for initiative. Even "proactive" features like topic suggestion are reactive: they are triggered by user input rather than driven by agent goals. The distinction is between responding to what the user has said and originating a move from the agent's own goals.

As argued in the note "Does preference optimization harm conversational understanding?", single-turn helpfulness training actively works against multi-turn strategic behavior. The passive architecture is not just a missing feature; it is reinforced by the training objective. And as the note "Why do language models sound fluent without grounding?" describes, the absence of initiative is further masked: models that skip clarifying questions, acknowledgments, and understanding checks sound more authoritative precisely because they perform less communicative work.

The practical consequence: proposed methods for enabling proactivity include learning to ask clarifying questions, topic shifting, and strategy planning with reinforcement learning (RL), but these remain research proposals. The deployed state of conversational AI is passive by default. A comprehensive survey (Deng et al., 2023) formalizes three subtasks for proactive dialogue systems: topic-shift detection (when to transition), topic planning (which path to follow), and topic-aware response generation (producing goal-directed utterances). Target types range from topical keywords to knowledge entities to full conversational goals. Yet even this taxonomy remains underexplored in deployed systems.
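
The three subtasks can be pictured as three functions chained into a single turn. The sketch below is illustrative only and is not taken from the survey: the topic graph, the word-count heuristic for shift detection, the breadth-first planner, and the template generator are all hypothetical stand-ins for learned components.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical topic graph: each topic lists topics that could follow it.
TOPIC_GRAPH: Dict[str, List[str]] = {
    "weather": ["travel"],
    "travel": ["hotels", "flights"],
    "hotels": ["booking"],
    "flights": ["booking"],
    "booking": [],
}


@dataclass
class DialogueState:
    current_topic: str
    target_topic: str      # the agent's conversational goal (assumed given)
    user_utterance: str


def detect_topic_shift(state: DialogueState, max_words: int = 3) -> bool:
    """Subtask 1 (when to transition). Toy heuristic: shift when the agent is
    off target and the user's reply is short, suggesting the topic is spent."""
    off_target = state.current_topic != state.target_topic
    user_is_brief = len(state.user_utterance.split()) <= max_words
    return off_target and user_is_brief


def plan_topic_path(start: str, target: str) -> List[str]:
    """Subtask 2 (which path to follow). Breadth-first search over the topic
    graph; returns the path including both ends, or [] if unreachable."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in TOPIC_GRAPH.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return []


def generate_response(state: DialogueState, next_topic: str) -> str:
    """Subtask 3 (goal-directed utterance). A template stands in for a model."""
    return f"Speaking of {state.current_topic}, have you thought about {next_topic}?"


def proactive_turn(state: DialogueState) -> str:
    """Chain the three subtasks; fall back to a reactive reply otherwise."""
    if detect_topic_shift(state):
        path = plan_topic_path(state.current_topic, state.target_topic)
        if len(path) > 1:
            return generate_response(state, path[1])
    return f"Tell me more about {state.current_topic}."
```

The point of the decomposition is that only the third function resembles what a reactive model already does; the first two decide whether and where to steer before any text is produced.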

The efficiency cost of passivity is quantifiable: in simulation, proactivity in task-oriented domains of medium complexity reduces dialogue turns by up to 60%. As the note "Could proactive dialogue make conversations dramatically more efficient?" observes, the absence is not just a capability gap but a data gap: proactivity is under-represented in training datasets, so models never encounter examples of it.
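
To see how volunteering information translates into fewer turns, consider a toy model in which a task needs a fixed number of information items and each agent turn resolves the one item the user asked about plus some number of unrequested items. This is illustrative arithmetic only: the function names and the numbers are assumptions chosen for the example, not the setup or the data of the cited simulation.

```python
def turns_needed(items_required: int, volunteered_per_turn: int) -> int:
    """Turns to cover all items when each turn resolves 1 requested item
    plus `volunteered_per_turn` unrequested ones (toy assumption)."""
    if items_required <= 0 or volunteered_per_turn < 0:
        raise ValueError("items_required must be > 0 and volunteered_per_turn >= 0")
    per_turn = 1 + volunteered_per_turn
    # Integer ceiling division avoids floating-point rounding.
    return (items_required + per_turn - 1) // per_turn


def turn_reduction(items_required: int, volunteered_per_turn: int) -> float:
    """Fractional reduction in turns relative to a purely reactive agent."""
    reactive = turns_needed(items_required, 0)
    proactive = turns_needed(items_required, volunteered_per_turn)
    return 1 - proactive / reactive
```

Under these assumptions a 10-item task takes 10 turns reactively and 4 turns when the agent volunteers two extra items per turn. The model ignores the cost of volunteering irrelevant information, which is where a real system would pay for over-eager proactivity.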

Two new architectural responses to this diagnosis have emerged. The Inner Thoughts framework reverses the question from "who speaks next?" to "does the agent have something worth saying?" by equipping AI with a continuous covert thought stream and intrinsic motivation scoring; the framework was preferred by humans 82% of the time. DiscussLLM takes the complementary approach: training a "silent token" prediction so models explicitly learn when NOT to intervene, formalizing the silence/speak decision as a classification task. Both recognize that the missing capability is not generating better responses but deciding whether to respond at all.
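
Both ideas reduce to a gate placed in front of generation. The sketch below combines them in the simplest possible form: candidate covert thoughts carry a motivation score, and the agent either voices the best one or emits a silence marker. The `Thought` type, the `<silent>` token string, the threshold value, and the hard-coded scores are hypothetical; neither framework's actual scoring or training procedure is reproduced here.

```python
from dataclasses import dataclass
from typing import List

SILENT = "<silent>"  # hypothetical marker for "do not intervene"


@dataclass
class Thought:
    text: str
    motivation: float  # assumed to be in [0, 1], produced by some scorer


def decide_turn(thoughts: List[Thought], threshold: float = 0.7) -> str:
    """Return the most motivated thought if it clears the threshold,
    otherwise the silence marker."""
    if not thoughts:
        return SILENT
    best = max(thoughts, key=lambda t: t.motivation)
    return best.text if best.motivation >= threshold else SILENT


# Usage: the agent stays quiet unless a thought is worth interrupting for.
candidates = [
    Thought("That reminds me of a similar bug.", 0.4),
    Thought("The deadline you mentioned conflicts with the release date.", 0.9),
]
reply = decide_turn(candidates)
```

The threshold makes the speak/silence trade-off explicit: raising it yields a quieter agent, lowering it a more intrusive one.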

ProAgent: intention inference as proactivity mechanism (from Arxiv/Agents Multi)

ProAgent addresses passivity through a hierarchical intention inference pipeline specifically designed for cooperative multi-agent settings. Its five-stage process represents a concrete architecture for proactive behavior:

(1) Knowledge Library and State Grounding: transforming raw state into language descriptions.
(2) High-level Skill Planning: analyzing the scene and inferring teammate intentions.
(3) Belief Correction: updating beliefs based on observed actual behavior.
(4) Skill Validation: checking and replanning if needed.
(5) Memory Storage: accumulating decision context.

The belief correction mechanism is key: rather than assuming static teammate behavior, ProAgent dynamically adjusts beliefs about partner intentions based on discrepancies between predicted and observed actions. This enables zero-shot coordination with unfamiliar teammates, addressing the passivity problem not through learned conversational initiative but through real-time social modeling.

The distinction matters: passivity in human-AI interaction (failing to lead conversation) and passivity in AI-AI cooperation (failing to anticipate teammates) have different surface manifestations but share the same root cause, the absence of goal-aware, other-modeling behavior.
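
The belief correction idea, revising an estimate of a teammate's intention when observed behavior diverges from what was predicted, can be illustrated with a plain Bayesian update. This is a numeric stand-in under stated assumptions, not ProAgent's mechanism as implemented: the intention names, action names, and likelihood table are hypothetical.

```python
from typing import Dict


def correct_belief(
    belief: Dict[str, float],
    likelihood: Dict[str, Dict[str, float]],
    observed_action: str,
) -> Dict[str, float]:
    """Posterior over intentions: prior * P(observed_action | intention),
    renormalized. If no intention explains the action, keep the prior."""
    unnormalized = {
        intention: prior * likelihood[intention].get(observed_action, 0.0)
        for intention, prior in belief.items()
    }
    total = sum(unnormalized.values())
    if total == 0:
        return dict(belief)
    return {intention: v / total for intention, v in unnormalized.items()}


# Hypothetical example: the agent initially expects its teammate to fetch.
prior = {"fetch_ingredient": 0.8, "deliver_dish": 0.2}
action_model = {
    "fetch_ingredient": {"move_to_pantry": 0.9, "move_to_counter": 0.1},
    "deliver_dish": {"move_to_pantry": 0.2, "move_to_counter": 0.8},
}

# The teammate heads for the counter, contradicting the prediction.
posterior = correct_belief(prior, action_model, "move_to_counter")
most_likely = max(posterior, key=posterior.get)
```

A single contradicting observation is enough here to flip the most likely intention from fetching to delivering, which is the behavior that lets an agent adapt to a teammate it has never seen before.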

Production agent deployment gap (from Arxiv/Agents): OpenAgents' real-world deployment reveals three concrete instantiations of passivity beyond conversational initiative. First, effective application specification via prompting requires instructions that cater to backend logic, output aesthetics, and adversarial safeguards; the instruction volume can exceed token limits, meaning agents can't fully specify their own operational context. Second, real-time interactive scenarios like streaming are essential for acceptable user experience but are complex to engineer with current LLM architectures. Third, current research gravitates toward idealized performance metrics while sidelining critical trade-offs between system responsiveness and accuracy, and the nuanced complexities of application-based failures. The gap between benchmarked and deployed agent performance is systematic, not incidental, and the 30% completion figure from the note "Why do AI agents fail at workplace social interaction?" confirms that real-world complexity surfaces failures invisible in benchmarks.

Inquiring lines that read this note 83

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

How can language models sustain linguistic synchrony and intersubjectivity during dialogue?

How should conversational agents balance goal-driven initiative with user control?

How does AI assistance affect human cognitive development and reasoning autonomy?

Why do multi-turn conversations degrade AI intent and coherence?

When should tasks involve human-AI partnership versus full automation?

Does RLHF training sacrifice accuracy and grounding for user agreement?

Why does preference optimization erode conversational grounding in AI assistants?

How can LLM user simulators model realistic goal-driven conversation?

How do interface design choices shape consciousness attribution?

Can AI be used as a channel for human-initiated alarm?

How should human oversight be integrated with autonomous AI systems?

How does treating AI as an agent affect user autonomy and decision-making?

How should dialogue systems represent uncertainty from noisy speech input?

How does objective evolution guide discovery better than fixed planning?

How should memory consolidation strategies shape agent performance over time?

What memory and planning capabilities do AI companions need for evolving user needs?

What coordination failures limit multi-agent LLM systems as they scale?

Why do LLM chatbots fail as independent therapeutic agents?

What architectural changes would enable proactive therapeutic guidance in chatbots?

Should GUI agents use structured representations instead of raw pixels?

What design discipline replaces navigation and layout in AI systems?

Why do language models reinforce false assumptions instead of correcting them?

Why do large language models fail at taking conversational initiative?

How do standardized protocols improve coordination in multi-agent systems?

How can AI agents autonomously learn and transfer skills across tasks?

Can next-state supervision work across different agent interaction types like conversations and tool calls?

Does conversational format create illusions of genuine AI communication?

How should we design LLM systems to maintain alignment and control?

What interaction controls matter most for effective human-LLM collaboration?

Why do reward structures fail to shape long-term agent learning?

Can architectural changes like decoupling intent understanding help overcome next-turn reward limitations?

How do language models inherit human biases from training data?

Can LLMs coordinate with humans better using different model architectures?

Can prompting inject entirely new knowledge into language models?

How do we evaluate AI systems when user perception misleads actual performance?

Which AI capabilities matter most for human-facing deployment contexts?

How do professional roles and expertise transform with AI-generated content?

How do formal dialogue structures reveal conversation coherence mechanisms?

How does conversational context fail as an authorization enforcement layer?

Can next-token prediction alone produce genuine language understanding?

Why do standard next-token prediction models struggle with conversational initiative?

Can debate mechanisms prevent silent agreement on wrong answers in multi-agent reasoning?

Why do multi-agent LLM systems converge prematurely without genuine deliberation or probing?

Related concepts in this collection 12

Does preference optimization harm conversational understanding? Exploring whether RLHF training that rewards confident, complete responses undermines the grounding acts—clarifications, checks, acknowledgments—that actually build shared understanding in dialogue.
single-turn training reinforces passivity
Do language models actually build shared understanding in conversation? When LLMs respond fluently to prompts, do they perform the communicative work humans do to establish mutual understanding? Research suggests they skip the grounding acts that make dialogue reliable.
another manifestation of reactive design; no active grounding effort
Can AI agents learn when they have something worth saying? What if AI proactivity came from modeling intrinsic motivation to participate rather than predicting who speaks next? This explores whether a framework based on human cognitive patterns—internal thought generation parallel to conversation—can make agents genuinely responsive rather than passively reactive.
strongest architectural answer: covert thought generation + intrinsic motivation
Can models learn when NOT to speak in conversations? Does training AI to explicitly predict silence—through a dedicated silent token—help models understand when intervention adds value versus when they should stay quiet? This matters for building conversational agents that feel naturally helpful rather than intrusive.
complementary approach: explicit silence/speak classification
Why do language models fail in gradually revealed conversations? Explores why LLMs perform 39% worse when instructions arrive incrementally rather than upfront, and whether they can recover from early mistakes in multi-turn dialogue.
39% multi-turn degradation is the empirical cost of passivity
Could proactive dialogue make conversations dramatically more efficient? Explores whether AI systems that volunteer relevant unrequested information could significantly reduce the back-and-forth turns required in task-oriented conversations, and why this behavior is missing from training data.
quantifies the efficiency cost of passivity
When should proactive agents push toward their goals versus accommodate users? Proactive dialogue agents face a tension between reaching their objectives efficiently and keeping users satisfied. This question explores whether these two aims can coexist or require constant negotiation.
proactivity creates new challenges when users are non-cooperative
Why do language models sound fluent without grounding? Explores whether LLM fluency masks the absence of communicative work—the clarifying questions, acknowledgments, and understanding checks that humans perform. Why does skipping these acts make models sound more confident?
passivity and the grounding gap are complementary: passivity describes the absence of initiative; the grounding gap describes the absence of communicative accountability; both are training consequences that get rewarded as fluency
Does RLHF training push therapy chatbots toward problem-solving? Explores whether reward signals optimizing for task completion in RLHF inadvertently train therapeutic chatbots to prioritize solutions over emotional validation, potentially undermining clinical effectiveness.
in therapeutic contexts passivity combines with the problem-solving bias: the model only responds (passive) and when it does it defaults to task completion (problem-solving); the clinical need is for initiative toward emotional attunement
Do LLMs predict persuasion based on actual dialogue or training bias? Why do large language models consistently predict concession-based persuasion intentions even when dialogue context suggests otherwise? Understanding this gap reveals how alignment training shapes not just model behavior but also how models perceive others' intentions.
the alignment-induced passivity extends to social modeling: RLHF not only makes agents passive in behavior but biases their predictions about others toward accommodation, projecting trained conciliatory disposition onto the agents they model
Why do standard alignment methods ignore partner interventions? Standard RLHF and DPO optimize for token-level quality but may structurally prevent agents from meaningfully incorporating partner input. This explores whether the training objective itself blocks collaborative reasoning.
ICR demonstrates the deeper mechanism: RLHF structurally cannot produce partner-aware collaboration; passivity toward partner contributions is a trained-in property, not a missing feature
Can models learn to ask clarifying questions without explicit training? Do language models trained only on fully-specified problems spontaneously develop the ability to ask for missing information when facing underspecified tasks? This tests whether conversational problem-solving strategies emerge from meta-learning rather than direct instruction.
direct training-level answer to the passivity diagnosis. Social meta-learning converts static problems into pedagogical dialogues with an information-asymmetric teacher; the resulting models proactively ask clarifying questions on underspecified tasks despite never being trained on underspecified problems. This moves "learning to ask" from research proposal to demonstrated training pattern — the passivity problem is addressable at the training level, not only via runtime architecture (Inner Thoughts) or prompt engineering.

Original note title

llm-based conversational agents are structurally passive — they lack goal awareness initiative-taking and the ability to lead conversation beyond responding to user queries
