How do discourse structure and dialogue state management relate to each other?
This explores how the shape of a conversation — how topics open, return, and hand off (discourse structure) — relates to the machinery that keeps track of where things stand across turns (dialogue state management), and whether one is just bookkeeping for the other.
This question is really asking whether discourse structure (the flow of topics, returns, and repairs in a conversation) and dialogue state management (the system's running record of what's been said and agreed) are the same thing seen from two angles — or whether they can come apart. The corpus suggests they're deeply linked but that the failures show up at the seam between them.
The sharpest finding is that *how* you represent state determines which discourse structures you can even support. Rigid stack-based state loses context the moment a popped topic comes back, while attention-based representations let a system reach any prior turn, which is what real conversation, with its interleaving and revisiting, actually demands (Why do dialogue systems lose context when topics return?). So discourse structure isn't decoration on top of state; the wrong state model quietly forbids whole categories of natural discourse. Strikingly, structure carries so much signal that you can predict whether a dialogue succeeds almost as well from its shape alone as from its content: 68% accuracy from structural features versus 70% from a content-based baseline (Can conversation structure predict dialogue success better than content?). That is evidence that discourse trajectory is itself a kind of state worth tracking.
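The contrast between the two state models can be made concrete with a toy sketch. Everything here is hypothetical illustration: `StackState`, `HistoryState`, and the travel-booking topics are invented names, and the exact-match scan in `HistoryState.lookup` is only a stand-in for attention over prior turns, not how a transformer actually retrieves context.

```python
from dataclasses import dataclass, field


@dataclass
class StackState:
    """Stack-based dialogue state: closing a topic discards its slots."""
    stack: list = field(default_factory=list)

    def open_topic(self, topic, slots):
        self.stack.append((topic, dict(slots)))

    def close_topic(self):
        # Popping removes the topic and everything recorded under it.
        return self.stack.pop()

    def lookup(self, topic):
        for name, slots in reversed(self.stack):
            if name == topic:
                return slots
        return None  # a popped topic can no longer be found


@dataclass
class HistoryState:
    """Flat turn history: every prior turn stays addressable."""
    turns: list = field(default_factory=list)

    def add_turn(self, topic, slots):
        self.turns.append((topic, dict(slots)))

    def lookup(self, topic):
        # Stand-in for attention: scan all turns, newest first.
        for name, slots in reversed(self.turns):
            if name == topic:
                return slots
        return None


# The user books a flight, then a hotel, finishes both, and moves on to a car.
stack = StackState()
stack.open_topic("flight", {"dest": "Oslo"})
stack.open_topic("hotel", {"nights": 2})
stack.close_topic()  # hotel finished
stack.close_topic()  # flight finished
stack.open_topic("car", {"days": 3})

history = HistoryState()
for topic, slots in [("flight", {"dest": "Oslo"}),
                     ("hotel", {"nights": 2}),
                     ("car", {"days": 3})]:
    history.add_turn(topic, slots)

# The user now returns to the flight topic.
stack_result = stack.lookup("flight")
history_result = history.lookup("flight")
```

When the flight topic returns, the stack has nothing left to consult, while the flat history still holds the earlier turn. The point is only that the choice of container decides whether a revisited topic is recoverable at all.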
Where it gets interesting is what 'state' should contain. Classic dialogue state tracks slots and topics, but several notes argue the real state is about *beliefs and common ground*. Collaborative rational speech acts (CRSA) model state as both speakers' beliefs evolving from partial to shared understanding (Can dialogue systems track both speakers' beliefs across turns?), which is the information-theoretic scoreboard that token-level LLMs lack. And LLMs lack it structurally: they read every later turn inside the frame of the initial prompt and can't jointly revise shared assumptions, leaving the human as the sole keeper of the conversational scoreboard (Can LLMs truly update shared conversational common ground?). That reframes the relationship: discourse structure assumes a mutually-updated state, but the LLM's state is one-sided.
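A minimal sketch of what a mutually-updated state means, as opposed to a one-sided one. This is not CRSA and involves no rate-distortion machinery; `Scoreboard`, `accept`, `retract`, and the example propositions are hypothetical names chosen for illustration. The only idea it encodes is that something counts as common ground when both parties hold it, so a revision by one side leaves a gap until the other side follows.

```python
from dataclasses import dataclass, field


@dataclass
class Scoreboard:
    """Toy two-sided belief state: one set of propositions per speaker."""
    beliefs: dict = field(
        default_factory=lambda: {"user": set(), "system": set()}
    )

    def accept(self, speaker, proposition):
        self.beliefs[speaker].add(proposition)

    def retract(self, speaker, proposition):
        self.beliefs[speaker].discard(proposition)

    def common_ground(self):
        # Shared understanding is whatever both sides currently hold.
        return self.beliefs["user"] & self.beliefs["system"]


board = Scoreboard()
board.accept("user", "trip is for work")
board.accept("system", "trip is for work")
shared_before = board.common_ground()

# The user revises an earlier assumption mid-conversation.
board.retract("user", "trip is for work")
board.accept("user", "trip is for leisure")
one_sided = board.common_ground()  # system still holds the old framing

# Only when the system also revises does the common ground recover.
board.retract("system", "trip is for work")
board.accept("system", "trip is for leisure")
shared_after = board.common_ground()
```

The middle step is the one-sided case described above: the user has moved on, the system's record has not, and the intersection is empty until the system absorbs the revision.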
The corpus also splits the failure modes apart, which clarifies the relationship by showing each can break alone. Coherence can fail in four semantic ways (contradiction, coreference slips, irrelevancy, and disengagement), all detectable through meaning representation even when the surface text looks fine (What semantic failures break dialogue coherence most realistically?). Topic-following turns out to be a missing *training signal* rather than a capacity gap: models learn what to do but not what to ignore, and fine-tuning on roughly a thousand synthetic distractor dialogues (1,080) significantly improves it (Why do language models engage with conversational distractors?). Meanwhile, much of what holds discourse together, such as reference repair and topic hand-off, is implicit social work that models never develop because training rewards information prediction, not relational maintenance (Why don't language models develop conversation maintenance skills?). So 'state management' in the human sense isn't a data structure at all; it's relational action.
The doorway worth walking through: managing structured discourse phases can itself be the state-management problem. Hierarchical RL for staged conversations (like motivational interviewing) collapses to one dominant move unless meta-learning preserves variability across user types (Can meta-learning prevent dialogue policies from collapsing?). Persona drift, whether local within a turn or global across the conversation, is a state-tracking failure that multi-turn RL on consistency rewards cuts by over 55% (Can training user simulators reduce persona drift in dialogue?). The pattern across all of this: discourse structure is the *demand*, dialogue state is the *substrate*, and almost every documented breakdown is the substrate being too rigid, too one-sided, or trained for the wrong objective to carry the structure people naturally expect.
Sources 9 notes
Research shows stack-based dialogue structures lose context when popped topics are revisited, while transformer attention enables systems to retrieve any previous turn without structural loss. Attention-based approaches naturally support the interleaved, revisiting nature of human conversation.
TRACE achieved 68% accuracy predicting dialogue success from structural features alone, nearly matching a 70% content-based baseline. A hybrid combining both reached 80%, suggesting that how agents communicate matters about as much as what they say.
CRSA integrates rate-distortion theory with RSA to enable bidirectional belief tracking across dialogue turns. Demonstrated on referential games and doctor-patient dialogues, it captures progression from partial to shared understanding, providing the information-theoretic framework that token-level LLM systems lack.
LLMs interpret all subsequent conversational turns within a fixed initial prompt frame, preventing them from symmetrically proposing updates to shared assumptions. Even when users pivot topics or contradict earlier framings, the model cannot absorb revisions into jointly held background, which makes the user the sole maintainer of the conversational scoreboard.
Research using Abstract Meaning Representation identified four distinct incoherence types: contradiction, coreference inconsistency, irrelevancy, and decreased engagement. AMR-trained classifiers detect these semantic failures while text-level manipulations alone cannot.
Fine-tuning on just 1,080 synthetic dialogues with distractor turns significantly improves topic resilience, revealing that the gap is not model capacity but absent training signal. Models learn to follow what-to-do instructions but not what-to-ignore instructions.
Humans keep conversations smooth through implicit techniques like reference repair and topic hand-off that sustain relational interaction, not convey information. Language models don't develop these because training signals reward information prediction, not relational work.
Without MAML, hierarchical RL for Motivational Interviewing phases collapses to a dominant action regardless of user type. Meta-learning enables the master policy to maintain variability and adapt across diverse user profiles.
By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.
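As a rough illustration of how three consistency metrics could feed a single reward signal, here is a minimal sketch. The note does not say how the metrics are combined, so the weighted average, the equal default weights, and the assumption that each score lies in [0, 1] are all guesses made for this example; `consistency_reward` is a hypothetical name.

```python
def consistency_reward(prompt_to_line, line_to_line, qa,
                       weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine three consistency scores into one scalar reward.

    Assumption: each score is already normalized to [0, 1], and a
    weighted average is an acceptable way to merge them.
    """
    scores = (prompt_to_line, line_to_line, qa)
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError("consistency scores must lie in [0, 1]")
    return sum(w * s for w, s in zip(weights, scores))


# A simulated user that matches its persona prompt but contradicts
# an earlier line scores lower than a fully consistent one.
fully_consistent = consistency_reward(1.0, 1.0, 1.0)
drifting = consistency_reward(1.0, 0.0, 1.0)
```

Keeping the three scores as separate inputs means the distinct failure types (local drift, global drift, factual contradiction) stay visible before they are merged into one number.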