Can offline RL and pragmatic inference together improve dialogue agent reliability?
This explores whether two distinct levers — reshaping the training objective (RL) and adding listener-aware reasoning at generation time (pragmatic inference) — attack different dialogue failure modes and could stack, even though the corpus doesn't have a single paper combining 'offline RL' with pragmatics under that exact label.
The corpus suggests the two levers are complementary because they fix failures at opposite ends of the pipeline: RL reshapes *what the model is rewarded for* during training, while pragmatic inference reshapes *how the model reasons about its listener* at generation time. If they really do target different reliability problems, using both could compound the gains.
The RL side of the corpus keeps surfacing the same root cause of unreliability: the reward signal optimizes the wrong horizon. CollabLLM shows that standard RLHF rewards immediate, single-turn helpfulness, which quietly trains models to be passive (guessing instead of asking clarifying questions), and that swapping in a multi-turn-aware reward that estimates long-term interaction value restores active intent discovery (see: Why do language models respond passively instead of asking clarifying questions?). That passivity is structural, not incidental: alignment objectives themselves train agents to react rather than lead (see: Why can't conversational AI agents take the initiative?). Other RL work shows that *how* RL is applied matters too: hierarchical dialogue policies collapse to one dominant action unless meta-learning preserves variability across user types (see: Can meta-learning prevent dialogue policies from collapsing?), and inverting the usual RL setup to train *user simulators* for consistency cuts persona drift by over 55% (see: Can training user simulators reduce persona drift in dialogue?).
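The horizon problem described above can be illustrated with a minimal sketch. It assumes a reward that averages the discounted return over a few simulated continuations of the dialogue; the function names, the discount factor, and all scores are hypothetical and chosen for illustration, and this is not CollabLLM's actual implementation.

```python
from statistics import mean


def single_turn_reward(turn_scores):
    """Reward under a single-turn objective: only the immediate reply counts."""
    return turn_scores[0]


def multi_turn_reward(rollouts, gamma=0.9):
    """Average discounted return over sampled continuations of the dialogue.

    rollouts: list of lists; each inner list holds per-turn scores for one
    simulated continuation that starts with the candidate reply.
    gamma: hypothetical discount factor for later turns.
    """
    returns = [
        sum(gamma ** t * score for t, score in enumerate(scores))
        for scores in rollouts
    ]
    return mean(returns)


# Invented scores: guessing looks good now but stalls later, while asking a
# clarifying question scores low now and pays off in later turns.
guess_rollouts = [[0.6, 0.1, 0.1], [0.6, 0.2, 0.0]]
ask_rollouts = [[0.2, 0.8, 0.9], [0.2, 0.7, 0.9]]

single = {
    "guess": single_turn_reward(guess_rollouts[0]),
    "ask": single_turn_reward(ask_rollouts[0]),
}
multi = {
    "guess": multi_turn_reward(guess_rollouts),
    "ask": multi_turn_reward(ask_rollouts),
}

best_single = max(single, key=single.get)  # picks "guess"
best_multi = max(multi, key=multi.get)     # picks "ask"
```

With these toy numbers the single-turn reward prefers guessing, while the multi-turn estimate prefers the clarifying question, which is the shift in behavior the paragraph attributes to multi-turn-aware rewards.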
Pragmatic inference attacks a problem RL can't easily reach: moment-to-moment self-monitoring. Endowing an agent with an 'imaginary listener' via Rational Speech Acts (RSA) suppresses contradictory and generic replies at inference time, crucially *without extra training or NLI labels*, by having the agent check whether its utterance would actually distinguish its persona from a distractor (see: Can imaginary listeners reduce dialogue agent contradictions?). CRSA extends this to track *both* speakers' beliefs across turns, supplying the information-theoretic, belief-state framework that token-by-token LLMs lack (see: Can dialogue systems track both speakers' beliefs across turns?). This belief-tracking instinct is old: classic POMDP dialogue systems already maintained distributions over user intent precisely because speech-recognition error rates of 15–30% in noisy environments make any single committed interpretation fragile (see: Why do dialogue systems need probabilistic reasoning?).
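The imaginary-listener check can be sketched as a small RSA reranker. This is a toy under stated assumptions: the base-speaker probabilities are invented, the listener uses a uniform prior over personas, and the names `literal_listener`, `pragmatic_speaker`, and the rationality parameter `alpha` are hypothetical rather than taken from the cited work.

```python
def literal_listener(utterance, likelihoods, personas):
    """L0(persona | utterance) under a uniform prior over personas."""
    scores = {p: likelihoods[p][utterance] for p in personas}
    total = sum(scores.values())
    return {p: s / total for p, s in scores.items()}


def pragmatic_speaker(target, likelihoods, alpha=2.0):
    """S1(u | target) proportional to S0(u | target) * L0(target | u) ** alpha."""
    personas = list(likelihoods)
    utterances = list(likelihoods[target])
    unnorm = {
        u: likelihoods[target][u]
        * literal_listener(u, likelihoods, personas)[target] ** alpha
        for u in utterances
    }
    z = sum(unnorm.values())
    return {u: v / z for u, v in unnorm.items()}


# Invented base-speaker probabilities S0(u | persona). The generic reply is
# equally likely under both personas, so it tells the listener nothing.
S0 = {
    "dog_owner": {
        "I love animals.": 0.5,
        "I walk my dog daily.": 0.4,
        "I have no pets.": 0.1,
    },
    "distractor": {
        "I love animals.": 0.5,
        "I walk my dog daily.": 0.1,
        "I have no pets.": 0.4,
    },
}

s1 = pragmatic_speaker("dog_owner", S0)
best_base = max(S0["dog_owner"], key=S0["dog_owner"].get)
best_pragmatic = max(s1, key=s1.get)
```

In this toy setup the base speaker's top choice is the generic "I love animals.", while the pragmatic speaker prefers "I walk my dog daily." and pushes the contradictory "I have no pets." close to zero. No parameters are updated, which matches the claim that the effect comes at inference time.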
The reason both layers are needed becomes clear from what reliability is fighting. An LLM doesn't hold a fixed character. It maintains a superposition of consistent characters and *samples* one at generation, so regenerating a response to the same prompt yields different but locally consistent answers (see: Do large language models actually commit to a single character?). RL can bias which distribution gets learned; pragmatic inference can prune which samples actually get emitted. Reframing understanding itself as pragmatics rather than semantics, by generating commands instead of classifying intents, points the same direction (see: Can command generation replace intent classification in dialogue systems?).
The honest gap: no note in this corpus runs the *combined* experiment, and 'offline RL' specifically (learning a policy from a fixed dataset rather than live rollouts) isn't named here. But the division of labor is suggestive — RL fixes the objective, pragmatics fixes the in-context reasoning — and the unexplored prize is that pragmatic listener-modeling could itself become the *reward signal* for offline RL, turning a one-off inference trick into a learned, durable behavior.
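One speculative way to picture the listener-as-reward idea is sketched below. No note in the corpus tests this; it is an illustration only. It assumes a fixed, logged dataset, scores each logged reply with a literal-listener posterior, and turns those scores into exponentiated-advantage weights in the style of advantage-weighted regression, which could then weight a supervised fine-tuning loss. The dataset, the function names, and the temperature `beta` are all hypothetical.

```python
import math


def listener_reward(reply_probs_by_persona, target):
    """Reward = literal-listener posterior on the target persona (uniform prior)."""
    total = sum(reply_probs_by_persona.values())
    return reply_probs_by_persona[target] / total


def offline_weights(rewards, beta=0.2):
    """Exponentiated-advantage weights, normalised to sum to 1.

    Uses the dataset mean as a crude baseline; a real offline RL method
    would typically use a learned value estimate instead.
    """
    baseline = sum(rewards) / len(rewards)
    raw = [math.exp((r - baseline) / beta) for r in rewards]
    z = sum(raw)
    return [w / z for w in raw]


# Fixed, logged dataset (invented): each entry gives a base model's probability
# of the logged reply under the target persona and under a distractor.
logged = [
    {"reply": "I love animals.", "probs": {"target": 0.5, "distractor": 0.5}},
    {"reply": "I walk my dog daily.", "probs": {"target": 0.4, "distractor": 0.1}},
    {"reply": "I have no pets.", "probs": {"target": 0.1, "distractor": 0.4}},
]

rewards = [listener_reward(ex["probs"], "target") for ex in logged]
weights = offline_weights(rewards)
```

No live rollouts are needed here: the rewards are computed once over the logged replies, and the persona-distinguishing reply receives most of the training weight while the contradictory one receives very little.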
Sources (9 notes)
CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.
Research shows LLMs including ChatGPT cannot initiate topics, plan strategically, or lead conversations because their training optimizes for responding to queries, not creating dialogue from agent goals. This passivity is reinforced by alignment objectives and masked by fluent-sounding outputs.
Without MAML, hierarchical RL for Motivational Interviewing phases collapses to a dominant action regardless of user type. Meta-learning enables the master policy to maintain variability and adapt across diverse user profiles.
By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.
Endowing dialogue agents with an imaginary listener via Rational Speech Acts reduces persona contradiction at inference time without NLI labels or extra training. The agent simulates whether utterances would distinguish its persona from a distractor, suppressing generic or contradictory responses.
CRSA integrates rate-distortion theory with RSA to enable bidirectional belief tracking across dialogue turns. Demonstrated on referential games and doctor-patient dialogues, it captures progression from partial to shared understanding, providing the information-theoretic framework that token-level LLM systems lack.
Real-world speech recognition achieves 15-30 percent error rates in noisy environments, making deterministic flowchart dialogue systems unworkable. POMDP-based systems handle this by maintaining belief distributions over user intent rather than committing to single interpretations.
Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating a response to the same prompt yields different outputs, each consistent with prior context, which indicates the model has not committed to a single object or character.
Rasa's dialogue understanding architecture generates domain-specific commands instead of classifying intents, eliminating annotation requirements, handling context naturally, and scaling without degradation—treating understanding as pragmatics rather than semantics.