How does single-turn optimization undermine multi-turn collaborative dynamics?
This explores how training and rewarding AI on isolated single responses sabotages the back-and-forth of real collaboration — where progress depends on asking, waiting, and revising over many turns.
The sharpest evidence is the gap between how models score in the lab and how they behave in conversation: an assistant that hits ~90% accuracy on a fully specified single message drops to ~65% once the same information arrives gradually across a natural exchange (see "Why do AI assistants get worse at longer conversations?"). The cause isn't a missing capability; it's the incentive. RLHF rewards being helpful *now*, so the model commits to an early guess instead of asking a clarifying question, then can't course-correct when later turns reveal it guessed wrong. Optimizing each turn to look good in isolation actively trains away the patience that multi-turn work requires.
The fix that's emerging is to stop scoring turns in isolation and start scoring the *trajectory*. Segment-level preference optimization beats both turn-level and session-level approaches precisely because the granularity matters: turn-level is too myopic (it can't see that a locally fine reply derailed the relationship three exchanges later), while whole-session scoring drowns the signal in noise from irrelevant turns. Targeting the erroneous turn plus its surrounding segment lets a social agent improve goal completion and relationship quality at the same time (see "Does segment-level optimization work better for multi-turn dialogue alignment?"). And the pessimism about RL in conversation turns out to be overstated: modified DAPO training nearly doubled SWE-bench Verified performance (from 20% to 39% on Qwen2.5-72B) on long-horizon, multi-step tasks with delayed rewards, showing that reinforcement learning does scale past the tidy single-turn case once you design it for stateful environments (see "Can reinforcement learning scale beyond single-turn language tasks?").
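To make the granularity argument concrete, here is a minimal sketch of how a segment-level preference pair might be assembled. It is not the published SDPO procedure: the `Turn` type, the per-turn `score` field, the `before`/`after` window sizes, and all function names are hypothetical, and the source does not say how segment boundaries or replacement segments are actually chosen.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Turn:
    speaker: str
    text: str
    score: float  # hypothetical per-turn quality score; higher is better


def find_erroneous_turn(turns: List[Turn], speaker: str = "agent") -> int:
    """Return the index of the lowest-scoring turn spoken by `speaker`."""
    candidates = [i for i, t in enumerate(turns) if t.speaker == speaker]
    if not candidates:
        raise ValueError(f"no turns from speaker {speaker!r}")
    return min(candidates, key=lambda i: turns[i].score)


def segment_bounds(n_turns: int, error_idx: int,
                   before: int = 1, after: int = 1) -> Tuple[int, int]:
    """Return a half-open [start, end) window around the erroneous turn."""
    if not 0 <= error_idx < n_turns:
        raise IndexError("error_idx out of range")
    start = max(0, error_idx - before)
    end = min(n_turns, error_idx + after + 1)
    return start, end


def build_preference_pair(turns: List[Turn], better_segment: List[Turn],
                          before: int = 1, after: int = 1) -> Dict[str, List[Turn]]:
    """Split a dialogue into shared context, rejected segment, and chosen segment."""
    idx = find_erroneous_turn(turns)
    start, end = segment_bounds(len(turns), idx, before, after)
    return {
        "context": turns[:start],       # shared history, not optimized
        "rejected": turns[start:end],   # the segment containing the bad turn
        "chosen": better_segment,       # a replacement segment (assumed given)
    }


dialogue = [
    Turn("user", "I need help planning the launch.", 1.0),
    Turn("agent", "Sure, what is the deadline?", 0.9),
    Turn("user", "End of the month, but budget is tight.", 1.0),
    Turn("agent", "I booked the premium venue for you.", 0.1),  # ignores the budget
    Turn("user", "That is too expensive.", 1.0),
    Turn("agent", "Understood, cancelling.", 0.6),
]
better = [
    Turn("user", "End of the month, but budget is tight.", 1.0),
    Turn("agent", "What is the budget ceiling for the venue?", 0.9),
    Turn("user", "About two thousand.", 1.0),
]
pair = build_preference_pair(dialogue, better)
print(len(pair["context"]), len(pair["rejected"]), len(pair["chosen"]))
```

Turn-level scoring would correspond to `before=0, after=0`, and session-level scoring to a window that covers the whole dialogue; the segment sits between the two.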
What makes this more than a tuning detail is that the single-turn habit produces a specific failure: skipping the grounding work that collaboration runs on. Models look socially competent when one model secretly controls every participant, but fail systematically the moment agents hold private information and have to actually exchange it; the omniscient setup lets them skip the very work real collaboration demands (see "Why do LLMs fail when simulating agents with private information?"). The same blind spot scales up: in multi-agent networks, agents accept a neighbor's claims without verification and either coordinate too late or change strategy without telling anyone, so local errors propagate as the network grows (see "Why do multi-agent systems fail to coordinate at scale?"). A turn that's individually reasonable but skips the check is exactly what single-turn optimization rewards.
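The propagation effect described above can be illustrated with a toy model. This is only a sketch under strong assumptions, not a reproduction of any benchmark: the graph, the `propagate` function, the `private_evidence` table, and the rule that an agent rejects a claim contradicting its own evidence are all hypothetical.

```python
from collections import deque
from typing import Callable, Dict, List, Optional, Set

Graph = Dict[int, List[int]]


def propagate(graph: Graph, source: int, claim: str,
              verify: Optional[Callable[[int, str], bool]] = None) -> Set[int]:
    """Spread `claim` breadth-first from `source`; return the agents that adopted it.

    With verify=None every agent accepts whatever a neighbor says.
    Otherwise an agent adopts the claim only if verify(agent, claim) is True.
    """
    adopted = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor in adopted:
                continue
            if verify is not None and not verify(neighbor, claim):
                continue  # the neighbor checked the claim and rejected it
            adopted.add(neighbor)
            queue.append(neighbor)
    return adopted


# A chain of five agents: 0 - 1 - 2 - 3 - 4. Agent 0 starts with a wrong claim.
chain: Graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

# Hypothetical private information: only agent 2 knows the true value.
private_evidence = {2: "blue"}


def check_against_private_evidence(agent: int, claim: str) -> bool:
    # Agents without evidence have nothing to check against and accept.
    return private_evidence.get(agent, claim) == claim


unverified = propagate(chain, source=0, claim="red")
verified = propagate(chain, source=0, claim="red",
                     verify=check_against_private_evidence)
print(sorted(unverified), sorted(verified))
```

In this toy chain, a single agent that checks the claim against what it privately knows stops the error from reaching everyone downstream, which is the step that unverified acceptance skips.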
The interesting move in the corpus is that several lines of work route *around* the conversational channel rather than trying to fix it. MetaGPT shows agents coordinate better through standardized shared artifacts (engineering documents they pull from) than through natural-language chat, because the artifact strips the noise that free conversation accumulates (see "Does structured artifact sharing outperform conversational coordination?"). Human-agent systems like Magentic-UI take a different tack and accept that there's no ground-truth answer for *when* to hand control back, so they spread the decision across six mechanisms (co-planning, co-tasking, action guards, verification, memory, and multitasking) instead of betting on one optimally timed turn (see "When should human-agent systems ask for human help?"). Both are admissions that the thing single-turn optimization can't give you, namely sustained, verifiable, revisable shared state, has to be engineered back in deliberately. The reader's takeaway: the wrong-turn problem isn't that the model is dumb; it's that we rewarded confidence over curiosity, and the cure is to make the unit of optimization the conversation, not the reply.
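As an illustration of the artifact-based coordination discussed above, the following sketch shows agents publishing typed documents to a shared pool and pulling only the kinds they need. It is not MetaGPT's actual API: `Artifact`, `ArtifactPool`, `publish`, `pull`, and the artifact kinds are hypothetical names chosen for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Artifact:
    kind: str    # hypothetical document type, e.g. "requirements" or "design"
    author: str
    body: str


class ArtifactPool:
    """Shared store that agents write to and read from instead of chatting."""

    def __init__(self) -> None:
        self._by_kind: Dict[str, List[Artifact]] = defaultdict(list)

    def publish(self, artifact: Artifact) -> None:
        """Add a document to the pool under its kind."""
        self._by_kind[artifact.kind].append(artifact)

    def pull(self, kinds: List[str]) -> List[Artifact]:
        """Return only the documents whose kind the caller asked for."""
        return [a for kind in kinds for a in self._by_kind.get(kind, [])]


pool = ArtifactPool()
pool.publish(Artifact("requirements", "product_manager", "Users can reset passwords."))
pool.publish(Artifact("design", "architect", "POST /reset sends a one-time token."))
pool.publish(Artifact("test_report", "qa", "Token expiry is not enforced."))

# An engineer agent pulls the design and test report, and never sees
# unrelated material it did not ask for.
for doc in pool.pull(["design", "test_report"]):
    print(f"[{doc.kind}] {doc.author}: {doc.body}")
```

The design choice being illustrated is that each reader selects by document kind, so state stays inspectable and revisable rather than buried in a chat transcript.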
Sources (7 notes)
LLMs perform at 90% accuracy with single-message instructions but drop to 65% across natural conversation. Models lock into early guesses when information arrives gradually and cannot course-correct, a behavior induced by RLHF training that rewards helpfulness over clarification.
SDPO (segment-level direct preference optimization) identifies erroneous turns and optimizes the surrounding segments, achieving simultaneous improvements in goal completion and relationship quality. Turn-level DPO is too granular; session-level DPO introduces noise from irrelevant turns.
Modified DAPO training nearly doubled SWE-bench Verified performance, from 20% to 39%, on Qwen2.5-72B, matching larger models. This demonstrates that RL works in stateful multi-step environments with delayed rewards and complex feedback, beyond theoretical single-turn MDPs.
Research shows LLMs perform well when one model controls all interlocutors but fail systematically when agents possess private information. This reveals that apparent social competence relies on grounding work that models skip in omniscient settings.
AgentsNet benchmark shows agents fail to coordinate strategies either by agreeing too late or adopting strategies without informing neighbors. Agents accept neighbor information without verification, enabling error propagation while remaining capable of detecting direct conflicts.
MetaGPT demonstrates that agents producing standardized engineering documents achieve superior coordination compared to conversational exchange. Active information pulling from shared environments eliminates noise and mirrors efficient human workplace infrastructure.
Magentic-UI identifies co-planning, co-tasking, action guards, verification, memory, and multitasking as mechanisms that work around the lack of ground truth for optimal deferral timing. Rather than solving the timing problem directly, these mechanisms distribute decision-making across multiple touchpoints.