SDPO: Segment-Level Direct Preference Optimization for Social Agents
Social agents powered by large language models (LLMs) can simulate human social behaviors but fall short in handling complex goal-oriented social dialogues. Direct Preference Optimization (DPO) has proven effective in aligning LLM behavior with human preferences across a variety of agent tasks. Existing DPO-based approaches for multi-turn interactions are divided into turn-level and session-level methods. Turn-level methods are overly fine-grained, focusing exclusively on individual turns, while session-level methods are too coarse-grained, often introducing training noise. To address these limitations, we propose Segment-Level Direct Preference Optimization (SDPO), which focuses on specific key segments within interactions to optimize multi-turn agent behavior while minimizing training noise. Evaluations on the SOTOPIA benchmark demonstrate that SDPO-tuned agents consistently outperform both existing DPO-based methods and proprietary LLMs like GPT-4o, underscoring SDPO’s potential to advance the social intelligence of LLM-based agents. We release our code and data at this url.
Introduction. Recent advancements in large language models (LLMs) have significantly enhanced their capabilities in language understanding and generation, particularly within the realm of human-machine interaction. By incorporating identity-specific information, LLM-based agents can simulate human social behaviors, demonstrating basic social intelligence in tasks such as role-playing casual conversations (Wang et al., 2024a; Lu et al., 2024) and navigating simulated social environments (Park et al., 2023). However, recent studies (Zhou et al., 2024) have shown that in more complex, goal-oriented social scenarios, such as negotiation, competition, and cooperation, LLMs still struggle to exhibit the nuanced decision-making abilities that are characteristic of human social interactions. In response to these challenges, several methods have been developed to better align LLM behavior with human preferences in multi-turn interactions. These approaches offer promising strategies for improving social decision-making in LLMs.
Discussion / Conclusion. In this paper, we introduce Segment-Level Direct Preference Optimization (SDPO) to improve the performance of LLM-based agents in multi-turn, goal-oriented social dialogues. Unlike existing alignment methods such as turn-level DPO and session-level approaches including ETO and DMPO, SDPO optimizes the agent policy by targeting specific key segments within a session. Our extensive evaluation on the SOTOPIA benchmark shows that SDPO significantly outperforms existing methods, highlighting the superiority of segment-level alignment. Looking ahead, we plan to apply SDPO to other agent tasks to further explore its versatility and effectiveness.
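To make the idea of a segment-level preference objective concrete, the following is a minimal sketch and not the released implementation. It assumes the standard DPO loss applied to a pair of segments, where each segment is scored by summing the policy-to-reference log-probability ratio over the agent's turns in that segment. The function names, the per-turn log-probability inputs, and the value beta = 0.1 are illustrative assumptions.

```python
import math
from typing import Sequence


def segment_log_ratio(policy_logps: Sequence[float],
                      ref_logps: Sequence[float]) -> float:
    """Sum log(pi / pi_ref) over the agent turns of one segment.

    Each entry is the total log-probability of one agent turn
    (hypothetical inputs, assumed to be computed elsewhere).
    """
    if len(policy_logps) != len(ref_logps):
        raise ValueError("policy and reference must cover the same turns")
    return sum(p - r for p, r in zip(policy_logps, ref_logps))


def segment_dpo_loss(chosen_policy: Sequence[float],
                     chosen_ref: Sequence[float],
                     rejected_policy: Sequence[float],
                     rejected_ref: Sequence[float],
                     beta: float = 0.1) -> float:
    """Standard DPO loss, -log sigmoid(margin), on a segment pair."""
    margin = beta * (
        segment_log_ratio(chosen_policy, chosen_ref)
        - segment_log_ratio(rejected_policy, rejected_ref)
    )
    # Numerically stable evaluation of -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Illustrative values only: two agent turns per segment.
loss = segment_dpo_loss(
    chosen_policy=[-1.0, -2.0], chosen_ref=[-1.5, -2.5],
    rejected_policy=[-3.0, -1.5], rejected_ref=[-2.5, -1.0],
)
```

Under this sketch, a turn-level variant would correspond to segments containing a single turn, and a session-level variant to segments spanning every agent turn in the session; the segment-level case sits between the two by restricting the sums to the selected turns.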