Active Listening: Personalized Question Generation in Open-Domain Social Conversation with User Model Based Prompting

Personalized Assistants · Question Answering and Search · Synthetic Dialogue Generation

Large language models (LLMs) capable of casual conversation have recently become widely available. We hypothesize that users of conversational systems want a more personalized experience, and existing work shows that users are highly receptive to personalized questions (PQs). Question Generation tasks, however, focus on factual questions from textual excerpts. To create a PQ generator, we first identify over 400 real user interests by anonymously aggregating ∼39K user models. We then populate prompt templates with these 400 interests and use an LLM to generate PQs customized to user interests. The result is PerQs, a novel corpus of ∼19K question/answer pairs. We evaluate PerQs at scale in the unique context of the Alexa Prize. Our results show significant positive effects on perceived conversation quality. We then fine-tune, deploy, and evaluate PerQy, a neural model that generates PQs in real-time. When evaluated against several competitive LLM baselines, PerQy produced the most natural and engaging responses.
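The pipeline described in the abstract (aggregate user models into a list of common interests, then fill prompt templates with each interest) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `user_models` data, the `min_users` threshold, the `TEMPLATE` wording, and the function names are all hypothetical, and the LLM call itself is left out.

```python
from collections import Counter

# Hypothetical user models: the set of interests detected for each anonymous user.
user_models = [
    {"cooking", "hiking"},
    {"cooking", "video games"},
    {"hiking", "cooking"},
    {"astronomy"},
]

# Hypothetical prompt template; the paper's actual templates are not shown here.
TEMPLATE = "Write a friendly question to ask someone who is interested in {interest}."


def aggregate_interests(models, min_users=2):
    """Count how many user models mention each interest and keep the frequent ones."""
    counts = Counter(interest for model in models for interest in model)
    return sorted(name for name, n in counts.items() if n >= min_users)


def build_prompts(interests, template=TEMPLATE):
    """Fill the template once per interest; each prompt would then go to an LLM."""
    return {interest: template.format(interest=interest) for interest in interests}


interests = aggregate_interests(user_models)
prompts = build_prompts(interests)
```

Counting over whole user models rather than individual utterances keeps the aggregation anonymous in the sense that only interest frequencies, not per-user records, are carried forward.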

Introduction. Large language models (LLMs) capable of casual conversation have recently become widely available, leading to an increase in research on social open-domain dialogue (Higashinaka et al., 2021, 2014; Zhang et al., 2020; Kim et al., 2023; Zheng et al., 2023; Ouyang et al., 2022, inter alia). In addition, challenges like the Alexa Prize Socialbot Challenge (henceforth AP) (Gabriel et al., 2020; Hu et al., 2021b; Johnston et al., 2023) have given real users the ability to access and evaluate spoken conversational systems in their homes. We hypothesize that users of such conversational systems want a more personalized experience (Ritschel et al., 2017; Sugiyama et al., 2014; Bickmore and Picard, 2005; Clark et al., 2019). Research shows that conversational partners are better liked if they ask more follow-up questions (Huang et al., 2017), and such questions show that the hearer is listening and understanding (Bevacqua et al., 2012; Meguro et al., 2014; Reis et al., 2011; Reis and Patrick, 1996).

Discussion / Conclusion. [...] around open-domain dialogue system design, as it suggests the potential of a neuro-symbolic approach instead of relying on a single larger general model. It is worth noting that response latency is not considered during judgment, so this study does not reflect the increased risk associated with using larger, slower models in a real-time dialogue system. The differences between PerQy and the other LLMs are statistically significant (χ² ≥ 24.824 and p ≤ 0.001) for all 4 judgments. Fewer differences are statistically significant when the Slightly and Definitely labels are collapsed into a single label. [...] generate PerQy’s training data. This may indicate that PerQy’s compact model captures core nuances specific to PQs that a general LLM loses. Using collapsed labels, Vicuna 33B, the largest model we examined, still outperforms PerQy with respect to engagement and consistency. Figure 8 and Figure 9 show instances where all five Mechanical Turkers prefer Vicuna 33B when [...] Personalized Question Generation (PQG) is a unique task focused on generating PQs in conversations. We use an LLM to generate PerQs, a corpus of ∼19K personalized questions and answers based on real user interests.
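To make the significance comparison concrete, the following is a small sketch of a Pearson chi-square test on judgment counts, before and after collapsing the Slightly and Definitely labels. The exact test setup used in the paper is not given in this excerpt, so the table layout (one row per system, columns Not / Slightly / Definitely) is an assumption, and all counts are made up for illustration. The critical values are the standard chi-square thresholds at p = 0.001.

```python
# Standard chi-square critical values at p = 0.001, keyed by degrees of freedom.
CRITICAL_P001 = {1: 10.828, 2: 13.816}


def chi_square(table):
    """Return the Pearson chi-square statistic and degrees of freedom for a table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def collapse(table):
    """Merge the Slightly and Definitely columns (assumed order: Not, Slightly, Definitely)."""
    return [[row[0], row[1] + row[2]] for row in table]


# Hypothetical judgment counts, NOT taken from the paper.
judgments = [
    [20, 30, 50],  # system A: Not, Slightly, Definitely
    [40, 30, 30],  # system B: Not, Slightly, Definitely
]

full_stat, full_dof = chi_square(judgments)
coll_stat, coll_dof = chi_square(collapse(judgments))

full_significant = full_stat >= CRITICAL_P001[full_dof]
coll_significant = coll_stat >= CRITICAL_P001[coll_dof]
```

Collapsing labels reduces the degrees of freedom and discards the distinction between mild and strong preferences, which is one reason a comparison can change significance status once the labels are merged.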