Building and Evaluating Open-Domain Dialogue Corpora with Clarifying Questions
Enabling open-domain dialogue systems to ask clarifying questions when appropriate is an important direction for improving the quality of system responses. In particular, when a user request is not specific enough for a conversational system to answer right away, it is desirable to ask a clarifying question to increase the chances of retrieving a satisfying answer. To address the problem of ‘asking clarifying questions in open-domain dialogues’: (1) we collect and release a new dataset focused on open-domain single- and multi-turn conversations, (2) we benchmark several state-of-the-art neural baselines, and (3) we propose a pipeline consisting of offline and online steps for evaluating the quality of clarifying questions in various dialogues. These contributions are suitable as a foundation for further research.
Introduction. The ultimate goal of a conversational system is to assist users by returning an appropriate answer in response to their requests (Kiseleva et al., 2016a; Li et al., 2021). Recent progress on neural approaches to natural language processing (Devlin et al., 2019; Liu et al., 2019; Clark et al., 2020) and the availability of large amounts of conversational data have triggered a renaissance in end-to-end neural open-domain chatbots (Adiwardana et al., 2020; Roller et al., 2021; Zhang et al., 2020; Burtsev et al., 2017; Dalton et al., 2020). There has been great progress on proposing measures of what makes a conversation satisfying for users, using various human evaluation techniques (Li et al., 2019a; See et al., 2019b). These efforts showed that the proposed large pre-trained models do not always perform seamlessly (See et al., 2019a), and that several challenges remain to be solved for open-domain conversational systems (Huang et al., 2020). Nass and Moon (2000) conclude that people have similar expectations when talking to bots and when talking to humans.
Discussion / Conclusion. Asking clarifying questions when a user request is ambiguous is essential for developing human-like open-domain dialogue systems. In this work, we introduce a large-scale dataset that covers almost 300 different topics and is suitable as a foundation for further research. The collected dataset includes both single- and multi-turn conversations. We benchmark several state-of-the-art neural models that were fine-tuned for asking clarifying questions. Based on how these models performed, we conclude that they are solid baselines for future research that more fully explores the problem space. We also propose an offline automatic evaluation pipeline, which agrees with human-in-the-loop evaluation. We publicly release the collected datasets, the code for training the baselines, and our evaluation procedure in order to push forward the state of the art.