No that's not what I meant: Handling Third Position Repair in Conversational Question Answering

Paper · arXiv 2307.16689 · Published July 31, 2023
Question Answering and Search · Conversational Agents · NLP and Linguistics

The ability to handle miscommunication is crucial to robust and faithful conversational AI. People usually deal with miscommunication as soon as they detect it, using highly systematic interactional mechanisms called repair. One important type of repair is Third Position Repair (TPR). Here a speaker is initially misunderstood, and then corrects the misunderstanding once the addressee's erroneous response reveals it (see Fig. 1). We collect and publicly release REPAIR-QA, the first large dataset of TPRs in a conversational question answering (QA) setting. The data comprises the TPR turns, the corresponding dialogue contexts, and candidate repairs of the original turn for executing the TPRs. We demonstrate the usefulness of the data by training and evaluating strong baseline models for executing TPRs. For stand-alone TPR execution, we perform both automatic and human evaluations on a fine-tuned T5 model as well as on OpenAI's GPT-3 LLMs. We also evaluate the LLMs' TPR processing capabilities extrinsically, on the downstream conversational QA task. Out of the box, the GPT-3 models perform poorly on TPRs. Their performance improves significantly once they are exposed to REPAIR-QA.
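The abstract describes a TPR instance as a TPR turn, its dialogue context, and a candidate repair of the original turn. It also frames TPR execution as producing that repaired turn. The sketch below is a minimal illustration of how one such record might be represented and rendered into a few-shot prompt for an LLM. It is written under our own assumptions. The names `TPRExample`, `build_tpr_prompt` and `_render`, the field layout, and the prompt template are all hypothetical. They are not REPAIR-QA's actual schema or the prompts used in the paper, and the dialogues are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class TPRExample:
    """Hypothetical record layout for one TPR instance (field names are illustrative)."""
    original_question: str      # the turn that was initially misunderstood
    erroneous_response: str     # the addressee's reply that exposes the misunderstanding
    repair_turn: str            # the Third Position Repair, e.g. "No, I meant ..."
    repaired_question: str = ""  # execution target: the original turn with the repair applied
    context: List[str] = field(default_factory=list)  # earlier dialogue turns, if any


def _render(ex: TPRExample, with_target: bool) -> str:
    """Render one example as labelled lines; the target is included only for demonstrations."""
    lines = [f"Context: {' '.join(ex.context)}"] if ex.context else []
    lines += [
        f"User: {ex.original_question}",
        f"System: {ex.erroneous_response}",
        f"User: {ex.repair_turn}",
        # The model is expected to continue after this label with the repaired question.
        "Repaired question:" + (f" {ex.repaired_question}" if with_target else ""),
    ]
    return "\n".join(lines)


def build_tpr_prompt(query: TPRExample, demos: Sequence[TPRExample] = ()) -> str:
    """Hypothetical few-shot prompt: solved demonstrations first, then the unsolved query."""
    blocks = [_render(d, with_target=True) for d in demos]
    blocks.append(_render(query, with_target=False))
    return "\n\n".join(blocks)


# Invented example dialogues, for illustration only.
demo = TPRExample(
    original_question="Who directed Dune?",
    erroneous_response="David Lynch directed Dune (1984).",
    repair_turn="No, I meant the 2021 film.",
    repaired_question="Who directed the 2021 film Dune?",
)
query = TPRExample(
    original_question="What is the capital of Georgia?",
    erroneous_response="The capital of Georgia is Atlanta.",
    repair_turn="No, I meant the country.",
)
prompt = build_tpr_prompt(query, [demo])
print(prompt)
```

With zero demonstrations, the same function yields a zero-shot prompt. The two settings differ only in whether solved repair examples precede the query. This loosely mirrors the contrast between the models' out-of-the-box behaviour and their behaviour after exposure to REPAIR-QA.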

Introduction. Participants in conversation need to work together, moment by moment, to achieve shared understanding and coordination (Clark, 1996; Clark and Brennan, 1991; Goodwin, 1981; Healey et al., 2018; Mills, 2007). One of the key interactional mechanisms that enables this is called repair (Schegloff et al., 1977; Schegloff, 1992; see Fig. 1). Repair is a set of universal, highly systematised (Dingemanse et al., 2015), local methods for dealing with miscommunication as it is detected. Miscommunication also arises in human-machine conversation. The ability to interpret and generate effective repair sequences is therefore crucial to robust Conversational AI technology. It also ensures that Natural Language Understanding (NLU) output and/or subsequent system responses remain faithful to what the user intended.

Discussion / Conclusion. The ability to interpret and generate repairs is essential to robust and faithful Conversational AI. In this paper, we focused on Third Position Repair (TPR), which has been largely neglected in the NLP community. We collected, analysed and released the first large dataset of TPRs. We used it to evaluate strong baseline repair execution models, and also the conversational QA performance of OpenAI's Davinci model when it encounters TPRs. Out of the box, Davinci performs very poorly on TPRs. Its performance improves when the model is exposed to the REPAIR-QA dataset. Even then, however, Davinci does not reach acceptable performance on TPRs when evaluated end to end in a conversational QA setting. This is a symptom of how sparse TPRs are in the original dialogic data used to pretrain Davinci, and LLMs in general. It suggests that LLM researchers should be more selective in how they compile pretraining datasets. For this paper, we did not have a chance to evaluate later LLM releases (e.g. GPT-3.5 and GPT-4). It would be telling to see how much these later models improve on TPRs.