Evaluating Emotional Nuances In Dialogue Summarization

Paper · arXiv 2307.12371 · Published July 23, 2023
Dialog Topics and Modeling · Emotions and AI · NLP and Linguistics · Sentiment, Semantics, and Toxicity Detection

Automatic dialogue summarization is a well-established task that aims to identify the most important content of human conversations and condense it into a short textual summary. Despite recent progress in the field, we show that most research has focused on summarizing factual information, leaving aside affective content, which can nevertheless convey useful information for analysing, monitoring, or supporting human interactions. In this paper, we propose and evaluate a set of measures, PEmo, to quantify how much emotion is preserved in dialogue summaries. Results show that state-of-the-art summarization models do not preserve emotional content well in their summaries. We also show that reducing the training set to only emotional dialogues better preserves emotional content in the generated summaries while retaining the most salient factual information.

Introduction. Automatic dialogue summarization has been widely studied and applied to various domains, including meetings [Carletta et al., 2005, Zhong et al., 2021], chats [Gliwa et al., 2019], email threads [Zhang et al., 2021], media interviews [Zhu et al., 2021], customer service [Feigenblat et al., 2021, Macary et al., 2020, Favre et al., 2015, Lin et al., 2021], and medical dialogues [Song et al., 2020]. However, most research has focused on summarizing factual information, leaving aside affective content. Affective content has been the target of a few summarization tasks, such as opinion summarization [Wang and Ling, 2016]. Opinion, however, is only a subset of affective expression, and such tasks mainly focus on non-dialogue texts such as reviews. In dialogues, factual information and subjective content are often intertwined. While it is essential to summarize the most pertinent factual information, the subjective content can also provide valuable insights, for example to improve customer service interactions or to better support patients in the context of e-healthcare.

Discussion / Conclusion. In dialogue summarization, the most important content almost always focuses on factual information, leaving aside the affective content of the interaction. We argue that emotional information is important content to report in dialogue summaries. To measure emotional content in dialogues and in summaries, we train SA models at the word level. We conduct a corpus-based analysis of the DialogSum corpus, whose annotators were explicitly instructed to include emotion when writing reference summaries, and show that emotion is nonetheless omitted to some extent in the reference summaries. We then propose a new set of measures to evaluate the relevance of a summary based on its emotional load (proportion) and its polarity (PEmoP / PEmoN). Using these measures, we show that the summarization model often exhibits a mismatch between the emotional content of the input dialogue and that of the summary. We also show that by carefully selecting the training target, we can decrease this mismatch. This method provides a more comprehensive measure of dialogue summarization performance. In this study, we chose the DialogSum corpus for analysis because it explicitly considers emotions in its annotation guidelines.
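
To make the idea of a proportion-based emotion measure concrete, the following is a minimal Python sketch, not the paper's implementation. The excerpt above does not give the exact definition of PEmo, PEmoP, or PEmoN, so everything here is an assumption made for illustration: the toy lexicon `TOY_LEXICON` stands in for a trained word-level model, and the function names `label_words`, `emotion_proportions`, and `emotion_gap` are hypothetical. The sketch computes the share of words labelled emotional, positive, and negative in a text, then compares a dialogue with its summary.

```python
from typing import Dict, List

# ASSUMPTION: a toy lexicon standing in for a trained word-level model.
# The paper's actual models and label set are not described in this excerpt.
TOY_LEXICON: Dict[str, str] = {
    "happy": "pos", "great": "pos", "thanks": "pos",
    "angry": "neg", "sorry": "neg", "terrible": "neg",
}


def label_words(text: str) -> List[str]:
    """Label each whitespace-separated token as 'pos', 'neg', or 'neu'."""
    tokens = [t.strip(".,!?;:").lower() for t in text.split()]
    return [TOY_LEXICON.get(t, "neu") for t in tokens if t]


def emotion_proportions(text: str) -> Dict[str, float]:
    """Return the proportion of emotional, positive, and negative words."""
    labels = label_words(text)
    n = len(labels)
    if n == 0:
        return {"emo": 0.0, "pos": 0.0, "neg": 0.0}
    pos = labels.count("pos") / n
    neg = labels.count("neg") / n
    return {"emo": pos + neg, "pos": pos, "neg": neg}


def emotion_gap(dialogue: str, summary: str) -> Dict[str, float]:
    """Absolute difference between dialogue and summary proportions.

    ASSUMPTION: an absolute difference is one simple way to express the
    mismatch discussed above; the paper may define it differently.
    """
    d = emotion_proportions(dialogue)
    s = emotion_proportions(summary)
    return {key: abs(d[key] - s[key]) for key in d}


if __name__ == "__main__":
    dialogue = "I am angry. This is terrible!"
    summary = "The customer complains about a delay."
    print(emotion_gap(dialogue, summary))
```

In this toy case the dialogue contains negative words and the summary contains none, so the gap for the negative proportion is non-zero. A gap near zero would indicate that the summary carries roughly the same emotional load as the dialogue under this simplified reading.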