Explainable Multimodal Emotion Reasoning

Paper · arXiv 2306.15401 · Published June 27, 2023
Multimodal Models · Emotions and AI

Multimodal emotion recognition is an active research topic in the field of artificial intelligence. It aims to integrate multimodal clues (including acoustic, visual, and lexical clues) and recognize human emotional states from these clues. Current works generally assume that the emotion labels of benchmark datasets are correct and focus on building more effective architectures to achieve better performance. But due to the ambiguity and subjectivity of emotion, existing datasets cannot achieve high annotation consistency (i.e., labels may be inaccurate), making it difficult for models developed on these datasets to meet the demands of practical applications. The core of addressing this problem is to improve the reliability of emotion annotations. Therefore, we propose a new task called “Explainable Multimodal Emotion Reasoning (EMER)”. Unlike previous works that only predict emotional states, EMER further explains the reasons behind these predictions to enhance their reliability. In this task, rationality is the only evaluation metric. As long as the emotion reasoning process for a given video is plausible, the prediction is correct. In this paper, we make an initial attempt at this task, and establish a benchmark dataset, baselines, and evaluation metrics.

Introduction. Multimodal emotion recognition has experienced rapid development in recent years [1, 2]. Current works mainly focus on collecting larger and more realistic datasets [3, 4], or building more effective architectures [5, 6]. Despite promising results, multimodal emotion recognition suffers from label ambiguity. Label ambiguity stems from the subjectivity of emotion itself, i.e., different annotators may assign distinct emotions to the same video. Due to label ambiguity, the emotion labels of existing benchmark datasets may be inaccurate. Therefore, it is difficult for systems developed on these datasets to meet the high reliability requirements of practical applications. To alleviate label ambiguity, previous works mainly employ more annotators and use majority vote to select the single most relevant label or the several most relevant labels [7, 8]. Although majority vote improves annotation reliability, this process removes correct but non-dominant labels, limiting the ability to describe subtle emotions.
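The effect of majority vote on non-dominant labels can be illustrated with a minimal sketch. The function names (`majority_vote`, `discarded_labels`), the `top_k` parameter, and the example annotations are hypothetical and chosen only for illustration; they are not taken from any of the cited datasets or annotation pipelines.

```python
from collections import Counter


def majority_vote(annotations, top_k=1):
    """Return the top_k most frequent labels among the annotators' votes."""
    counts = Counter(annotations)
    return [label for label, _ in counts.most_common(top_k)]


def discarded_labels(annotations, top_k=1):
    """Return the labels that at least one annotator chose but the vote drops."""
    kept = set(majority_vote(annotations, top_k))
    return sorted(set(annotations) - kept)


# Hypothetical votes from five annotators for a single video clip.
votes = ["happy", "happy", "happy", "surprised", "nervous"]

print(majority_vote(votes))     # the dominant label that survives the vote
print(discarded_labels(votes))  # plausible but non-dominant labels that are lost
```

In this toy case only the dominant label is kept, while the two minority labels are dropped even though they may describe real, subtler aspects of the clip.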

Discussion / Conclusion. The reason behind these phenomena is that current multimodal LLMs are trained on image or video caption datasets, which mainly focus on clothing, environment, action, etc., rather than face-centric descriptions. Meanwhile, current image or video caption datasets usually do not describe multimodal information, which limits the audio-video-text understanding ability of models trained on them. In the future, we will build stronger frameworks to improve performance on this challenging task. In this paper, we propose a new task called “Explainable Multimodal Emotion Reasoning (EMER)”. Unlike traditional emotion recognition, EMER aims to predict emotional states together with the reasons behind these predictions. The rationality of the reasoning process is the only evaluation metric. With this task, we aim to enhance the reliability of current emotion recognition systems. We establish a benchmark dataset, baselines, and evaluation metrics for EMER.