Explainable Multimodal Emotion Reasoning
arXiv (2023)
Abstract
Multimodal emotion recognition is an active research topic in artificial intelligence. Its main goal is to integrate multiple modalities to identify human emotional states. Current works generally assume accurate emotion labels for benchmark datasets and focus on developing more effective architectures. However, emotions are inherently ambiguous and subjective. To obtain more reliable labels, existing datasets usually restrict the label space to a few basic categories, then hire multiple annotators and use majority voting to select the most likely label. However, this process may discard labels that are correct but fall outside the candidate set or the majority vote, as the sketch below illustrates.
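The abstract does not give the aggregation procedure in detail; the following is a minimal Python sketch of plain majority voting over hypothetical annotations, showing how a plausible minority label is dropped.

```python
from collections import Counter

def majority_vote(annotations: list[str]) -> str:
    """Return the single most frequent label; ties are broken arbitrarily."""
    return Counter(annotations).most_common(1)[0][0]

# Hypothetical annotations for one clip: three annotators choose from a
# restricted set of basic categories.
annotations = ["happy", "happy", "surprised"]
print(majority_vote(annotations))  # -> "happy"
# The plausible minority label "surprised" is silently discarded, and any
# emotion outside the candidate set could never be recorded at all.
```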
To improve reliability without ignoring subtle emotions, we propose a new task called "Explainable Multimodal Emotion Reasoning (EMER)". In contrast to traditional tasks that focus on predicting emotions, EMER goes a step further by providing explanations for these predictions. Through this task, we can extract more reliable labels, since each label is grounded in an explicit rationale. Meanwhile, we use LLMs to disambiguate unimodal descriptions and generate more complete multimodal EMER descriptions, from which we can extract more subtle labels (see the sketch after this paragraph), providing a promising approach for open-vocabulary emotion recognition.
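The abstract does not specify the prompts or the LLM interface used for this disambiguation step; the sketch below is a hypothetical illustration, where `call_llm` is an assumed stand-in for any chat-completion client, not the authors' actual implementation.

```python
# Hypothetical sketch of the LLM-based merging step described above.
def call_llm(prompt: str) -> str:
    """Assumed interface: plug in any chat-completion client here."""
    raise NotImplementedError

def merge_descriptions(visual: str, audio: str, subtitle: str) -> str:
    """Ask the LLM to resolve conflicts among unimodal clues and produce
    one multimodal emotion description with open-vocabulary labels."""
    prompt = (
        "Unimodal emotion clues for one video clip:\n"
        f"- Visual: {visual}\n"
        f"- Audio: {audio}\n"
        f"- Subtitle: {subtitle}\n"
        "Disambiguate these clues, write a single description explaining "
        "the person's emotional state, and list the emotion labels "
        "(open vocabulary) that the description supports."
    )
    return call_llm(prompt)
```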
This paper presents our initial efforts: we introduce a new dataset, establish baselines, and define evaluation metrics. In addition, EMER can serve as a benchmark for evaluating the audio-video-text understanding capabilities of multimodal LLMs. To facilitate further research, we will make the code and data available at: https://github.com/zeroQiaoba/AffectGPT.