MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild
arXiv (2024)
Abstract
Dynamic Facial Expression Recognition (DFER) has received significant interest in recent years, driven by its pivotal role in enabling empathic and human-compatible technologies. Achieving robustness towards in-the-wild data in DFER is particularly important for real-world applications. One direction for improving such models is multimodal emotion recognition based on audio and video data. Multimodal learning in DFER increases model capabilities by leveraging richer, complementary data representations. Within the field of multimodal DFER, recent methods have focused on exploiting advances in self-supervised learning (SSL) for pre-training strong multimodal encoders. Another line of research has focused on adapting pre-trained static models for DFER. In this work, we propose a different perspective on the problem and investigate advancing multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders. We identify the main challenges associated with this task, namely intra-modality adaptation, cross-modal alignment, and temporal adaptation, and propose solutions to each of them. As a result, we demonstrate improvement over the current state-of-the-art on two popular DFER benchmarks, namely DFEW and MAFW.
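To make the three challenges concrete, below is a minimal PyTorch sketch of the general recipe the abstract describes: two frozen, independently SSL-pre-trained unimodal encoders are made trainable for multimodal DFER via lightweight adapters (intra-modality adaptation), a cross-attention step between modalities (cross-modal alignment), and a small transformer over the frame axis (temporal adaptation). All module names, dimensions, and the specific fusion scheme are illustrative assumptions, not the paper's actual MMA-DFER architecture.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: a small trainable residual branch used to adapt
    a frozen encoder's features (illustrative intra-modality adaptation)."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual adaptation


class MultimodalDFERSketch(nn.Module):
    def __init__(self, dim: int = 512, num_classes: int = 7):
        super().__init__()
        # Stand-ins for frozen SSL-pre-trained unimodal encoders
        # (assumed feature sizes: 768 for video frames, 1024 for audio tokens).
        self.video_encoder = nn.Linear(768, dim)
        self.audio_encoder = nn.Linear(1024, dim)
        for p in self.video_encoder.parameters():
            p.requires_grad = False
        for p in self.audio_encoder.parameters():
            p.requires_grad = False

        # (1) Intra-modality adaptation: trainable adapters per modality.
        self.video_adapter = Adapter(dim)
        self.audio_adapter = Adapter(dim)

        # (2) Cross-modal alignment: video tokens attend to audio tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

        # (3) Temporal adaptation: a shallow transformer over the frame axis.
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)

        self.head = nn.Linear(dim, num_classes)

    def forward(self, video: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # video: (B, T_v, 768) per-frame features; audio: (B, T_a, 1024) tokens.
        v = self.video_adapter(self.video_encoder(video))
        a = self.audio_adapter(self.audio_encoder(audio))
        v = v + self.cross_attn(query=v, key=a, value=a)[0]  # align modalities
        v = self.temporal(v)                                 # model dynamics
        return self.head(v.mean(dim=1))                      # clip-level logits


model = MultimodalDFERSketch()
logits = model(torch.randn(2, 16, 768), torch.randn(2, 50, 1024))
print(logits.shape)  # torch.Size([2, 7])
```

Only the adapters, cross-attention, temporal transformer, and classification head are trained here; the encoders stay frozen, which is the parameter-efficient setting the abstract's framing suggests.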