Multimodal Continuous Emotion Recognition: A Technical Report for ABAW5

arXiv (2023)

Abstract
We used two multimodal models for continuous valence-arousal recognition from visual, audio, and linguistic information. The first model is the one we used in ABAW2 and ABAW3, which employs leader-follower attention. The second model shares the same architecture for spatial and temporal encoding; its fusion block instead employs a compact and straightforward channel attention borrowed from the End2You toolkit. Unlike our previous attempts, which used VGGish features directly as the audio input, this time we feed log-Mel spectrograms to a pre-trained VGG model and fine-tune it during training. To make full use of the data and alleviate over-fitting, we carry out cross-validation and select the fold with the highest concordance correlation coefficient (CCC) for submission. The code will be made available at https://github.com/sucv/ABAW5.
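For concreteness, below is a minimal PyTorch sketch of a squeeze-and-excitation-style channel attention block applied to concatenated modality features. The module name, the reduction ratio, and the (batch, time, channels) layout are illustrative assumptions, not the exact End2You implementation.

import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    """Squeeze-and-excitation-style channel attention over concatenated
    modality features (hypothetical sketch, not End2You's exact code)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels), channels = concatenated modality features
        weights = self.gate(x.mean(dim=1))  # squeeze over the time axis
        return x * weights.unsqueeze(1)     # re-weight each channel

# Usage: 8 clips, 100 frames, 256-dim fused features (shapes are assumptions)
fused = ChannelAttentionFusion(channels=256)
out = fused(torch.randn(8, 100, 256))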
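The concordance correlation coefficient used for fold selection is CCC = 2·cov(x, y) / (σx² + σy² + (μx − μy)²). A small NumPy sketch follows; `fold_preds` and `fold_labels` are hypothetical per-fold prediction and label arrays.

import numpy as np

def ccc(x: np.ndarray, y: np.ndarray) -> float:
    """Concordance correlation coefficient between two 1-D sequences."""
    x_mean, y_mean = x.mean(), y.mean()
    cov = ((x - x_mean) * (y - y_mean)).mean()
    return 2.0 * cov / (x.var() + y.var() + (x_mean - y_mean) ** 2)

# Pick the fold with the highest CCC for submission (names are hypothetical).
scores = [ccc(p, t) for p, t in zip(fold_preds, fold_labels)]
best_fold = int(np.argmax(scores))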
Keywords
ABAW2, ABAW5, audio feature, audio information, compact channel attention, continuous valence-arousal recognition, fusion block, leader-follower attention, linguistic information, multimodal continuous emotion recognition, multimodal models, pre-trained VGG model, spatial encoding, straightforward channel attention, technical report, temporal encoding, VGGish feature, visual information