Video Event Extraction with Multi-View Interaction Knowledge Distillation

AAAI 2024 (2024)

Abstract
Video event extraction (VEE) aims to extract key events from a video and generate the event arguments for their semantic roles. Although existing methods have achieved promising results, they still lack an elaborate learning strategy that adequately considers: (1) inter-object interaction, which reflects the relations between objects; and (2) inter-modality interaction, which aligns features from the text and video modalities. In this paper, we propose a Multi-view Interaction with knowledge Distillation (MID) framework to address these problems via the Knowledge Distillation (KD) mechanism. Specifically, we propose self-Relational KD (self-RKD) to enhance inter-object interaction: the relation between objects is measured by a distance metric, and the high-level relational knowledge from a deeper layer serves as guidance for boosting a shallower layer in the video encoder. Meanwhile, to improve inter-modality interaction, we propose Layer-to-layer KD (LKD), which integrates additional cross-modal supervision (i.e., the results of cross-attention) with the textual supervising signal when training each transformer decoder layer. Extensive experiments show that, without any additional parameters, MID achieves state-of-the-art performance in VEE compared to other strong methods.
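The self-RKD idea described above (matching pairwise relational structure between a shallow and a deep encoder layer) can be sketched as follows. This is a minimal illustration based on the standard relational-KD formulation, assuming Euclidean pairwise distances normalized per layer and an MSE matching loss; the paper's exact distance metric, normalization, and loss are not specified in the abstract, and the function names are hypothetical.

```python
import numpy as np

def pairwise_distances(feats):
    """Normalized pairwise Euclidean distances between object features.

    feats: (n_objects, dim) array of object features from one encoder layer.
    """
    diff = feats[:, None, :] - feats[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    # Normalize by the mean of the non-zero distances so that the
    # relational structure is comparable across layers of different scale.
    mu = d[d > 0].mean() if (d > 0).any() else 1.0
    return d / mu

def self_rkd_loss(shallow_feats, deep_feats):
    """Distill the deep layer's relational knowledge into the shallow layer.

    The deep layer's distance matrix is treated as a fixed teacher signal,
    and the shallow layer is trained to reproduce its relational structure.
    """
    d_student = pairwise_distances(shallow_feats)
    d_teacher = pairwise_distances(deep_feats)  # no gradient in practice
    return ((d_student - d_teacher) ** 2).mean()
```

The loss is zero when the two layers encode identical relational structure and grows as the shallow layer's object-to-object relations drift from the deep layer's; in a real training loop the teacher distances would be detached from the gradient computation.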
Keywords
NLP: Information Extraction,CV: Language and Vision,DMKM: Mining of Visual, Multimedia & Multimodal Data,ML: Multimodal Learning,NLP: Language Grounding & Multi-modal NLP