Cross-Domain Modality Fusion for Dense Video Captioning

IEEE Transactions on Artificial Intelligence (2022)

Abstract
Dense video captioning requires localizing and describing multiple events in long videos. Prior works detect events by relying solely on the visual content, completely ignoring the semantics (captions) associated with those events. This is undesirable because human-provided captions often also describe events that are not visually present or are too subtle to detect. In this research, we propose to capitalize on this natural kinship between events and their human-provided descriptions. We propose a semantic contextualization network that encodes the visual content of videos by representing it in a semantic space. The representation is further refined to incorporate temporal information and transformed into event descriptors using a hierarchical application of the short Fourier transform. Our proposal network exploits the fusion of semantic and visual content, enabling it to generate semantically meaningful event proposals. For each proposed event, we attentively fuse its hidden state and descriptors to compute a discriminative representation for the subsequent captioning network. Thorough experiments on the standard large-scale ActivityNet Captions dataset, and additionally on the YouCook-II dataset, show that our method achieves competitive or better performance on multiple popular metrics for this problem.
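The abstract does not detail how the hierarchical short Fourier transform produces event descriptors; the NumPy sketch below is one plausible illustration, assuming per-frame semantic features as input. The function name `short_fourier_descriptor` and the hyperparameters `num_levels` and `num_coeffs` are hypothetical and not taken from the paper.

```python
import numpy as np

def short_fourier_descriptor(features, num_levels=3, num_coeffs=4):
    """Hierarchical short Fourier transform over frame features (illustrative).

    features : (T, D) array of per-frame semantic features.
    Returns a fixed-length descriptor built by concatenating low-frequency
    FFT magnitudes computed over progressively finer temporal splits.
    """
    T, D = features.shape
    descriptor = []
    for level in range(num_levels):
        num_segments = 2 ** level          # 1, 2, 4, ... temporal segments
        for seg in np.array_split(features, num_segments, axis=0):
            # FFT along the temporal axis for each feature dimension
            spectrum = np.fft.rfft(seg, axis=0)
            # keep magnitudes of the first few (low-frequency) coefficients
            mags = np.abs(spectrum[:num_coeffs])
            # zero-pad if the segment yields fewer coefficients
            if mags.shape[0] < num_coeffs:
                pad = np.zeros((num_coeffs - mags.shape[0], D))
                mags = np.vstack([mags, pad])
            descriptor.append(mags.ravel())
    return np.concatenate(descriptor)

# Example: 64 frames of 300-dim semantic features -> fixed-length descriptor
frames = np.random.randn(64, 300).astype(np.float32)
desc = short_fourier_descriptor(frames)
print(desc.shape)  # (1 + 2 + 4) segments * 4 coeffs * 300 dims = (8400,)
```

A scheme like this yields a descriptor of fixed length regardless of event duration, which is convenient for downstream proposal and captioning networks that expect fixed-size inputs.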
Keywords
Context modeling, dense video captioning (DVC), event localization, language and vision, video captioning