MRCap: Multi-modal and Multi-level Relationship-based Dense Video Captioning

ICME 2023

Abstract
Dense video captioning, with the objective of describing a sequence of events in a video, has received much attention recently. As events in a video are highly correlated, leveraging relationships among events helps generate coherent captions. To utilize relationships among events, existing methods mainly enrich event representations with their context, either in the form of vision (i.e., video segments) or by combining vision and language (i.e., captions). However, these methods do not explicitly exploit the correspondence between these two modalities. Moreover, the video-level context spanning multiple events is not fully exploited. In this paper, we propose MRCap, a novel relationship-based model for dense video captioning. The key of MRCap is a multi-modal and multi-level event relationship module (MMERM). MMERM exploits the correspondence between vision and language at both the event level and the video level via contrastive learning. Experiments on the ActivityNet Captions and YouCook2 datasets demonstrate that MRCap achieves state-of-the-art performance.
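The abstract describes aligning visual and language representations of events via contrastive learning. As a rough illustration only (the paper's exact MMERM formulation is not given here), the sketch below shows a generic symmetric InfoNCE-style loss over matched (vision, caption) embedding pairs; the names event_vis, event_txt, and the temperature value are hypothetical.

import torch
import torch.nn.functional as F

def info_nce(event_vis: torch.Tensor, event_txt: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of matched (vision, text) pairs.

    event_vis: (B, D) visual embeddings of events (or whole videos).
    event_txt: (B, D) language embeddings of the corresponding captions.
    Matched pairs sit on the diagonal of the similarity matrix; the other
    entries in the batch act as negatives.
    """
    v = F.normalize(event_vis, dim=-1)
    t = F.normalize(event_txt, dim=-1)
    logits = v @ t.T / temperature          # (B, B) scaled cosine similarities
    labels = torch.arange(v.size(0), device=v.device)
    # Average the vision-to-text and text-to-vision cross-entropy terms.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))

Applying such a loss at both the event level (segment vs. its caption) and the video level (whole video vs. concatenated captions) would correspond to the two levels the abstract mentions.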
Keywords
Dense video captioning, event, multi-modal and multi-level, relationship