FRVidSwin: A Novel Video Captioning Model with Automatical Removal of Redundant Frames

ICIC (5) (2023)

Abstract
Video captioning aims to generate natural language sentences that describe the visual content of given videos, which requires long-range temporal modeling and consumes significant computational resources. Existing methods typically operate on frames uniformly sampled from videos, leading to time-scale inconsistency and redundancy across contiguous frames. In this paper, we propose a transformer-based architecture called the Frame-Reduce Swin transformer (FRVidSwin) for video captioning. Our method takes as input a frame sequence together with the frame indices sampled from a video, and outputs a natural language sentence describing its content. The FRVidSwin Encoder automatically evaluates the importance of each frame using self-attention and discards redundant ones, reducing computational cost. This allows the model to focus on informative frames to generate high-quality features, improving text synthesis. We further propose a Time Index Position Encoding based on RoFormer, in which the frame indices from the original video are retained and directly encoded. This keeps the time flow consistent with the original video, helping the model perceive slow and fast motions. Experimental results show that our model generates high-quality captions and outperforms mainstream models, such as HMN and ORG-TRL, on the MSVD and MSR-VTT benchmarks.
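The two mechanisms named in the abstract can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration (not the authors' implementation): a RoFormer-style rotary encoding applied at the *original* frame indices of the sampled frames, so that gaps between sampled frames are preserved in the positional signal, and a self-attention importance score that keeps only the top-k most-attended frames. Feature vectors are plain Python lists for portability; the function names, `base` constant, and `keep_ratio` parameter are all assumptions.

```python
import math

def time_index_rotary(x, frame_idx, base=10000.0):
    """Rotary position encoding evaluated at the original frame indices
    (a sketch of a Time-Index Position Encoding; names are hypothetical)."""
    # x: list of feature vectors, each of even length d
    # frame_idx: original index of each sampled frame in the source video
    d = len(x[0])
    half = d // 2
    out = []
    for vec, pos in zip(x, frame_idx):
        rotated = [0.0] * d
        for i in range(half):
            # rotate the (i, half+i) coordinate pair by pos * base^(-i/half)
            theta = pos * base ** (-i / half)
            c, s = math.cos(theta), math.sin(theta)
            a, b = vec[i], vec[half + i]
            rotated[i] = a * c - b * s
            rotated[half + i] = a * s + b * c
        out.append(rotated)
    return out

def keep_informative_frames(feats, keep_ratio=0.5):
    """Score each frame by the mean softmax attention it receives from all
    frames, then keep the top-k in temporal order (an assumed reading of
    redundant-frame removal; the paper's exact criterion may differ)."""
    t, d = len(feats), len(feats[0])
    scale = math.sqrt(d)
    scores = [0.0] * t
    for q in feats:
        logits = [sum(a * b for a, b in zip(q, k)) / scale for k in feats]
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        for j in range(t):
            scores[j] += exps[j] / z / t
    k = max(1, int(t * keep_ratio))
    keep = sorted(sorted(range(t), key=lambda j: -scores[j])[:k])
    return [feats[j] for j in keep], keep
```

Because the rotary map is a per-pair rotation, it preserves vector norms while making attention scores depend on index *differences*, which is what lets irregularly spaced frame indices still convey a consistent time flow.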
Keywords
novel video captioning model, redundant frames, automatic removal