Sequence to Sequence - Video to Text

ICCV (2015)

Citations: 1716 | Views: 424
Abstract
Real-world videos often have complex dynamics; methods for generating open-domain video descriptions should be sensitive to temporal structure and allow both the input (a sequence of frames) and the output (a sequence of words) to be of variable length. To approach this problem we propose a novel end-to-end sequence-to-sequence model to generate captions for videos. For this we exploit recurrent neural networks, specifically LSTMs, which have demonstrated state-of-the-art performance in image caption generation. Our LSTM model is trained on video-sentence pairs and learns to associate a sequence of video frames with a sequence of words in order to generate a description of the event in the video clip. Our model is naturally able to learn the temporal structure of the frame sequence as well as a sequence model of the generated sentences, i.e., a language model. We evaluate several variants of our model that exploit different visual features on a standard set of YouTube videos and two movie description datasets (M-VAD and MPII-MD).
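
The idea in the abstract can be illustrated with a minimal sketch, assuming PyTorch and a simplified encoder-decoder variant: one LSTM reads per-frame CNN features and its final state conditions a second LSTM that emits the caption word by word. Note that the paper's actual S2VT model uses a single two-layer LSTM stack that reads frames and then emits words in one continuous pass, so the class name, dimensions, and structure below are illustrative assumptions rather than the authors' implementation.

import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    """Simplified encoder-decoder sketch of an LSTM video captioner (not the exact S2VT stack)."""
    def __init__(self, feat_dim=4096, hidden_dim=512, vocab_size=10000, embed_dim=512):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)   # reads per-frame CNN features
        self.embed = nn.Embedding(vocab_size, embed_dim)                  # word embeddings
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)   # generates the sentence
        self.out = nn.Linear(hidden_dim, vocab_size)                      # hidden state -> vocabulary logits

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, num_frames, feat_dim); captions: (batch, num_words) of token ids
        _, state = self.encoder(frame_feats)                     # summarize the variable-length frame sequence
        dec_out, _ = self.decoder(self.embed(captions), state)   # language model conditioned on the video
        return self.out(dec_out)                                 # per-step vocabulary logits

# Hypothetical usage with random tensors:
model = VideoCaptioner()
feats = torch.randn(2, 30, 4096)           # 2 clips, 30 frames of 4096-d CNN features each
caps = torch.randint(0, 10000, (2, 12))    # 2 reference captions of 12 tokens
logits = model(feats, caps)                # shape: (2, 12, 10000)
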
Keywords
real-world videos,open-domain video descriptions,temporal structure,variable length input,variable length output,frame sequence,word sequence,end-to-end sequence-to-sequence model,video captions,recurrent neural networks,image caption generation,LSTM model,video-sentence pairs,video frame sequence,video clip,temporal structure learning,language model,visual features,YouTube videos,movie description dataset,M-VAD dataset,MPII-MD dataset,sequence-to-sequence video-to-text approach,S2VT approach