Fast Fourier Inception Networks for Occluded Video Prediction

Ping Li, Chenhan Zhang, Xianghua Xu

IEEE TRANSACTIONS ON MULTIMEDIA (2024)

Abstract
Video prediction is a pixel-level task that generates future frames from historical frames. Videos often contain continuous complex motions, such as object overlapping and scene occlusion, which pose great challenges to this task. Previous works either fail to capture long-term temporal dynamics well or do not handle occlusion masks. To address these issues, we develop fully convolutional Fast Fourier Inception Networks for video prediction, termed FFINet, which comprise two primary components: the occlusion inpainter and the spatiotemporal translator. The former adopts fast Fourier convolutions to enlarge the receptive field, so that the inpainter can fill missing (occluded) areas with complex geometric structures. The latter employs stacked Fourier transform inception modules to learn temporal evolution via group convolutions and spatial movement via channel-wise Fourier convolutions, capturing both local and global spatiotemporal features. This encourages the generation of more realistic, high-quality future frames. To optimize the model, a recovery loss is added to the objective, i.e., minimizing the mean squared error between the ground-truth frame and the recovered frame. Both quantitative and qualitative experimental results on five benchmarks, including Moving MNIST, TaxiBJ, Human3.6M, Caltech Pedestrian, and KTH, demonstrate the superiority of the proposed approach.
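To make the spatial branch concrete, below is a minimal PyTorch sketch of a channel-wise Fourier convolution in the spirit of the fast Fourier convolutions described above: a pointwise convolution applied in the frequency domain gives every output position a frame-wide receptive field. The module name, layer sizes, and normalization are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of a channel-wise Fourier convolution (assumed design,
# not the official FFINet code). Requires PyTorch >= 1.8 for torch.fft.
import torch
import torch.nn as nn

class SpectralTransform(nn.Module):
    """Mixes channels pointwise in the 2-D frequency domain, so each
    output location aggregates information from the whole frame."""

    def __init__(self, channels: int):
        super().__init__()
        # Real and imaginary parts are stacked, doubling the channels.
        self.conv = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # 2-D real FFT over the spatial dims: complex (b, c, h, w//2 + 1).
        freq = torch.fft.rfft2(x, norm="ortho")
        # Treat real/imaginary parts as extra channels and mix them pointwise.
        z = self.relu(self.conv(torch.cat([freq.real, freq.imag], dim=1)))
        real, imag = z.chunk(2, dim=1)
        # Inverse FFT back to the original spatial resolution.
        return torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")

x = torch.randn(2, 16, 32, 32)
y = SpectralTransform(16)(x)  # output shape matches input: (2, 16, 32, 32)
```

The recovery loss mentioned in the abstract would then reduce to a mean squared error between each recovered frame and its ground truth, e.g. torch.nn.functional.mse_loss(recovered, target).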
Keywords
Video prediction, occlusion, temporal dynamics, inpainting, Fourier transform