Convolutional Two-Stream Network Fusion for Video Action Recognition

2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Cited by 3304 | Views 337
Abstract
Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information. We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatio-temporal information. We make the following findings: (i) that rather than fusing at the softmax layer, a spatial and temporal network can be fused at a convolution layer without loss of performance, but with a substantial saving in parameters; (ii) that it is better to fuse such networks spatially at the last convolutional layer than earlier, and that additionally fusing at the class prediction layer can boost accuracy; finally (iii) that pooling of abstract convolutional features over spatiotemporal neighbourhoods further boosts performance. Based on these studies we propose a new ConvNet architecture for spatiotemporal fusion of video snippets, and evaluate its performance on standard benchmarks where this architecture achieves state-of-the-art results.
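The "conv fusion" idea summarized above — fusing the spatial and temporal towers at a convolution layer rather than at the softmax layer — can be sketched as channel-wise stacking of the two feature maps followed by a 1×1 convolution. The following NumPy sketch is illustrative only (shapes, names, and the random weights are assumptions, not the authors' implementation); it also shows that with identity-block weights, conv fusion reduces to simple sum fusion:

```python
import numpy as np

def conv_fusion(spatial_feat, temporal_feat, weights, bias):
    """Fuse two conv feature maps of shape (C, H, W) by stacking them
    along the channel axis and applying a 1x1 convolution.

    weights: (C_out, 2*C) filter bank for the 1x1 conv (illustrative shape)
    bias:    (C_out,)
    Returns a fused map of shape (C_out, H, W).
    """
    # Stack channels: (2*C, H, W)
    stacked = np.concatenate([spatial_feat, temporal_feat], axis=0)
    # 1x1 conv == per-pixel linear map over channels
    out = np.tensordot(weights, stacked, axes=([1], [0]))
    return out + bias[:, None, None]

# Example: random features from both streams (hypothetical sizes)
C, H, W = 4, 3, 3
rng = np.random.default_rng(0)
xs = rng.random((C, H, W))   # spatial-stream feature map
xt = rng.random((C, H, W))   # temporal-stream feature map

fused = conv_fusion(xs, xt, rng.random((8, 2 * C)), np.zeros(8))
print(fused.shape)  # (8, 3, 3)

# With weights [I | I] and zero bias, conv fusion degenerates to sum fusion:
w_sum = np.concatenate([np.eye(C), np.eye(C)], axis=1)
assert np.allclose(conv_fusion(xs, xt, w_sum, np.zeros(C)), xs + xt)
```

The learnable 1×1 filter bank is what distinguishes conv fusion from fixed sum fusion: the network can learn arbitrary cross-channel correspondences between the appearance and motion streams instead of assuming channel i of one stream matches channel i of the other.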
Keywords
convolutional two-stream network fusion,video action recognition,convolutional neural networks,ConvNets,human action recognition,softmax layer,spatial network,temporal network,convolution layer,class prediction layer,abstract convolutional features,spatiotemporal neighbourhoods,ConvNet architecture,spatiotemporal fusion,video snippets