Convolutional Two-Stream Network Fusion for Video Action Recognition
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016
Abstract
Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information. We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatio-temporal information. We make the following findings: (i) that rather than fusing at the softmax layer, a spatial and temporal network can be fused at a convolution layer without loss of performance, but with a substantial saving in parameters; (ii) that it is better to fuse such networks spatially at the last convolutional layer than earlier, and that additionally fusing at the class prediction layer can boost accuracy; finally (iii) that pooling of abstract convolutional features over spatiotemporal neighbourhoods further boosts performance. Based on these studies we propose a new ConvNet architecture for spatiotemporal fusion of video snippets, and evaluate its performance on standard benchmarks where this architecture achieves state-of-the-art results.
Keywords
convolutional two-stream network fusion, video action recognition, convolutional neural networks, ConvNets, human action recognition, softmax layer, spatial network, temporal network, convolution layer, class prediction layer, abstract convolutional features, spatiotemporal neighbourhoods, ConvNet architecture, spatiotemporal fusion, video snippets
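The fusion strategy described in the abstract, combining the spatial and temporal streams at a convolutional layer rather than at the softmax layer, can be sketched as channel stacking followed by a 1x1 convolution. The following NumPy snippet is a minimal illustration of that idea, not the paper's implementation; all shapes, names, and weights are hypothetical.

```python
import numpy as np

# Hypothetical feature maps from the spatial and temporal streams
# at the last convolutional layer, shape (channels, height, width).
C, H, W = 4, 3, 3
rng = np.random.default_rng(0)
x_spatial = rng.standard_normal((C, H, W))
x_temporal = rng.standard_normal((C, H, W))

def conv_fusion(a, b, weights, bias):
    """Fuse two feature maps: stack them along the channel axis,
    then apply a 1x1 convolution (a per-pixel linear map across
    channels), so the network can learn correspondences between
    channels of the two streams."""
    stacked = np.concatenate([a, b], axis=0)  # (2C, H, W)
    # A 1x1 convolution is a matrix multiply over the channel axis.
    fused = np.einsum('oc,chw->ohw', weights, stacked)
    return fused + bias[:, None, None]

# Illustrative fusion weights: (out_channels, 2C). Keeping
# out_channels = C reduces the stacked map back to one stream's
# channel count, which is where the parameter saving comes from
# compared with carrying two full towers to the classifier.
weights = rng.standard_normal((C, 2 * C))
bias = np.zeros(C)

y = conv_fusion(x_spatial, x_temporal, weights, bias)
print(y.shape)  # (4, 3, 3): spatial size preserved, channels back to C
```

Fusing earlier than the last convolutional layer, or only at the class prediction layer, would change where this linear mixing happens, which is the design space the abstract's findings (i) and (ii) explore.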