Convolutional Networks With Channel and STIPs Attention Model for Action Recognition in Videos

IEEE Transactions on Multimedia (2020)

Abstract
With the help of convolutional neural networks (CNNs), video-based human action recognition has made significant progress. Spatial and channel-wise CNN features provide rich information for powerful image description. However, CNNs cannot model the long-term temporal dependencies of an entire video, nor can they focus well on the informative motion regions of actions. To address these two problems, we propose a novel video-based action recognition framework in this paper. We first represent videos with dynamic image sequences (DISs), which effectively describe videos by modeling local spatial-temporal dynamics and dependencies. Then a channel and spatial-temporal interest points (STIPs) attention model (CSAM) based on CNNs is proposed to focus on the discriminative channels in the networks and the informative spatial motion regions of human actions. Specifically, channel attention (CA) is implemented by automatically learning channel-wise convolutional features and assigning different weights to different channels. STIPs attention (SA) is encoded by projecting the STIPs detected on frames of the dynamic image sequences into the corresponding convolutional feature map space. The proposed CSAM is embedded after the CNN convolutional layers to refine the feature maps, followed by global average pooling to produce effective frame-level video representations. Finally, the frame-level representations are fed into an LSTM to capture temporal dependencies and perform classification. Experiments on three challenging RGB-D datasets show that our method performs better than state-of-the-art approaches while using only depth data.
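The abstract only outlines the attention mechanism, so the following is a minimal, hypothetical sketch (not the authors' released code) of how a channel plus STIPs-style spatial attention block could refine a convolutional feature map, assuming a PyTorch setting. The class name `CSAMSketch`, the bottleneck `reduction` factor, and the use of a precomputed binary STIP mask are all illustrative assumptions; the paper's actual STIP detection and projection details are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CSAMSketch(nn.Module):
    """Illustrative channel + STIPs-style spatial attention over a CNN feature map.

    Channel attention: global-average-pool each channel, pass the vector through
    a small bottleneck MLP, and sigmoid-gate the channels.
    STIPs attention: a binary map of detected interest points on the frame is
    projected (downsampled) onto the feature-map grid and used to emphasize
    those spatial locations. The STIP detector itself is outside this sketch.
    """

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, feat: torch.Tensor, stip_mask: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) feature map of one dynamic image
        # stip_mask: (B, 1, H_img, W_img) binary map of detected STIPs on the frame
        b, c, h, w = feat.shape

        # Channel attention: per-channel weights from global average pooling.
        ca = self.fc(feat.mean(dim=(2, 3)))                 # (B, C)
        feat = feat * ca.view(b, c, 1, 1)

        # STIPs attention: project the interest-point map onto the feature grid.
        sa = F.adaptive_max_pool2d(stip_mask, (h, w))       # (B, 1, H, W)
        feat = feat * (1.0 + sa)                            # emphasize STIP regions

        return feat
```

In such a sketch, the `1 + sa` gating keeps responses at non-STIP locations intact rather than zeroing them out; the refined map would then be globally average-pooled per frame and the resulting sequence fed to an LSTM, mirroring the pipeline described above.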
Keywords
Action recognition, dynamic image sequence, channel attention, spatial-temporal interest points (STIPs) attention, deep networks