Indexed in:
Abstract:
Recently, Convolutional Networks (ConvNets) have become the dominant approach to the human activity classification problem. We investigate current standard ConvNet architectures and pinpoint one of their main limitations: spatial-temporal dependencies are captured simply by a global pooling operation, which may fail to capture the complex long-term spatial-temporal relationships in videos. In this work, we propose a Spatial Temporal Attentional Glimpse (STAG) module to overcome this shortcoming. Specifically, the input to the STAG module is a 3D tensor that is first processed by a spatial-temporal attention block. The Spatial Temporal Glimpse block then decomposes the resulting tensor into two low-dimensional tensors and fuses the results of operating on them. The proposed STAG module is pluggable, easy to learn, and computationally efficient. We conduct extensive ablation studies to show that our model, incorporating the STAG block, substantially improves performance over the state of the art. All experimental results, trained models, and complete source code will be released to facilitate further studies on this problem.
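The abstract outlines a three-step pipeline: a spatial-temporal attention block reweights a 3D feature tensor, a glimpse block decomposes it into two low-dimensional tensors, and the two branch results are fused. The sketch below illustrates that data flow with NumPy; the specific attention form (softmax over all positions), the pooling-based decomposition, and the broadcast-sum fusion are all assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a flat array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def stag_sketch(x):
    """Hypothetical sketch of the STAG data flow (assumed details).

    x: 3D feature tensor of shape (T, H, W) -- time, height, width.
    """
    # 1. Spatial-temporal attention block: reweight every (t, h, w)
    #    position (softmax attention is an assumption; the abstract
    #    does not specify the attention form).
    attn = softmax(x.reshape(-1)).reshape(x.shape)
    x = x * attn * x.size  # rescale so magnitudes stay comparable

    # 2. Glimpse: decompose into two low-dimensional tensors by
    #    average-pooling along complementary axes (assumed).
    temporal = x.mean(axis=(1, 2))   # shape (T,)
    spatial = x.mean(axis=0)         # shape (H, W)

    # 3. Fuse the two branch results back to (T, H, W) via a
    #    broadcast sum (assumed fusion operator).
    return temporal[:, None, None] + spatial[None, :, :]

feat = np.random.rand(8, 7, 7).astype(np.float32)
out = stag_sketch(feat)
print(out.shape)  # (8, 7, 7)
```

Because the module maps a (T, H, W) tensor to a tensor of the same shape, it is pluggable in the sense the abstract claims: it can be inserted between existing ConvNet stages without changing surrounding layer shapes.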
Keywords:
Corresponding author:
Email address:
Source:
2019 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP)
ISSN: 1522-4880
Year: 2019
Pages: 4040-4044
Language: English
Affiliated department: