Indexed in:
Abstract:
Effectively modeling spatio-temporal information in videos is key to improving action recognition performance. In this work, we propose 3D residual networks with channel and spatial attention modules for action recognition. The proposed architecture extracts spatio-temporal features directly. The channel and spatial attention modules help the network learn what and where to emphasize or suppress, with a virtually negligible increase in computational cost. Specifically, we apply the channel attention module and then the spatial attention module to each slice tensor of the intermediate feature map to produce channel and spatial attention maps. The attention maps are then multiplied with the input feature map to reweight important features. We validate our network through extensive experiments and visualizations on the HMDB-51 and UCF-101 datasets.
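The sequential reweighting described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no layer details, so the sketch assumes a "slice tensor" is one temporal slice of a (C, T, H, W) feature map, that channel attention uses average and max pooling followed by a shared two-layer MLP, and that spatial attention is a scalar-weighted mix of channel-wise average and max maps rather than a learned convolution. All function names, weights, and shapes are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(x, w1, w2):
    """x: (C, H, W) slice. Returns per-channel weights of shape (C, 1, 1)."""
    avg = x.mean(axis=(1, 2))  # (C,) average-pooled descriptor
    mx = x.max(axis=(1, 2))    # (C,) max-pooled descriptor

    def mlp(v):
        # Assumed shared bottleneck MLP with a ReLU in between.
        return w2 @ np.maximum(w1 @ v, 0.0)

    return sigmoid(mlp(avg) + mlp(mx))[:, None, None]


def spatial_attention(x, w_avg=1.0, w_max=1.0):
    """x: (C, H, W) slice. Returns per-location weights of shape (1, H, W).

    Simplification (assumption): a scalar mix of the two pooled maps
    stands in for a learned convolution.
    """
    avg = x.mean(axis=0)  # (H, W)
    mx = x.max(axis=0)    # (H, W)
    return sigmoid(w_avg * avg + w_max * mx)[None, :, :]


def attend(features, w1, w2):
    """features: (C, T, H, W). Channel then spatial attention per temporal slice."""
    out = np.empty_like(features)
    for t in range(features.shape[1]):
        x = features[:, t]                    # (C, H, W)
        x = x * channel_attention(x, w1, w2)  # reweight channels ("what")
        x = x * spatial_attention(x)          # reweight locations ("where")
        out[:, t] = x
    return out


# Toy example with random weights (hypothetical sizes).
rng = np.random.default_rng(0)
C, T, H, W, r = 8, 4, 5, 5, 2
features = rng.standard_normal((C, T, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
refined = attend(features, w1, w2)
```

Because both attention maps pass through a sigmoid, every weight lies strictly between 0 and 1, so the output keeps the shape of the input while each value is scaled down according to its estimated importance.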
Keywords:
Corresponding author information:
Source:
2020 CHINESE AUTOMATION CONGRESS (CAC 2020)
ISSN: 2688-092X
Year: 2020
Pages: 5171-5174
Language: English