Indexed by:
Abstract:
Semantic segmentation is a research hotspot in the field of computer vision. It refers to assigning every pixel in an image to a semantic class. As a fundamental problem in scene understanding, semantic segmentation is widely used in various intelligent tasks. In recent years, with the success of convolutional neural networks (CNNs) in many computer vision applications, fully convolutional networks (FCNs) have shown great potential for RGB semantic segmentation. However, semantic segmentation remains a challenging task due to the complexity of scene types, severe object occlusions and varying illumination. With the availability of consumer RGB-D sensors such as the RealSense 3D Camera and Microsoft Kinect, RGB images and depth information can now be captured simultaneously. Depth information describes 3D geometric structure that may be missing from RGB-only images, and it can significantly reduce classification errors and improve the accuracy of semantic segmentation. To make effective use of RGB and depth information, it is crucial to find an efficient multi-modal information fusion method. According to the stage at which fusion occurs, existing RGB-D feature fusion methods can be divided into three types: early fusion, late fusion and middle fusion. However, most previous studies fail to make effective use of the complementary information between the RGB and depth modalities: they simply fuse RGB features and depth features by equal-weight concatenation or summation, which fails to extract the complementary information between the two modalities and suppresses modality-specific information. In addition, the semantic information carried by the high-level features of the two modalities is not taken into account, although it is very important for fine-grained semantic segmentation. To solve the above problems, in this paper we present a novel Attention-aware and Semantic-aware Multi-modal Fusion Network (ASNet) for RGB-D semantic segmentation. Our network effectively fuses multi-level RGB-D features through Attention-aware Multi-modal Fusion (AMF) blocks and Semantic-aware Multi-modal Fusion (SMF) blocks. Specifically, in the AMF blocks, a cross-modal attention mechanism is designed so that RGB features and depth features guide and refine each other through their complementary characteristics, yielding feature representations rich in spatial location information. In addition, the SMF blocks model the semantic interdependencies between multi-modal features by integrating semantically associated feature channels across the RGB and depth features, extracting more precise semantic feature representations. The two blocks are integrated into a two-branch encoder-decoder architecture, which gradually restores image resolution through consecutive up-sampling operations and combines low-level and high-level features through skip connections to achieve high-resolution prediction. To optimize the training process, we apply deep supervision over the multi-level decoder features. Our network effectively learns the complementary characteristics of the two modalities and models the semantic contextual interdependencies between RGB features and depth features.
Experimental results on two challenging public RGB-D indoor semantic segmentation datasets, i.e., SUN RGB-D and NYU Depth v2, show that our network outperforms existing RGB-D semantic segmentation methods, improving segmentation performance by 1.9% and 1.2% in mean accuracy and mean IoU, respectively. © 2021, Science Press. All rights reserved.
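The abstract above does not give the internal layout of the AMF block, so the following is only a minimal sketch of a cross-modal attention fusion module in that spirit: each modality produces a spatial attention map that gates the other modality before the two streams are concatenated and reduced. The class name AttentionAwareFusion, the 1x1 gating convolutions and the 3x3 fusion convolution are illustrative assumptions, not the authors' published design.

```python
# Hypothetical sketch of an attention-aware cross-modal fusion block (AMF-style).
# Layer choices are assumptions for illustration; the paper's exact design may differ.
import torch
import torch.nn as nn


class AttentionAwareFusion(nn.Module):
    """Fuse RGB and depth features by letting each modality gate the other
    with a spatial attention map (assumed configuration)."""

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolutions produce single-channel spatial attention maps.
        self.rgb_gate = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())
        self.depth_gate = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())
        # Reduce the concatenated streams back to the original channel width.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # Each modality is re-weighted by an attention map computed from the other,
        # so complementary regions are emphasized instead of being averaged away
        # by equal-weight summation.
        rgb_refined = rgb * self.depth_gate(depth)
        depth_refined = depth * self.rgb_gate(rgb)
        return self.fuse(torch.cat([rgb_refined, depth_refined], dim=1))


if __name__ == "__main__":
    block = AttentionAwareFusion(channels=64)
    rgb_feat = torch.randn(2, 64, 120, 160)
    depth_feat = torch.randn(2, 64, 120, 160)
    print(block(rgb_feat, depth_feat).shape)  # torch.Size([2, 64, 120, 160])
```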
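Similarly, the channel-level semantic fusion of the SMF block could be sketched with a squeeze-and-excitation style gate over the concatenated RGB-D channels, which is one common way to model channel interdependencies. The class name SemanticAwareFusion, the reduction ratio and the final 1x1 projection are assumptions made for this sketch only.

```python
# Hypothetical sketch of a semantic-aware fusion block (SMF-style): channel
# interdependencies between RGB and depth features are modelled with globally
# pooled statistics. The reduction ratio and layout are illustrative assumptions.
import torch
import torch.nn as nn


class SemanticAwareFusion(nn.Module):
    """Re-weight concatenated RGB-D channels by global context so that
    semantically correlated channels from both modalities are emphasized."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        fused = 2 * channels
        self.pool = nn.AdaptiveAvgPool2d(1)      # global semantic context per channel
        self.excite = nn.Sequential(             # model channel interdependencies
            nn.Linear(fused, fused // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(fused // reduction, fused),
            nn.Sigmoid(),
        )
        self.project = nn.Conv2d(fused, channels, kernel_size=1)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        x = torch.cat([rgb, depth], dim=1)       # (N, 2C, H, W)
        n, c, _, _ = x.shape
        weights = self.excite(self.pool(x).view(n, c)).view(n, c, 1, 1)
        return self.project(x * weights)         # (N, C, H, W)


if __name__ == "__main__":
    block = SemanticAwareFusion(channels=512)
    rgb_feat = torch.randn(2, 512, 15, 20)
    depth_feat = torch.randn(2, 512, 15, 20)
    print(block(rgb_feat, depth_feat).shape)  # torch.Size([2, 512, 15, 20])
```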
Keywords:
Corresponding author information:
Email address: