收录:
摘要:
Bio-inspired event cameras record a scene as sparse and asynchronous "events" by detecting per-pixel brightness changes. Such cameras show great potential in challenging scene understanding tasks, benefiting from the imaging advantages of high dynamic range and high temporal resolution. Considering the complementarity between event and standard cameras, we propose a multi-modal fusion network (EISNet) to improve the semantic segmentation performance. The key challenges of this topic lie in (i) how to encode event data to represent accurate scene information and (ii) how to fuse multi-modal complementary features by considering the characteristics of two modalities. To solve the first challenge, we propose an Activity-Aware Event Integration Module (AEIM) to convert event data into frame-based representations with high-confidence details via scene activity modeling. To tackle the second challenge, we introduce the Modality Recalibration and Fusion Module (MRFM) to recalibrate modal-specific representations and then aggregate multi-modal features at multiple stages. MRFM learns to generate modal-oriented masks to guide the merging of complementary features, achieving adaptive fusion. Based on these two core designs, our proposed EISNet adopts an encoder-decoder transformer architecture for accurate semantic segmentation using events and images. Experimental results show that our model outperforms state-of-the-art methods by a large margin on event-based semantic segmentation datasets.
关键词:
通讯作者信息:
来源 :
IEEE TRANSACTIONS ON MULTIMEDIA
ISSN: 1520-9210
年份: 2024
卷: 26
页码: 8639-8650
7 . 3 0 0
JCR@2022
归属院系: