
Authors:

Chen, Hao | Shen, Feihong | Ding, Ding | Deng, Yongjian | Li, Chao

Indexed in:

EI, Scopus, SCIE

Abstract:

Previous multi-modal transformers for RGB-D salient object detection (SOD) generally connect all patches from the two modalities directly to model cross-modal correlation, and perform multi-modal combination without differentiation, which can lead to confusing and inefficient fusion. Instead, we disentangle the cross-modal complementarity from two views to reduce cross-modal fusion ambiguity: 1) Context disentanglement. We argue that modeling long-range dependencies across modalities, as done before, is uninformative due to the severe modality gap. We therefore disentangle the cross-modal complementary contexts into intra-modal self-attention, which explores a global complementary understanding, and spatially aligned inter-modal attention, which captures local cross-modal correlations. 2) Representation disentanglement. Unlike the previous undifferentiated combination of cross-modal representations, we find that cross-modal cues complement each other by enhancing common discriminative regions and by mutually supplementing modality-specific highlights. On top of this, we divide the tokens into consistent and private ones along the channel dimension to disentangle the multi-modal integration path and explicitly strengthen these two complementary fusion modes. By progressively propagating this strategy across layers, the proposed Disentangled Feature Pyramid module (DFP) enables informative cross-modal, cross-level integration and better fusion adaptivity. Comprehensive experiments on a large variety of public datasets verify the efficacy of our context and representation disentanglement and show consistent improvement over state-of-the-art models. Additionally, our cross-modal attention hierarchy is plug-and-play for different backbone architectures (both transformer and CNN based) and downstream tasks, and experiments on a CNN-based model and on RGB-D semantic segmentation verify this generalization ability.
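The two mechanisms in the abstract lend themselves to a compact illustration. Below is a minimal, hypothetical PyTorch sketch of both ideas; the class names, the sigmoid gating scheme, and the even channel split are our own illustrative assumptions, not the authors' released code. It shows intra-modal self-attention paired with a spatially aligned (per-position) inter-modal interaction in place of full cross-modal token mixing, followed by a channel-wise split into consistent and private tokens that separates the two fusion paths.

```python
import torch
import torch.nn as nn


class DisentangledCrossModalBlock(nn.Module):
    """Sketch of context disentanglement: each modality gets its own
    self-attention for global context, while cross-modal interaction is
    restricted to spatially co-located token pairs (no full RGB-D mixing)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.self_attn_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn_dep = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Per-position gates: each token is modulated only by the token of the
        # other modality at the same spatial location.
        self.gate_rgb = nn.Linear(2 * dim, dim)
        self.gate_dep = nn.Linear(2 * dim, dim)

    def forward(self, rgb: torch.Tensor, dep: torch.Tensor):
        # rgb, dep: (B, N, C) patch tokens from the two modality streams.
        rgb = rgb + self.self_attn_rgb(rgb, rgb, rgb, need_weights=False)[0]
        dep = dep + self.self_attn_dep(dep, dep, dep, need_weights=False)[0]
        # Spatially aligned interaction: fuse only co-located token pairs.
        joint = torch.cat([rgb, dep], dim=-1)
        rgb = rgb + torch.sigmoid(self.gate_rgb(joint)) * dep
        dep = dep + torch.sigmoid(self.gate_dep(joint)) * rgb
        return rgb, dep


class ConsistentPrivateFusion(nn.Module):
    """Sketch of representation disentanglement: split channels into a
    'consistent' half, merged multiplicatively to reinforce commonly
    discriminative regions, and a 'private' half, summed to preserve
    modality-specific highlights (even split assumed for simplicity)."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, rgb: torch.Tensor, dep: torch.Tensor) -> torch.Tensor:
        c = rgb.shape[-1] // 2
        rgb_con, rgb_pri = rgb[..., :c], rgb[..., c:]
        dep_con, dep_pri = dep[..., :c], dep[..., c:]
        consistent = rgb_con * dep_con  # enhance shared salient evidence
        private = rgb_pri + dep_pri     # keep modality-specific cues
        return self.proj(torch.cat([consistent, private], dim=-1))


if __name__ == "__main__":
    rgb = torch.randn(2, 196, 64)  # RGB patch tokens (B, N, C)
    dep = torch.randn(2, 196, 64)  # depth patch tokens
    block = DisentangledCrossModalBlock(dim=64)
    fuse = ConsistentPrivateFusion(dim=64)
    out = fuse(*block(rgb, dep))
    print(out.shape)  # torch.Size([2, 196, 64])
```

In the actual DFP, per the abstract, a block like this would be applied and propagated across pyramid levels; the sketch covers a single level only.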

Keywords:

disentanglement; Transformers; Context modeling; Object detection; RGB-D salient object detection; Computer architecture; transformer; cross-modal attention; Task analysis; Computational modeling; Feature extraction

Author affiliations:

  • [ 1 ] [Chen, Hao]Southeast Univ, Sch Comp Sci & Engn, Nanjing 211189, Peoples R China
  • [ 2 ] [Shen, Feihong]Southeast Univ, Sch Comp Sci & Engn, Nanjing 211189, Peoples R China
  • [ 3 ] [Ding, Ding]Southeast Univ, Sch Comp Sci & Engn, Nanjing 211189, Peoples R China
  • [ 4 ] [Chen, Hao]Southeast Univ, Key Lab New Generat Artificial Intelligence Techno, Minist Educ, Nanjing 211189, Peoples R China
  • [ 5 ] [Shen, Feihong]Southeast Univ, Key Lab New Generat Artificial Intelligence Techno, Minist Educ, Nanjing 211189, Peoples R China
  • [ 6 ] [Deng, Yongjian]Beijing Univ Technol, Coll Comp Sci, Beijing 100124, Peoples R China
  • [ 7 ] [Li, Chao]Alibaba Grp, Hangzhou 311121, Peoples R China

Corresponding author:

  • [Chen, Hao]Southeast Univ, Sch Comp Sci & Engn, Nanjing 211189, Peoples R China

Source:

IEEE TRANSACTIONS ON IMAGE PROCESSING

ISSN: 1057-7149

Year: 2024

Volume: 33

Pages: 1699-1709

Impact Factor: 10.600 (JCR@2022)

Citations:

Web of Science Core Collection citations: 12

Scopus citations: 14

ESI Highly Cited Paper listings: 0
