
Authors:

Chen, Hao | Shen, Feihong | Ding, Ding | Deng, Yongjian | Li, Chao

Indexed in:

EI, Scopus, SCIE

Abstract:

Previous multi-modal transformers for RGB-D salient object detection (SOD) generally connect all patches from the two modalities directly to model cross-modal correlation, and perform multi-modal combination without differentiation, which can lead to confusing and inefficient fusion. Instead, we disentangle the cross-modal complementarity from two views to reduce cross-modal fusion ambiguity: 1) Context disentanglement. We argue that modeling long-range dependencies across modalities, as done before, is uninformative due to the severe modality gap. We therefore disentangle the cross-modal complementary contexts into intra-modal self-attention, which explores a global complementary understanding, and spatially aligned inter-modal attention, which captures local cross-modal correlations. 2) Representation disentanglement. Unlike the previous undifferentiated combination of cross-modal representations, we find that cross-modal cues complement each other by enhancing common discriminative regions and by mutually supplementing modality-specific highlights. On top of this, we divide the tokens into consistent and private ones along the channel dimension to disentangle the multi-modal integration path and explicitly strengthen these two complementary fusion modes. By progressively propagating this strategy across layers, the proposed Disentangled Feature Pyramid module (DFP) enables informative cross-modal, cross-level integration and better fusion adaptivity. Comprehensive experiments on a large variety of public datasets verify the efficacy of our context and representation disentanglement and show consistent improvement over state-of-the-art models. Additionally, our cross-modal attention hierarchy is plug-and-play for different backbone architectures (both transformer and CNN based) and downstream tasks, and experiments on a CNN-based model and on RGB-D semantic segmentation verify this generalization ability.
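The two mechanisms in the abstract lend themselves to a compact illustration. Below is a minimal, hypothetical PyTorch sketch of both ideas; the class names, the sigmoid gating scheme, and the even channel split are our own illustrative assumptions, not the authors' released code. It shows intra-modal self-attention paired with a spatially aligned (per-position) inter-modal interaction in place of full cross-modal token mixing, followed by a channel-wise split into consistent and private tokens that separates the two fusion paths.

```python
import torch
import torch.nn as nn


class DisentangledCrossModalBlock(nn.Module):
    """Sketch of context disentanglement: each modality gets its own
    self-attention for global context, while cross-modal interaction is
    restricted to spatially co-located token pairs (no full RGB-D mixing)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.self_attn_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn_dep = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Per-position gates: each token is modulated only by the token of the
        # other modality at the same spatial location.
        self.gate_rgb = nn.Linear(2 * dim, dim)
        self.gate_dep = nn.Linear(2 * dim, dim)

    def forward(self, rgb: torch.Tensor, dep: torch.Tensor):
        # rgb, dep: (B, N, C) patch tokens from the two modality streams.
        rgb = rgb + self.self_attn_rgb(rgb, rgb, rgb, need_weights=False)[0]
        dep = dep + self.self_attn_dep(dep, dep, dep, need_weights=False)[0]
        # Spatially aligned interaction: fuse only co-located token pairs.
        joint = torch.cat([rgb, dep], dim=-1)
        rgb = rgb + torch.sigmoid(self.gate_rgb(joint)) * dep
        dep = dep + torch.sigmoid(self.gate_dep(joint)) * rgb
        return rgb, dep


class ConsistentPrivateFusion(nn.Module):
    """Sketch of representation disentanglement: split channels into a
    'consistent' half, merged multiplicatively to reinforce commonly
    discriminative regions, and a 'private' half, summed to preserve
    modality-specific highlights (even split assumed for simplicity)."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, rgb: torch.Tensor, dep: torch.Tensor) -> torch.Tensor:
        c = rgb.shape[-1] // 2
        rgb_con, rgb_pri = rgb[..., :c], rgb[..., c:]
        dep_con, dep_pri = dep[..., :c], dep[..., c:]
        consistent = rgb_con * dep_con  # enhance shared salient evidence
        private = rgb_pri + dep_pri     # keep modality-specific cues
        return self.proj(torch.cat([consistent, private], dim=-1))


if __name__ == "__main__":
    rgb = torch.randn(2, 196, 64)  # RGB patch tokens (B, N, C)
    dep = torch.randn(2, 196, 64)  # depth patch tokens
    block = DisentangledCrossModalBlock(dim=64)
    fuse = ConsistentPrivateFusion(dim=64)
    out = fuse(*block(rgb, dep))
    print(out.shape)  # torch.Size([2, 196, 64])
```

In the actual DFP, per the abstract, a block like this would be applied and propagated across pyramid levels; the sketch covers a single level only.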

Keywords:

disentanglement; Transformers; Context modeling; Object detection; RGB-D salient object detection; Computer architecture; transformer; cross-modal attention; Task analysis; Computational modeling; Feature extraction

Author affiliations:

  • [ 1 ] [Chen, Hao]Southeast Univ, Sch Comp Sci & Engn, Nanjing 211189, Peoples R China
  • [ 2 ] [Shen, Feihong]Southeast Univ, Sch Comp Sci & Engn, Nanjing 211189, Peoples R China
  • [ 3 ] [Ding, Ding]Southeast Univ, Sch Comp Sci & Engn, Nanjing 211189, Peoples R China
  • [ 4 ] [Chen, Hao]Southeast Univ, Key Lab New Generat Artificial Intelligence Techno, Minist Educ, Nanjing 211189, Peoples R China
  • [ 5 ] [Shen, Feihong]Southeast Univ, Key Lab New Generat Artificial Intelligence Techno, Minist Educ, Nanjing 211189, Peoples R China
  • [ 6 ] [Deng, Yongjian]Beijing Univ Technol, Coll Comp Sci, Beijing 100124, Peoples R China
  • [ 7 ] [Li, Chao]Alibaba Grp, Hangzhou 311121, Peoples R China

Corresponding author:

  • [Chen, Hao]Southeast Univ, Sch Comp Sci & Engn, Nanjing 211189, Peoples R China

Source:

IEEE TRANSACTIONS ON IMAGE PROCESSING

ISSN: 1057-7149

Year: 2024

Volume: 33

Pages: 1699-1709

Impact Factor: 10.600 (JCR@2022)

Citations:

Web of Science Core Collection citations: 12

Scopus citations: 14

ESI Highly Cited Paper listings: 0
