Bridging the Cross-Modality Semantic Gap in Visual Question Answering - Details

Author：

Indexed by：

Scopus SCIE

Abstract：

The　objective　of　visual　question　answering　(VQA)　is　to　adequately　comprehend　a　question　and　identify　relevant　contents　in　an　image　that　can　provide　an　answer.　Existing　approaches　in　VQA　often　combine　visual　and　question　features　directly　to　create　a　unified　cross-modality　representation　for　answer　inference.　However,　this　kind　of　approach　fails　to　bridge　the　semantic　gap　between　visual　and　text　modalities,　resulting　in　a　lack　of　alignment　in　cross-modality　semantics　and　the　inability　to　match　key　visual　content　accurately.　In　this　article,　we　propose　a　model　called　the　caption　bridge-based　cross-modality　alignment　and　contrastive　learning　model　(CBAC)　to　address　the　issue.　The　CBAC　model　aims　to　reduce　the　semantic　gap　between　different　modalities.　It　consists　of　a　caption-based　cross-modality　alignment　module　and　a　visual-caption　(V-C)　contrastive　learning　module.　By　utilizing　an　auxiliary　caption　that　shares　the　same　modality　as　the　question　and　has　closer　semantic　associations　with　the　visual,　we　are　able　to　effectively　reduce　the　semantic　gap　by　separately　matching　the　caption　with　both　the　question　and　the　visual　to　generate　pre-alignment　features　for　each,　which　are　then　used　in　the　subsequent　fusion　process.　We　also　leverage　the　fact　that　V-C　pairs　exhibit　stronger　semantic　connections　compared　to　question-visual　(Q-V)　pairs　to　employ　a　contrastive　learning　mechanism　on　visual　and　caption　pairs　to　further　enhance　the　semantic　alignment　capabilities　of　single-modality　encoders.　Extensive　experiments　conducted　on　three　benchmark　datasets　demonstrate　that　the　proposed　model　outperforms　previous　state-of-the-art　VQA　models.　Additionally,　ablation　experiments　confirm　the　effectiveness　of　each　module　in　our　model.　Furthermore,　we　conduct　a　qualitative　analysis　by　visualizing　the　attention　matrices　to　assess　the　reasoning　reliability　of　the　proposed　model.

Keyword：

cross-modality analysis Caption bridge visual question answering (VQA) contrastive learning

Author Community：

[ 1 ] [Wang, Boyue]Beijing Univ Technol, Beijing Artificial Intelligence Inst, Fac Informat Technol, Beijing Municipal Key Lab Multimedia & Intelligent, Beijing 100124, Peoples R China
[ 2 ] [Ma, Yujian]Beijing Univ Technol, Beijing Artificial Intelligence Inst, Fac Informat Technol, Beijing Municipal Key Lab Multimedia & Intelligent, Beijing 100124, Peoples R China
[ 3 ] [Li, Xiaoyan]Beijing Univ Technol, Beijing Artificial Intelligence Inst, Fac Informat Technol, Beijing Municipal Key Lab Multimedia & Intelligent, Beijing 100124, Peoples R China
[ 4 ] [Hu, Yongli]Beijing Univ Technol, Beijing Artificial Intelligence Inst, Fac Informat Technol, Beijing Municipal Key Lab Multimedia & Intelligent, Beijing 100124, Peoples R China
[ 5 ] [Yin, Baocai]Beijing Univ Technol, Beijing Artificial Intelligence Inst, Fac Informat Technol, Beijing Municipal Key Lab Multimedia & Intelligent, Beijing 100124, Peoples R China

Reprint Author's Address：

[Li, Xiaoyan]Beijing Univ Technol, Beijing Artificial Intelligence Inst, Fac Informat Technol, Beijing Municipal Key Lab Multimedia & Intelligent, Beijing 100124, Peoples R China

Email：

Show more details

Related Keywords：

Contrastive Visual-Question-Caption Counterfactuals on Biased Samples for Visual Question Answering
2024，43rd Chinese Control Conference, CCC 2024
VQA-PDF: Purifying Debiased Features for Robust Visual Question Answering Task
2024，ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT XII, ICIC 2024
Self-supervised knowledge distillation in counterfactual learning for VQA
2023，PATTERN RECOGNITION LETTERS
面向跨模态数据协同分析的视觉问答方法综述
2022，北京工业大学学报

Source ：

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

ISSN： 2162-237X

Year： 2024

1 0 . 4 0 0

JCR@2022

Cited Count：

WoS CC Cited Count： 3

SCOPUS Cited Count： 3

ESI Highly Cited Papers on the List： 0 Unfold All

WanFang Cited Count：

Chinese Cited Count：

30 Days PV： 1

Affiliated Colleges：

Get Fulltext

DOI Library Discovery Baidu Scholar Search Web of Science

Type
Departments

All Years Choose Year From to