Author:

Sun, Zhongfan | Hu, Yongli | Gao, Qingqing | Jiang, Huajie | Gao, Junbin | Sun, Yanfeng | Yin, Baocai

Indexed by:

CPCI-S; EI; Scopus

Abstract:

Considerable performance gains have been achieved in knowledge-based visual question answering (VQA) thanks to visual-language pre-training models following the pre-training-then-fine-tuning paradigm. However, because the objectives of the pre-training and fine-tuning stages differ, an evident barrier prevents the cross-modal comprehension ability developed during pre-training from fully benefiting the fine-tuning task. To break this barrier, in this paper we propose a novel hybrid prompting model for knowledge-based VQA, which inherits and incorporates the pre-training and fine-tuning tasks under a shared objective. Specifically, based on a static declaration prompt, we construct a goal consistent with fine-tuning via masked language modeling to inherit the capabilities of the pre-training task, while selecting the top-t relevant knowledge in a dense-retrieval manner. Additionally, a dynamic knowledge prompt is learned from the retrieved knowledge, which not only alleviates the input-length constraint of visual-language pre-trained models but also assists in providing answer features during fine-tuning. Combining and unifying the aims of the two stages fully exploits the abilities of both pre-training and fine-tuning to predict the answer. We evaluate the proposed model on the OKVQA dataset, and the results show that our model outperforms state-of-the-art methods based on visual-language pre-training models by a noticeable margin and even exceeds the large-scale language model GPT-3, which demonstrates the benefits of the hybrid prompts and the advantages of unifying pre-training with fine-tuning.
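
To make the workflow described in the abstract concrete, the following PyTorch sketch illustrates the hybrid-prompting idea; it is not the authors' implementation. Retrieved knowledge is compressed into a fixed-length dynamic prompt, concatenated with a masked declaration prompt, fused with the image by a visual-language encoder, and the answer is scored at the [MASK] slot with the pre-trained MLM head. All module names and signatures (`vlp_encoder`, `mlm_head`) and the cross-attention compression step are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class HybridPromptVQA(nn.Module):
    """Sketch of a hybrid-prompting head: a static declaration prompt scored by the
    pre-trained MLM head, plus a dynamic knowledge prompt compressed from retrieved
    knowledge so that the model's input length stays fixed."""

    def __init__(self, vlp_encoder, mlm_head, hidden_dim, num_prompt_tokens=8):
        super().__init__()
        self.vlp_encoder = vlp_encoder      # hypothetical visual-language encoder
        self.mlm_head = mlm_head            # pre-trained masked-language-modeling head
        self.knowledge_proj = nn.Linear(hidden_dim, hidden_dim)
        # Learnable queries that will be conditioned on the retrieved knowledge.
        self.prompt_queries = nn.Parameter(torch.randn(num_prompt_tokens, hidden_dim))

    def build_knowledge_prompt(self, knowledge_embeds):
        # knowledge_embeds: (batch, top_t, hidden) from a dense retriever.
        # Cross-attend the learnable queries to the retrieved passages so that any
        # number of passages is compressed into a fixed-length dynamic prompt.
        b = knowledge_embeds.size(0)
        q = self.prompt_queries.unsqueeze(0).expand(b, -1, -1)          # (b, p, h)
        k = self.knowledge_proj(knowledge_embeds)                        # (b, t, h)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
        return attn @ knowledge_embeds                                   # (b, p, h)

    def forward(self, image_feats, declaration_embeds, mask_pos, knowledge_embeds):
        # declaration_embeds: the question rewritten as a declarative sentence with
        # a [MASK] slot, already token-embedded: (batch, seq, hidden).
        prompt = self.build_knowledge_prompt(knowledge_embeds)
        text_inputs = torch.cat([prompt, declaration_embeds], dim=1)
        hidden = self.vlp_encoder(image_feats, text_inputs)              # fused features
        # Score the [MASK] position with the same MLM head used in pre-training,
        # so pre-training and fine-tuning share one objective.
        idx = prompt.size(1) + mask_pos                                  # offset by prompt length
        mask_hidden = hidden[torch.arange(hidden.size(0)), idx]
        return self.mlm_head(mask_hidden)                                # answer-vocabulary logits
```

Reusing the pre-trained MLM head keeps the fine-tuning objective aligned with the pre-training one, which is the central idea of the abstract; compressing retrieved knowledge into a fixed-length prompt is one plausible way to meet the input-length constraint it mentions.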

Keyword:

Knowledge Integration; Visual Question Answering; Multi-modal Fusion

Author Community:

  • [ 1 ] [Sun, Zhongfan]Beijing Univ Technol, Beijing, Peoples R China
  • [ 2 ] [Hu, Yongli]Beijing Univ Technol, Beijing, Peoples R China
  • [ 3 ] [Gao, Qingqing]Beijing Univ Technol, Beijing, Peoples R China
  • [ 4 ] [Jiang, Huajie]Beijing Univ Technol, Beijing, Peoples R China
  • [ 5 ] [Sun, Yanfeng]Beijing Univ Technol, Beijing, Peoples R China
  • [ 6 ] [Yin, Baocai]Beijing Univ Technol, Beijing, Peoples R China
  • [ 7 ] [Gao, Junbin]Univ Sydney, Sydney, NSW, Australia

Source:

PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023

Year: 2023

Page: 4065-4073

SCOPUS Cited Count: 5

ESI Highly Cited Papers on the List: 0

