Author:

Sun, Zhongfan | Hu, Yongli | Gao, Qingqing | Jiang, Huajie | Gao, Junbin | Sun, Yanfeng | Yin, Baocai

Indexed by:

CPCI-S; EI; Scopus

Abstract:

Considerable performance gains have been achieved in knowledge-based visual question answering (VQA) thanks to visual-language pre-training models following the pre-training-then-fine-tuning paradigm. However, because the objectives of the pre-training and fine-tuning stages differ, an evident barrier prevents the cross-modal comprehension ability developed during pre-training from fully benefiting the fine-tuning task. To break this barrier, in this paper we propose a novel hybrid prompting model for knowledge-based VQA, which inherits and incorporates the pre-training and fine-tuning tasks under a shared objective. Specifically, based on a static declaration prompt, we construct a goal consistent with fine-tuning via masked language modeling to inherit the capabilities of the pre-training task, while selecting the top-t relevant knowledge in a dense-retrieval manner. Additionally, a dynamic knowledge prompt is learned from the retrieved knowledge, which not only alleviates the input-length constraint of visual-language pre-trained models but also assists in providing answer features during fine-tuning. Combining and unifying the aims of the two stages fully exploits the abilities of both pre-training and fine-tuning to predict the answer. We evaluate the proposed model on the OKVQA dataset, and the results show that our model outperforms state-of-the-art methods based on visual-language pre-training models by a noticeable margin and even exceeds the large-scale language model GPT-3, which demonstrates the benefits of the hybrid prompts and the advantages of unifying pre-training with fine-tuning.
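
To make the workflow described in the abstract concrete, the following PyTorch sketch illustrates the hybrid-prompting idea; it is not the authors' implementation. Retrieved knowledge is compressed into a fixed-length dynamic prompt, concatenated with a masked declaration prompt, fused with the image by a visual-language encoder, and the answer is scored at the [MASK] slot with the pre-trained MLM head. All module names and signatures (`vlp_encoder`, `mlm_head`) and the cross-attention compression step are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class HybridPromptVQA(nn.Module):
    """Sketch of a hybrid-prompting head: a static declaration prompt scored by the
    pre-trained MLM head, plus a dynamic knowledge prompt compressed from retrieved
    knowledge so that the model's input length stays fixed."""

    def __init__(self, vlp_encoder, mlm_head, hidden_dim, num_prompt_tokens=8):
        super().__init__()
        self.vlp_encoder = vlp_encoder      # hypothetical visual-language encoder
        self.mlm_head = mlm_head            # pre-trained masked-language-modeling head
        self.knowledge_proj = nn.Linear(hidden_dim, hidden_dim)
        # Learnable queries that will be conditioned on the retrieved knowledge.
        self.prompt_queries = nn.Parameter(torch.randn(num_prompt_tokens, hidden_dim))

    def build_knowledge_prompt(self, knowledge_embeds):
        # knowledge_embeds: (batch, top_t, hidden) from a dense retriever.
        # Cross-attend the learnable queries to the retrieved passages so that any
        # number of passages is compressed into a fixed-length dynamic prompt.
        b = knowledge_embeds.size(0)
        q = self.prompt_queries.unsqueeze(0).expand(b, -1, -1)          # (b, p, h)
        k = self.knowledge_proj(knowledge_embeds)                        # (b, t, h)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
        return attn @ knowledge_embeds                                   # (b, p, h)

    def forward(self, image_feats, declaration_embeds, mask_pos, knowledge_embeds):
        # declaration_embeds: the question rewritten as a declarative sentence with
        # a [MASK] slot, already token-embedded: (batch, seq, hidden).
        prompt = self.build_knowledge_prompt(knowledge_embeds)
        text_inputs = torch.cat([prompt, declaration_embeds], dim=1)
        hidden = self.vlp_encoder(image_feats, text_inputs)              # fused features
        # Score the [MASK] position with the same MLM head used in pre-training,
        # so pre-training and fine-tuning share one objective.
        idx = prompt.size(1) + mask_pos                                  # offset by prompt length
        mask_hidden = hidden[torch.arange(hidden.size(0)), idx]
        return self.mlm_head(mask_hidden)                                # answer-vocabulary logits
```

Reusing the pre-trained MLM head keeps the fine-tuning objective aligned with the pre-training one, which is the central idea of the abstract; compressing retrieved knowledge into a fixed-length prompt is one plausible way to meet the input-length constraint it mentions.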

Keyword:

Knowledge Integration; Visual Question Answering; Multi-modal Fusion

Author Community:

  • [ 1 ] [Sun, Zhongfan]Beijing Univ Technol, Beijing, Peoples R China
  • [ 2 ] [Hu, Yongli]Beijing Univ Technol, Beijing, Peoples R China
  • [ 3 ] [Gao, Qingqing]Beijing Univ Technol, Beijing, Peoples R China
  • [ 4 ] [Jiang, Huajie]Beijing Univ Technol, Beijing, Peoples R China
  • [ 5 ] [Sun, Yanfeng]Beijing Univ Technol, Beijing, Peoples R China
  • [ 6 ] [Yin, Baocai]Beijing Univ Technol, Beijing, Peoples R China
  • [ 7 ] [Gao, Junbin]Univ Sydney, Sydney, NSW, Australia

Source:

PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023

Year: 2023

Page: 4065-4073

SCOPUS Cited Count: 5

ESI Highly Cited Papers on the List: 0

