Indexed by:
Abstract:
As a prevailing cross-modal reasoning task, Visual Question Answering (VQA) has achieved impressive progress in the last few years, and language bias has been widely studied to learn more robust VQA models. However, visual bias, which also influences the robustness of VQA models, is seldom considered, resulting in weak inference ability. Therefore, balancing the effects of language bias and visual bias has become essential in the current VQA task. In this paper, we devise a new reweighting strategy that takes both language bias and visual bias into account, and propose a Fair Attention Network for Robust Visual Question Answering (named FAN-VQA). It first constructs a question bias branch and a visual bias branch to estimate the bias information from the two modalities, which is used to judge the importance of samples. Then, adaptive importance weights are learned from the bias information and assigned to the candidate answers to adjust the training losses, enabling the model to shift more attention to difficult samples that need less-salient visual clues to infer the correct answer. To improve the robustness of the VQA model, we design a progressive strategy to balance the influence of the original training loss and the adjusted training loss. Extensive experiments on the VQA-CP v2, VQA v2, and VQA-CE datasets demonstrate the effectiveness of the proposed FAN-VQA method.
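The reweighting idea summarized in the abstract can be made concrete with a minimal sketch. The following PyTorch snippet is an illustrative assumption, not the authors' released implementation: the function name fan_vqa_style_loss, the sigmoid-based confidence weighting of the two bias branches, and the linear progressive schedule are all hypothetical choices used only to show how bias-derived importance weights could adjust the per-answer loss and how the original and adjusted losses might be blended over training.

    import torch
    import torch.nn.functional as F

    def fan_vqa_style_loss(logits, q_bias_logits, v_bias_logits, targets,
                           epoch, total_epochs):
        """Illustrative bias-aware reweighting loss (hypothetical sketch).

        logits:        [B, A] answer scores from the fused VQA model
        q_bias_logits: [B, A] scores from a question-only bias branch
        v_bias_logits: [B, A] scores from a visual-only bias branch
        targets:       [B, A] soft answer targets (VQA-style)
        """
        # Standard per-answer binary cross-entropy, as commonly used in VQA.
        base_loss = F.binary_cross_entropy_with_logits(
            logits, targets, reduction="none")

        # Estimate how confidently each single-modality bias branch already
        # predicts the ground-truth answers.
        q_conf = torch.sigmoid(q_bias_logits)
        v_conf = torch.sigmoid(v_bias_logits)

        # Answers that both bias branches handle easily get smaller weights;
        # hard samples needing less-salient visual clues get larger weights.
        importance = (1.0 - q_conf * targets) * (1.0 - v_conf * targets)
        adjusted_loss = importance * base_loss

        # Progressive schedule: shift gradually from the original loss to the
        # bias-adjusted loss as training proceeds.
        alpha = min(1.0, epoch / max(1, total_epochs // 2))
        loss = (1.0 - alpha) * base_loss + alpha * adjusted_loss
        return loss.mean()

Under this sketch, epoch 0 reduces to the plain cross-entropy loss, and later epochs progressively emphasize the bias-adjusted term, mirroring the progressive balancing strategy described in the abstract.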
Keywords:
Corresponding author:
Email address:
Source:
IEEE Transactions on Circuits and Systems for Video Technology
ISSN: 1051-8215
Year: 2024
Issue: 9
Volume: 34
Pages: 7870-7881
Impact factor: 8.400 (JCR@2022)
Affiliated department: