Indexed by:
Abstract:
As a prevailing cross-modal reasoning task, Visual Question Answering (VQA) has achieved impressive progress in the last few years, and language bias has been widely studied to learn more robust VQA models. However, visual bias, which also influences the robustness of VQA models, is seldom considered, resulting in weak inference ability. Therefore, balancing the effects of language bias and visual bias has become essential in the current VQA task. In this paper, we devise a new reweighting strategy that takes both language bias and visual bias into account, and propose a Fair Attention Network for Robust Visual Question Answering (named FAN-VQA). It first constructs a question bias branch and a visual bias branch to estimate the bias information from the two modalities, which is used to judge the importance of samples. Then, adaptive importance weights are learned from the bias information and assigned to the candidate answers to adjust the training losses, enabling the model to shift more attention to difficult samples that need less-salient visual clues to infer the correct answer. To improve the robustness of the VQA model, we design a progressive strategy to balance the influence of the original training loss and the adjusted training loss. Extensive experiments on the VQA-CP v2, VQA v2, and VQA-CE datasets demonstrate the effectiveness of the proposed FAN-VQA method.
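The reweighting idea summarized in the abstract can be made concrete with a minimal sketch. The following PyTorch snippet is an illustrative assumption, not the authors' released implementation: the function name fan_vqa_style_loss, the sigmoid-based confidence weighting of the two bias branches, and the linear progressive schedule are all hypothetical choices used only to show how bias-derived importance weights could adjust the per-answer loss and how the original and adjusted losses might be blended over training.

    import torch
    import torch.nn.functional as F

    def fan_vqa_style_loss(logits, q_bias_logits, v_bias_logits, targets,
                           epoch, total_epochs):
        """Illustrative bias-aware reweighting loss (hypothetical sketch).

        logits:        [B, A] answer scores from the fused VQA model
        q_bias_logits: [B, A] scores from a question-only bias branch
        v_bias_logits: [B, A] scores from a visual-only bias branch
        targets:       [B, A] soft answer targets (VQA-style)
        """
        # Standard per-answer binary cross-entropy, as commonly used in VQA.
        base_loss = F.binary_cross_entropy_with_logits(
            logits, targets, reduction="none")

        # Estimate how confidently each single-modality bias branch already
        # predicts the ground-truth answers.
        q_conf = torch.sigmoid(q_bias_logits)
        v_conf = torch.sigmoid(v_bias_logits)

        # Answers that both bias branches handle easily get smaller weights;
        # hard samples needing less-salient visual clues get larger weights.
        importance = (1.0 - q_conf * targets) * (1.0 - v_conf * targets)
        adjusted_loss = importance * base_loss

        # Progressive schedule: shift gradually from the original loss to the
        # bias-adjusted loss as training proceeds.
        alpha = min(1.0, epoch / max(1, total_epochs // 2))
        loss = (1.0 - alpha) * base_loss + alpha * adjusted_loss
        return loss.mean()

Under this sketch, epoch 0 reduces to the plain cross-entropy loss, and later epochs progressively emphasize the bias-adjusted term, mirroring the progressive balancing strategy described in the abstract.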
Keywords:
Corresponding author:
Email address:
Source:
IEEE Transactions on Circuits and Systems for Video Technology
ISSN: 1051-8215
Year: 2024
Issue: 9
Volume: 34
Pages: 7870-7881
Impact factor: 8.400 (JCR@2022)
Affiliated department: