Indexed in:
Abstract:
Current mainstream image captioning models are based on the encoder-decoder framework with multi-head attention, which commonly employs grid image features as the input and has shown superior performance. However, self-attention in the encoder only models the visual relations among fixed-scale grid features, and the multi-head attention mechanism is not fully exploited to capture diverse information for more effective feature representation, which limits the quality of the generated captions. To address this problem, we propose a novel Scale-aware Multi-head Information Aggregation (SMIA) model for image captioning. SMIA introduces multi-scale visual features to improve feature representation from the perspective of attention heads. Specifically, a scale expansion algorithm is proposed to extract multi-scale visual features. Then, for the different heads of the multi-head attention, different high-scale features are integrated into the fixed low-scale grid features to capture diverse and richer information. In addition, different high-scale features are introduced for the shallow and deep layers of the encoder to further improve feature representation. Moreover, SMIA can be flexibly combined with existing Transformer models to further improve performance. Experimental results on the MS COCO dataset demonstrate the effectiveness of the proposed SMIA. © 2024 Technical Committee on Control Theory, Chinese Association of Automation.
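The abstract's central mechanism (queries drawn from the fixed low-scale grid features while each attention head aggregates keys and values from a differently scaled feature map) can be illustrated with a minimal sketch. This is an assumption-based illustration of scale-aware multi-head attention derived only from the abstract, not the authors' implementation; the class name ScaleAwareMultiHeadAttention, the per-head projections, and all tensor shapes are hypothetical.

```python
# Minimal sketch of scale-aware multi-head attention (illustrative only).
# Assumption: queries come from the fixed low-scale grid; each head receives
# keys/values from its own scale so heads capture diverse information.
import torch
import torch.nn as nn


class ScaleAwareMultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        # One key/value projection per head, since each head sees its own scale.
        self.k_projs = nn.ModuleList(nn.Linear(d_model, self.d_head) for _ in range(num_heads))
        self.v_projs = nn.ModuleList(nn.Linear(d_model, self.d_head) for _ in range(num_heads))
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, low_scale, multi_scale):
        # low_scale:   (B, N, d_model) fixed low-scale grid features (queries)
        # multi_scale: list of num_heads tensors, each (B, M_h, d_model),
        #              one feature map per head at its own scale
        B, N, _ = low_scale.shape
        q = self.q_proj(low_scale).view(B, N, self.num_heads, self.d_head)
        head_outputs = []
        for h, feats in enumerate(multi_scale):
            k = self.k_projs[h](feats)                     # (B, M_h, d_head)
            v = self.v_projs[h](feats)                     # (B, M_h, d_head)
            attn = torch.softmax(
                q[:, :, h] @ k.transpose(1, 2) / self.d_head ** 0.5, dim=-1
            )                                              # (B, N, M_h)
            head_outputs.append(attn @ v)                  # (B, N, d_head)
        # Concatenate the per-head, per-scale aggregations and project back.
        return self.out_proj(torch.cat(head_outputs, dim=-1))


if __name__ == "__main__":
    mha = ScaleAwareMultiHeadAttention(d_model=512, num_heads=8)
    grid = torch.randn(2, 49, 512)                          # 7x7 low-scale grid
    scales = [torch.randn(2, m, 512) for m in (49, 49, 100, 100, 196, 196, 49, 100)]
    print(mha(grid, scales).shape)                          # torch.Size([2, 49, 512])
```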
Keywords:
Corresponding author:
Email address:
Source:
ISSN: 1934-1768
Year: 2024
Pages: 8771-8777
Language: English
Affiliated department: