Indexed in:
Abstract:
The objective of image captioning is to enable computers to autonomously generate human-like sentences that describe a given image. To address the insufficient accuracy of image feature extraction and the underutilization of visual information, we present a Swin Transformer-based image captioning model with feature enhancement and multi-stage fusion (Swin-Caption). First, the Swin Transformer is employed as the encoder to extract image features, and feature enhancement is applied to capture additional image feature information. Second, a multi-stage image-semantic fusion module is constructed to exploit the semantic information from past time steps. Finally, a two-layer LSTM decodes the semantic and image information to generate captions. In experiments and case analyses on the public datasets Flickr8K, Flickr30K, and MS-COCO, the proposed model outperforms the baseline model. © 2024 World Scientific Publishing Europe Ltd.
Keywords:
Corresponding author information:
Email address:
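
The abstract above outlines an encoder-enhancement-fusion-decoder pipeline. The following is a minimal PyTorch sketch of that flow, not the paper's exact design: the tensor dimensions, the channel-attention form of the enhancement block, and the concatenation-based fusion are illustrative assumptions, and the Swin encoder itself is stubbed out as precomputed region features (in practice it could be loaded from a library such as timm).

```python
# Hedged sketch of a Swin-Caption-style pipeline; all shapes and module
# designs are assumptions for illustration, not the published architecture.
import torch
import torch.nn as nn

class FeatureEnhancement(nn.Module):
    """Assumed enhancement block: channel attention over Swin region features."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(dim, dim // 4), nn.ReLU(),
                                nn.Linear(dim // 4, dim), nn.Sigmoid())

    def forward(self, feats):                      # feats: (B, N, dim)
        weights = self.fc(feats.mean(dim=1))       # (B, dim) channel weights
        return feats * weights.unsqueeze(1)        # re-weight each region

class SwinCaption(nn.Module):
    def __init__(self, vocab_size, dim=768, hidden=512, embed=512):
        super().__init__()
        # The real encoder would be a pretrained Swin Transformer; here the
        # model consumes its output features directly to stay self-contained.
        self.enhance = FeatureEnhancement(dim)
        self.embed = nn.Embedding(vocab_size, embed)
        # Two-layer LSTM decoder; the exact wiring below is an assumption.
        self.lstm1 = nn.LSTMCell(embed + dim, hidden)
        self.lstm2 = nn.LSTMCell(hidden + dim, hidden)
        self.fuse = nn.Linear(hidden + dim, dim)   # assumed fusion: concat + linear
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, feats, tokens):              # feats: (B, N, dim), tokens: (B, T)
        feats = self.enhance(feats)                # enhanced region features
        v_bar = feats.mean(dim=1)                  # pooled global visual feature
        B, T = tokens.shape
        h1 = c1 = h2 = c2 = feats.new_zeros(B, self.lstm1.hidden_size)
        logits = []
        for t in range(T):
            w = self.embed(tokens[:, t])
            # stage 1: fuse the current word with the pooled visual feature
            h1, c1 = self.lstm1(torch.cat([w, v_bar], dim=1), (h1, c1))
            # stage 2: fuse stage-1 semantics (the recurrent state carries
            # past time steps) with the visual feature once more
            ctx = self.fuse(torch.cat([h1, v_bar], dim=1))
            h2, c2 = self.lstm2(torch.cat([h1, ctx], dim=1), (h2, c2))
            logits.append(self.out(h2))
        return torch.stack(logits, dim=1)          # (B, T, vocab_size)

if __name__ == "__main__":
    model = SwinCaption(vocab_size=10000)
    feats = torch.randn(2, 49, 768)                # stand-in Swin output: 49 patches
    tokens = torch.randint(0, 10000, (2, 12))      # a batch of 12-token captions
    print(model(feats, tokens).shape)              # torch.Size([2, 12, 10000])
```

The two-layer split mirrors a common captioning design in which the first LSTM mixes word and visual inputs and the second refines that mixture into the language state; whether Swin-Caption uses this exact wiring is not stated in the abstract.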