Indexed by:
Abstract:
The objective of image captioning is to enable computers to autonomously generate human-like sentences that describe a given image. To address the issues of insufficient accuracy in image feature extraction and underutilization of visual information, we propose a Swin Transformer-based image captioning model with feature enhancement and multi-stage fusion. First, the Swin Transformer is employed as the encoder to extract image features, and feature enhancement is adopted to capture richer image feature information. Then, a multi-stage image and semantic fusion module is constructed to exploit the semantic information from past time steps. Finally, an LSTM decodes the semantic and image information to generate captions. The proposed model achieves better results than the baselines on the public Flickr8K and Flickr30K datasets. © 2023 IEEE.
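To make the encoder-decoder pipeline described in the abstract concrete, the following is a minimal PyTorch sketch of a Swin Transformer encoder feeding an LSTM caption decoder. The backbone name (swin_tiny_patch4_window7_224 from timm), the simple fusion strategy (prepending the pooled image feature to the token sequence), and all dimensions are illustrative assumptions; the paper's feature enhancement and multi-stage fusion modules are not reproduced here.

import torch
import torch.nn as nn
import timm

class SwinLSTMCaptioner(nn.Module):
    """Sketch: Swin Transformer encoder + LSTM caption decoder (assumed details)."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        # Swin Transformer backbone as the visual encoder; num_classes=0
        # makes timm return pooled features instead of class logits.
        self.encoder = timm.create_model(
            "swin_tiny_patch4_window7_224", pretrained=False, num_classes=0
        )
        feat_dim = self.encoder.num_features  # 768 for swin_tiny
        # Project the pooled visual feature into the decoder's embedding space.
        self.visual_proj = nn.Linear(feat_dim, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Single-layer LSTM decoder that generates the caption token by token.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # Global image representation from the Swin encoder.
        feats = self.visual_proj(self.encoder(images))            # (B, embed_dim)
        tokens = self.embed(captions)                             # (B, T, embed_dim)
        # Prepend the image feature as the first "token" seen by the LSTM.
        inputs = torch.cat([feats.unsqueeze(1), tokens], dim=1)   # (B, T+1, embed_dim)
        out, _ = self.lstm(inputs)
        return self.fc(out)                                       # (B, T+1, vocab_size)

if __name__ == "__main__":
    model = SwinLSTMCaptioner(vocab_size=10000)
    imgs = torch.randn(2, 3, 224, 224)
    caps = torch.randint(0, 10000, (2, 15))
    print(model(imgs, caps).shape)  # torch.Size([2, 16, 10000])

In the actual model, the multi-stage fusion module would combine the enhanced image features with semantic information from previous time steps at each decoding stage rather than only once at the start of the sequence.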
Keywords:
Corresponding author information:
Email address:
Source:
Year: 2023
Language: English
Affiliated department: