Indexed in:
Abstract:
In the context of long document classification (LDC), effectively utilizing multi-modal information encompassing texts and images within these documents has not received adequate attention. This task showcases several notable characteristics. Firstly, the text possesses an implicit or explicit hierarchical structure consisting of sections, sentences, and words. Secondly, the distribution of images is dispersed, encompassing various types such as highly relevant topic images and loosely related reference images. Lastly, intricate and diverse relationships exist between images and text at different levels. To address these challenges, we propose a novel approach called Hierarchical Multi-modal Prompting Transformer (HMPT). Our proposed method constructs the uni-modal and multi-modal transformers at both the section and sentence levels, facilitating effective interaction between features. Notably, we design an adaptive multi-scale multi-modal transformer tailored to capture the multi-granularity correlations between sentences and images. Additionally, we introduce three different types of shared prompts, i.e., shared section, sentence, and image prompts, as bridges connecting the isolated transformers, enabling seamless information interaction across different levels and modalities. To validate the model performance, we conducted experiments on two newly created and two publicly available multi-modal long document datasets. The obtained results show that our method outperforms state-of-the-art single-modality and multi-modality classification methods.
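Illustrative sketch (not the authors' implementation): based only on the abstract's description, the code below shows one way shared, learnable "section", "sentence", and "image" prompt tokens could be prepended to otherwise isolated transformer encoders so that information flows across levels and modalities through those tokens. All module names, dimensions, class counts, and the fusion strategy are assumptions for illustration.

```python
# Minimal sketch of the shared-prompt bridging idea described in the abstract.
# Everything here (names, sizes, fusion order) is assumed, not taken from the paper.
import torch
import torch.nn as nn


class PromptedEncoder(nn.Module):
    """Transformer encoder that prepends externally supplied prompt tokens."""

    def __init__(self, dim: int = 256, depth: int = 2, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, prompts: torch.Tensor, tokens: torch.Tensor):
        # prompts: (batch, n_prompt, dim), tokens: (batch, seq, dim)
        x = self.encoder(torch.cat([prompts, tokens], dim=1))
        # Return the updated prompts and the content tokens separately.
        return x[:, : prompts.size(1)], x[:, prompts.size(1):]


class SharedPromptSketch(nn.Module):
    """Hypothetical wiring of sentence-, image-, and section-level encoders
    bridged by shared prompt tokens."""

    def __init__(self, dim: int = 256, n_prompt: int = 4, n_classes: int = 10):
        super().__init__()
        self.sentence_prompts = nn.Parameter(torch.randn(1, n_prompt, dim))
        self.image_prompts = nn.Parameter(torch.randn(1, n_prompt, dim))
        self.section_prompts = nn.Parameter(torch.randn(1, n_prompt, dim))
        self.sentence_enc = PromptedEncoder(dim)
        self.image_enc = PromptedEncoder(dim)
        self.section_enc = PromptedEncoder(dim)
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, sent_feats: torch.Tensor, img_feats: torch.Tensor):
        b = sent_feats.size(0)
        sp = self.sentence_prompts.expand(b, -1, -1)
        ip = self.image_prompts.expand(b, -1, -1)
        cp = self.section_prompts.expand(b, -1, -1)

        # Sentence- and image-level encoding, each with its shared prompts.
        sp, sent_out = self.sentence_enc(sp, sent_feats)
        ip, img_out = self.image_enc(ip, img_feats)

        # The section-level encoder attends to the updated prompts as bridging
        # tokens, so cross-level / cross-modal information flows through them.
        tokens = torch.cat([sp, ip, sent_out, img_out], dim=1)
        cp, _ = self.section_enc(cp, tokens)

        # Pool the section prompts for document-level classification.
        return self.classifier(cp.mean(dim=1))


if __name__ == "__main__":
    model = SharedPromptSketch()
    sent = torch.randn(2, 12, 256)   # 12 pre-encoded sentence features per document
    imgs = torch.randn(2, 3, 256)    # 3 pre-encoded image features per document
    print(model(sent, imgs).shape)   # torch.Size([2, 10])
```

This sketch omits the paper's adaptive multi-scale multi-modal transformer and the hierarchical section/sentence decomposition; it is intended only to make the "shared prompts as bridges" mechanism concrete.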
Keywords:
Corresponding author information:
Email address:
Source:
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY
ISSN: 1051-8215
Year: 2024
Issue: 7
Volume: 34
Pages: 6376-6390
Impact factor: 8.400 (JCR@2022)
Affiliated department: