
Authors:

Liu, Tengfei | Hu, Yongli | Gao, Junbin | Sun, Yanfeng | Yin, Baocai

Indexed in:

EI, Scopus, SCIE

Abstract:

In the context of long document classification (LDC), effectively utilizing multi-modal information encompassing texts and images within these documents has not received adequate attention. This task showcases several notable characteristics. Firstly, the text possesses an implicit or explicit hierarchical structure consisting of sections, sentences, and words. Secondly, the distribution of images is dispersed, encompassing various types such as highly relevant topic images and loosely related reference images. Lastly, intricate and diverse relationships exist between images and text at different levels. To address these challenges, we propose a novel approach called Hierarchical Multi-modal Prompting Transformer (HMPT). Our proposed method constructs the uni-modal and multi-modal transformers at both the section and sentence levels, facilitating effective interaction between features. Notably, we design an adaptive multi-scale multi-modal transformer tailored to capture the multi-granularity correlations between sentences and images. Additionally, we introduce three different types of shared prompts, i.e., shared section, sentence, and image prompts, as bridges connecting the isolated transformers, enabling seamless information interaction across different levels and modalities. To validate the model performance, we conducted experiments on two newly created and two publicly available multi-modal long document datasets. The obtained results show that our method outperforms state-of-the-art single-modality and multi-modality classification methods.
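The abstract's central idea, shared prompt tokens acting as bridges between otherwise isolated uni-modal transformers, can be illustrated with a toy sketch. The code below is not the authors' implementation; it is a minimal, hypothetical numpy illustration in which the same prompt vectors are prepended to a sentence-level token sequence, updated by self-attention there, and then carried into an image-level sequence, so information flows across levels through the prompts. All function and variable names are illustrative, and the attention here is an unparameterized scaled dot-product for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # single-head scaled dot-product self-attention over one sequence
    # (projection weights omitted; tokens serve as Q, K, and V)
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    return softmax(scores) @ tokens

rng = np.random.default_rng(0)
d = 16
shared_prompts = rng.normal(size=(4, d))    # shared prompt tokens (the "bridge")
sentence_tokens = rng.normal(size=(10, d))  # sentence-level text features
image_tokens = rng.normal(size=(6, d))      # image features

# Step 1: prompts attend jointly with sentence tokens; after attention,
# the prompt positions carry sentence-level context.
sent_out = self_attention(np.vstack([shared_prompts, sentence_tokens]))
updated_prompts = sent_out[:4]

# Step 2: the updated prompts are prepended to the image sequence, letting
# image tokens attend to sentence context without a full cross-attention block.
img_out = self_attention(np.vstack([updated_prompts, image_tokens]))
```

In the paper's full design there are three such prompt types (section, sentence, and image prompts) connecting transformers at multiple levels; this sketch shows only the mechanism of one prompt set crossing one boundary.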

Keywords:

prompt learning; Feature extraction; Adaptation models; Task analysis; adaptive multi-scale multi-modal transformer; Visualization; Computational modeling; multi-modal transformer; Transformers; Circuits and systems; Multi-modal long document classification

Author affiliations:

  • [ 1 ] [Liu, Tengfei]Beijing Univ Technol, Beijing Inst Artificial Intelligence, Fac Informat Technol, Beijing Key Lab Multimedia & Intelligent Software, Beijing 100124, Peoples R China
  • [ 2 ] [Hu, Yongli]Beijing Univ Technol, Beijing Inst Artificial Intelligence, Fac Informat Technol, Beijing Key Lab Multimedia & Intelligent Software, Beijing 100124, Peoples R China

Corresponding author:

  • [Hu, Yongli]Beijing Univ Technol, Beijing Inst Artificial Intelligence, Fac Informat Technol, Beijing Key Lab Multimedia & Intelligent Software, Beijing 100124, Peoples R China


Source:

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY

ISSN: 1051-8215

Year: 2024

Volume: 34

Issue: 7

Pages: 6376-6390

Impact Factor: 8.400 (JCR@2022)

Citation counts:

SCOPUS citations: 6

ESI Highly Cited Paper listings: 0

Views in last 30 days: 1
