Indexed by:
Abstract:
Scene Text Recognition (STR) requires a large amount of data to develop a powerful recognizer, including visual data such as images and linguistic data such as texts. However, existing methods mainly adopt a one-stage training scheme that trains the entire framework end-to-end, which relies heavily on well-annotated images and does not effectively exploit the data of the two modalities mentioned above. To address this, we propose a pre-trained multi-modal network (PMMN) that uses visual and linguistic data to pre-train a vision model and a language model, respectively, so that each learns modality-specific knowledge for accurate scene text recognition. Specifically, we first pre-train the proposed off-the-shelf vision and language models to convergence. We then combine the pre-trained models in a unified framework for end-to-end fine-tuning, where the learned multi-modal information of the two branches interacts to generate robust features for character prediction. Extensive experiments demonstrate the effectiveness of PMMN: evaluation on six benchmarks shows that the proposed method exceeds most existing methods, achieving state-of-the-art performance. (c) 2021 Published by Elsevier B.V.
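The abstract describes a two-stage scheme: pre-train the vision and language branches separately on their own modalities, then fuse them in one framework for end-to-end fine-tuning. Below is a minimal PyTorch sketch of that training flow. The module architectures, the fusion by concatenation, the charset size, and all names (VisionModel, LanguageModel, PMMN) are illustrative assumptions; the paper's actual PMMN design is not specified in this record.

```python
# Sketch of the two-stage pre-train-then-fine-tune scheme (assumptions throughout).
import torch
import torch.nn as nn

NUM_CLASSES = 97   # assumed charset size (alphanumerics + symbols + blank)
MAX_LEN = 25       # assumed maximum text length per image

class VisionModel(nn.Module):
    """Stand-in vision branch: CNN features pooled into per-step embeddings."""
    def __init__(self, dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, 2, 1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, MAX_LEN)),   # -> (B, dim, 1, MAX_LEN)
        )
        self.head = nn.Linear(dim, NUM_CLASSES)

    def forward(self, images):
        feats = self.backbone(images).squeeze(2).transpose(1, 2)  # (B, T, dim)
        return feats, self.head(feats)

class LanguageModel(nn.Module):
    """Stand-in language branch: refines character sequences with a transformer."""
    def __init__(self, dim=256):
        super().__init__()
        self.embed = nn.Embedding(NUM_CLASSES, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, NUM_CLASSES)

    def forward(self, tokens):
        feats = self.encoder(self.embed(tokens))  # (B, T, dim)
        return feats, self.head(feats)

class PMMN(nn.Module):
    """Unified framework: the two pre-trained branches interact for fine-tuning."""
    def __init__(self, vision, language, dim=256):
        super().__init__()
        self.vision, self.language = vision, language
        self.fuse = nn.Linear(2 * dim, NUM_CLASSES)  # assumed fusion: concat + project

    def forward(self, images):
        v_feats, v_logits = self.vision(images)
        # Feed the vision branch's greedy prediction to the language branch.
        l_feats, _ = self.language(v_logits.argmax(-1))
        return self.fuse(torch.cat([v_feats, l_feats], dim=-1))

# Stage 1: pre-train each branch to convergence on its own modality
# (labeled images for the vision model, text corpora for the language model),
# e.g. with cross-entropy losses on their respective heads.
vision, language = VisionModel(), LanguageModel()

# Stage 2: combine the pre-trained branches and fine-tune end-to-end.
model = PMMN(vision, language)
logits = model(torch.randn(2, 3, 32, 100))  # dummy batch of text-line images
print(logits.shape)                         # torch.Size([2, 25, 97])
```

The key design point the abstract emphasizes is that each branch reaches convergence on unlabeled or cheaply labeled single-modality data before any joint training, so the expensive well-annotated images are only needed in the fine-tuning stage.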
Keywords:
Corresponding Author:
Email Address:
Source:
PATTERN RECOGNITION LETTERS
ISSN: 0167-8655
Year: 2021
Volume: 151
Pages: 103-111
Impact Factor: 5.100 (JCR@2022)
ESI Field: ENGINEERING
ESI Highly Cited Threshold: 87
JCR Quartile: Q2
Affiliated Department: