Indexed by:
Abstract:
Scene Text Recognition (STR) requires a large amount of data to develop a powerful recognizer, including visual data such as images and linguistic data such as texts. However, existing methods mainly adopt a one-stage training scheme that trains the entire framework end-to-end, which relies heavily on well-annotated images and does not effectively exploit the data of the two modalities mentioned above. To address this, we propose a pre-trained multi-modal network (PMMN) that uses visual and linguistic data to pre-train a vision model and a language model, respectively, so that each learns modality-specific knowledge for accurate scene text recognition. Specifically, we first pre-train the proposed off-the-shelf vision and language models to convergence. We then combine the pre-trained models in a unified framework for end-to-end fine-tuning, where the learned multi-modal information of the two branches interacts to generate robust features for character prediction. Extensive experiments demonstrate the effectiveness of PMMN: evaluation on six benchmarks shows that the proposed method exceeds most existing methods, achieving state-of-the-art performance. (c) 2021 Published by Elsevier B.V.
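The abstract describes a two-stage scheme: pre-train the vision and language branches separately on their own modalities, then fuse them in one framework for end-to-end fine-tuning. Below is a minimal PyTorch sketch of that training flow. The module architectures, the fusion by concatenation, the charset size, and all names (VisionModel, LanguageModel, PMMN) are illustrative assumptions; the paper's actual PMMN design is not specified in this record.

```python
# Sketch of the two-stage pre-train-then-fine-tune scheme (assumptions throughout).
import torch
import torch.nn as nn

NUM_CLASSES = 97   # assumed charset size (alphanumerics + symbols + blank)
MAX_LEN = 25       # assumed maximum text length per image

class VisionModel(nn.Module):
    """Stand-in vision branch: CNN features pooled into per-step embeddings."""
    def __init__(self, dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, 2, 1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, MAX_LEN)),   # -> (B, dim, 1, MAX_LEN)
        )
        self.head = nn.Linear(dim, NUM_CLASSES)

    def forward(self, images):
        feats = self.backbone(images).squeeze(2).transpose(1, 2)  # (B, T, dim)
        return feats, self.head(feats)

class LanguageModel(nn.Module):
    """Stand-in language branch: refines character sequences with a transformer."""
    def __init__(self, dim=256):
        super().__init__()
        self.embed = nn.Embedding(NUM_CLASSES, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, NUM_CLASSES)

    def forward(self, tokens):
        feats = self.encoder(self.embed(tokens))  # (B, T, dim)
        return feats, self.head(feats)

class PMMN(nn.Module):
    """Unified framework: the two pre-trained branches interact for fine-tuning."""
    def __init__(self, vision, language, dim=256):
        super().__init__()
        self.vision, self.language = vision, language
        self.fuse = nn.Linear(2 * dim, NUM_CLASSES)  # assumed fusion: concat + project

    def forward(self, images):
        v_feats, v_logits = self.vision(images)
        # Feed the vision branch's greedy prediction to the language branch.
        l_feats, _ = self.language(v_logits.argmax(-1))
        return self.fuse(torch.cat([v_feats, l_feats], dim=-1))

# Stage 1: pre-train each branch to convergence on its own modality
# (labeled images for the vision model, text corpora for the language model),
# e.g. with cross-entropy losses on their respective heads.
vision, language = VisionModel(), LanguageModel()

# Stage 2: combine the pre-trained branches and fine-tune end-to-end.
model = PMMN(vision, language)
logits = model(torch.randn(2, 3, 32, 100))  # dummy batch of text-line images
print(logits.shape)                         # torch.Size([2, 25, 97])
```

The key design point the abstract emphasizes is that each branch reaches convergence on unlabeled or cheaply labeled single-modality data before any joint training, so the expensive well-annotated images are only needed in the fine-tuning stage.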
Keywords:
Corresponding Author:
Email Address:
Source:
PATTERN RECOGNITION LETTERS
ISSN: 0167-8655
Year: 2021
Volume: 151
Pages: 103-111
Impact Factor: 5.100 (JCR@2022)
ESI Field: ENGINEERING
ESI Highly Cited Threshold: 87
JCR Quartile: Q2
Affiliated Department: