Abstract:
A benchmark database is essential for evaluating the performance of segmentation algorithms for touching character strings. In this paper, we present a new database of touching Tibetan character strings. First, using previously proposed layout analysis and text-line segmentation algorithms, we segment scanned images of historical Tibetan documents into text-line images. Then, we find candidate touching Tibetan character strings using connected component analysis and select the genuine touching samples from among them. Finally, we annotate the data manually and build the touching character database. The database contains 5,844 images of two touching characters and 1,399 images of more than two touching characters. It can be used to evaluate segmentation algorithms for touching Tibetan character strings. For each image, the annotated ground truth file includes the class labels, the candidate segmentation points, the baseline, and the average stroke width of a single Tibetan character. According to the type of touching, we divide the touching character strings into three types: AB, OB and BB. We also count the number of samples of each type and find that 76.27% of the samples belong to the third type (BB). Finally, we report the performance of an over-segmentation algorithm on this database as a reference. © Springer Nature Switzerland AG 2018.
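The candidate-finding step can be illustrated with a minimal sketch. The abstract says only that connected component analysis is used; it does not state how candidates are chosen. The width criterion below, along with the names `connected_components`, `candidate_touching`, `avg_char_width`, and `ratio`, is therefore an assumption made for illustration and is not the authors' method. The sketch labels 8-connected ink regions in a binarized text line and keeps those that are much wider than a typical single character.

```python
from collections import deque


def connected_components(image):
    """Return bounding boxes (top, left, bottom, right) of 8-connected ink regions.

    `image` is a list of rows holding 0 (background) or 1 (ink).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def candidate_touching(boxes, avg_char_width, ratio=1.5):
    """Keep boxes wider than `ratio` times the average character width.

    Assumed heuristic: a component much wider than one character
    probably contains several touching characters.
    """
    return [b for b in boxes if (b[3] - b[1] + 1) > ratio * avg_char_width]


# Toy text line: a narrow component on the left, a wide one on the right.
text_line = [
    [1, 1, 0, 0, 1, 1, 1, 1, 1, 0],
    [1, 1, 0, 0, 1, 1, 1, 1, 1, 0],
    [1, 1, 0, 0, 0, 0, 1, 0, 0, 0],
]
boxes = connected_components(text_line)
candidates = candidate_touching(boxes, avg_char_width=2)
print(candidates)  # [(0, 4, 2, 8)]
```

Candidates found this way would still need the screening and manual annotation steps described in the abstract, since a wide component may also be a single wide character or noise.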