Similar Document Identification Technology Based on an Improved Simhash Algorithm (基于改进的Simhash算法的相似文档识别技术)

Author:

Zhang Xinglan (张兴兰) | He Dandan (何丹丹)

Indexed by:

CQVIP

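Abstract:

[Purpose/Significance]: To detect similar texts in massive text collections more efficiently and accurately. [Method]: This paper studies and improves similar-document identification based on the Simhash algorithm, modifying how the Simhash signature value is computed. The ICTCLAS word segmentation system is used in the segmentation stage, and the weights of text feature words are computed with TF-IDF, with the feature word's part of speech, word length, whether it is a marker word, and whether it appears in the title also taken into account in the weight calculation. Finally, Hamming distance is used to compare document signature values so that similar documents can be found accurately among massive document collections. [Conclusion]: By improving the TF-IDF weighting, the improved Simhash algorithm achieves higher accuracy in similar-document identification than other algorithms.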
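The pipeline described in the abstract (weighted feature words, a Simhash signature, and a Hamming-distance comparison) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give their weighting formula, so the smoothed IDF, the MD5-based term hash, the 64-bit signature length, and the helper `boost_title_terms` with its `title_boost` factor are all assumptions made for illustration. Documents are assumed to be already segmented into token lists (the paper uses ICTCLAS for that step, which is not reproduced here).

```python
import hashlib
import math
from collections import Counter


def tf_idf_weights(doc_tokens, corpus):
    """Plain TF-IDF weights for one tokenized document against a tokenized corpus."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    weights = {}
    for term, count in tf.items():
        df = sum(1 for doc in corpus if term in doc)
        # Smoothed IDF; the exact variant is an assumption, not taken from the paper.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[term] = (count / len(doc_tokens)) * idf
    return weights


def boost_title_terms(weights, title_tokens, title_boost=2.0):
    """Hypothetical adjustment: scale the weight of terms that occur in the title."""
    title = set(title_tokens)
    return {t: w * title_boost if t in title else w for t, w in weights.items()}


def simhash(weights):
    """64-bit weighted Simhash: each term adds +w or -w to every bit position."""
    votes = [0.0] * 64
    for term, w in weights.items():
        digest = hashlib.md5(term.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the term hash
        for i in range(64):
            votes[i] += w if (h >> i) & 1 else -w
    # A bit of the signature is 1 where the weighted vote is positive.
    return sum(1 << i for i in range(64) if votes[i] > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")
```

In use, two documents would be treated as similar when the Hamming distance between their signatures falls below a small threshold; the threshold is a tuning choice that the abstract does not specify.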

Keyword:

Fingerprint Calculation; Similar Document Detection; Simhash Algorithm; Hamming Distance; TF-IDF Algorithm

Author Community:

[ 1 ] [Zhang Xinglan] Beijing University of Technology, Beijing
[ 2 ] [He Dandan] Beijing University of Technology, Beijing

Related Keywords:

Optimized TF-IDF Algorithm with the Adaptive Weight of Position of Word
2016, 2nd International Conference on Artificial Intelligence and Industrial Engineering (AIIE)
Research on Text Classification Based on TF-IDF Improved with the Chi-Square Statistic
2019, Electronics World (电子世界)
Research on keyword extraction and sentiment orientation analysis of educational texts
2017, Journal of Computers (Taiwan)
Interdisciplinary Attribute Evaluation of Postgraduate Supervisors in Beijing University of Technology
2022, 4th International Conference on Advanced Information Science and System, AISS 2022

Source:

Computer Science and Application (计算机科学与应用)

ISSN: 2161-8801

Year: 2020

Issue: 02

Volume: 10

Page: 371-378
