A Novel Method of Chinese Text Content Analysis and Mining based on Statistical Models

Author：

Jiao, Kaixiang

Indexed by：

EI Scopus

Abstract：

With the accumulation of various kinds of text data, it is no longer possible to generalize or classify them by manual reading, so how to use statistical models to mine text data reasonably and effectively has become an important issue in academic research and practical work. This paper discusses three problems of Chinese text mining: word separation, keyword extraction and text classification. For the word separation problem, the Cascaded Hidden Markov Model and the WDM that treats the segmentation between words as missing data and solves it with the EM algorithm are introduced. For the keyword extraction problem, this paper proposes a Bayes factor and introduces CCS using sparse regression. For the text classification problem, the method of building a classifier based on the frequency of keywords and the method of building a classifier based on the probability of the topic first are introduced. We give the respective advantages of each method by comparing the above methods with two datasets using SVM and Random forest, and make suggestions of their use. © 2023 SPIE.
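
The abstract describes word segmentation as a hidden quantity to be inferred from a statistical model. As a rough illustration only, the sketch below picks the most probable split of a string under fixed unigram word probabilities using dynamic programming. It is not the paper's WDM or Cascaded Hidden Markov Model, and it omits the EM step that would re-estimate the probabilities. The function name `best_segmentation`, the `max_len` parameter, and the toy dictionary with its probabilities are all assumptions made for this example.

```python
import math

def best_segmentation(text, word_probs, max_len=4):
    """Return the most probable split of `text` into dictionary words,
    or None if no split into known words exists.

    Hypothetical helper: word_probs maps word -> probability (assumed given).
    """
    n = len(text)
    # best[i] = (best log-probability of text[:i], start index of its last word)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(word_probs[word])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None
    # Walk the back-pointers from the end of the string to recover the words.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]

# Toy dictionary with made-up probabilities, for illustration only.
toy_probs = {"研究": 0.3, "生命": 0.2, "研究生": 0.1,
             "命": 0.05, "起源": 0.2, "的": 0.15}
print(best_segmentation("研究生命的起源", toy_probs))
```

In an EM-style procedure, a step like this (or its expectation over all candidate splits) would alternate with re-estimating the word probabilities from the resulting segmentations; the sketch covers only the decoding half under fixed probabilities.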

Keyword：

Text processing; Natural language processing systems; Big data; Learning algorithms; Classification (of information); Learning systems; Extraction; Support vector machines; Hidden Markov models; Data mining

Author Community：

[1] [Jiao, Kaixiang] Beijing University of Technology, Beijing 100124, China

Related Articles：

Sentiment classification using the theory of ANNs
2010，Journal of China Universities of Posts and Telecommunications
Topological Data Analysis of Two Cases: Text Classification and Business Customer Relationship Management
2020，2020 4th International Workshop on Advanced Algorithms and Control Engineering, IWAACE 2020
Research on Keyword Extraction Algorithm Using PMI and TextRank
2019，2nd IEEE International Conference on Information and Computer Technologies, ICICT 2019
Multi-label Text Classification with Deep Neural Networks
2018，6th IEEE International Conference on Network Infrastructure and Digital Content, IC-NIDC 2018

Source ：

ISSN： 0277-786X

Year： 2023

Volume： 12597

Language： English
