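Research on Text Clustering of Micro-Blog Public Opinion: Word Sense Cluster and Collocation-Based Method - Journal of Nanjing Normal University (Natural Science Edition)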

Article Info

Title:: Research on Text Clustering of Micro-Blog Public Opinion: Word Sense Cluster and Collocation-Based Method

Author(s):: Wang Hengjing (王恒静)¹; Cao Cungen (曹存根)²; Gao Shang (高尚)¹ (1. School of Computer Science and Engineering, Jiangsu University of Science and Technology, Zhenjiang 212003, China) (2. Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China)

Abstract:: Micro-blogs are a recently emerged internet platform for information exchange, characterized by dispersed topics, short texts, and a free writing style, and they can have a huge impact on society. Information supervision departments and commercial enterprises therefore have an urgent need for public opinion analysis based on micro-blog information. This paper presents a novel collocation-based method for text clustering. The method first preprocesses the micro-blog text, then uses a word sense cluster model to automatically extract effective collocations, and finally clusters the texts with a model based on those effective collocations. Experiments show that text clustering using word sense clusters outperforms traditional text clustering by 6.3%, and that the method proposed in this paper outperforms text clustering using word sense clusters by 16.8%, which demonstrates the effectiveness of the method.
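The three stages named in the abstract (preprocessing, collocation extraction through word sense clusters, clustering on collocations) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not give the algorithms, so the English example posts, the `WORD_CLASSES` lexicon, the stopword list, the adjacent-pair definition of a collocation, the Jaccard similarity, and the greedy single-pass clustering are all assumptions chosen only to make the idea concrete.

```python
# Hypothetical word-class lexicon: surface words mapped to a shared class label.
# In the paper this role is played by a word sense cluster model; the entries
# below are invented for illustration.
WORD_CLASSES = {
    "earthquake": "EARTHQUAKE", "tremor": "EARTHQUAKE",
    "hits": "STRIKE", "shakes": "STRIKE",
    "price": "PRICE", "cost": "PRICE",
    "rises": "INCREASE", "jumps": "INCREASE",
}
# Hypothetical stopword list.
STOPWORDS = {"the", "a", "an", "in", "of", "again", "today"}


def preprocess(post):
    """Lowercase, strip simple punctuation, and drop stopwords."""
    tokens = [t.strip(".,!?#@") for t in post.lower().split()]
    return [t for t in tokens if t and t not in STOPWORDS]


def collocations(tokens):
    """Return adjacent pairs of word classes (assumed collocation definition).

    Tokens without a known class are skipped before pairing.
    """
    classes = [WORD_CLASSES[t] for t in tokens if t in WORD_CLASSES]
    return set(zip(classes, classes[1:]))


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(posts, threshold=0.5):
    """Greedy single-pass clustering on collocation sets (an assumed stand-in).

    Each post joins the first cluster whose seed features are similar enough,
    otherwise it starts a new cluster. Returns lists of post indices.
    """
    clusters = []  # list of (seed_features, member_indices)
    for i, post in enumerate(posts):
        feats = collocations(preprocess(post))
        for seed, members in clusters:
            if jaccard(seed, feats) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((feats, [i]))
    return [members for _, members in clusters]


posts = [
    "Earthquake hits the city",
    "A tremor shakes the city again!",
    "Fuel price rises today",
    "The cost of fuel jumps",
]
groups = cluster(posts)
print(groups)
```

In this sketch, posts that share no surface words can still land in the same cluster because their words map to the same class pair, which is the intuition behind clustering on class-level collocations rather than raw words. Real micro-blog text in Chinese would additionally need word segmentation, which is omitted here.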