
A Novel Text Ensemble Clustering Based on Weighted Entropy Filtering Model
Author(s) - Qiaoyun Shen, Yican Qiu
Publication year - 2021
Publication title - Journal of Physics: Conference Series
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.21
H-Index - 85
eISSN - 1742-6596
pISSN - 1742-6588
DOI - 10.1088/1742-6596/2024/1/012045
Subject(s) - cluster analysis , correlation clustering , data mining , cure data clustering algorithm , single linkage clustering , computer science , k medians clustering , entropy (arrow of time) , fuzzy clustering , artificial intelligence , consensus clustering , data stream clustering , pattern recognition (psychology) , mathematics , quantum mechanics , physics
Text clustering is one of the core techniques underlying natural language processing, and ensemble clustering improves its robustness. Existing research shows that the quality and diversity of the basic clusterings strongly influence the consensus clustering, and that this effect is particularly significant for text clustering. However, few studies aim to reduce the number of low-quality clusterings in an ensemble. This paper proposes a novel clustering filtering model based on an entropy criterion. The entropy criterion is used to evaluate the uncertainty of each cluster with respect to the ensemble. Based on this cluster uncertainty, two indexes are proposed: the Clustering Trend Index (CTI), which indicates the contribution of each cluster to its basic clustering, and the Cluster Consistency Index (CCI), which indicates the degree of cluster dispersion within the basic clustering. The filtering model is built on a new weight computed from the two proposed indexes. Dropping the low-quality clusterings increases the proportion of high-quality clusterings in the ensemble. Extensive experiments on various real text data sets, using optimal thresholds, show that the proposed method greatly improves accuracy and robustness and outperforms existing ensemble clustering algorithms.
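The abstract does not give formulas for the entropy criterion, CTI, or CCI, so the following is only a minimal sketch of the general idea of entropy-based filtering, under stated assumptions. It assumes one common formulation of cluster uncertainty: for a cluster, sum over every basic clustering the Shannon entropy of how the cluster's members are spread across that clustering's labels. The per-clustering score (mean cluster entropy), the function names, the toy ensemble, and the threshold of 1.5 are all hypothetical and are not the paper's weighting scheme.

```python
import math
from collections import Counter


def cluster_entropy(cluster, ensemble):
    """Uncertainty of one cluster (a set of object ids) w.r.t. the ensemble.

    Assumed formulation: for each basic clustering, take the Shannon
    entropy (base 2) of the label distribution of the cluster's members,
    then sum over all basic clusterings. 0 means every clustering keeps
    the cluster's members together.
    """
    size = len(cluster)
    total = 0.0
    for labels in ensemble:
        counts = Counter(labels[obj] for obj in cluster)
        for n in counts.values():
            p = n / size
            total -= p * math.log2(p)
    return total


def clusters_of(labels):
    """Group object ids by label: {id: label} -> list of sets of ids."""
    groups = {}
    for obj, lab in labels.items():
        groups.setdefault(lab, set()).add(obj)
    return list(groups.values())


def filter_ensemble(ensemble, threshold):
    """Return indices of basic clusterings whose mean cluster entropy
    does not exceed the (hypothetical) threshold."""
    kept = []
    for i, labels in enumerate(ensemble):
        clusters = clusters_of(labels)
        mean_h = sum(cluster_entropy(c, ensemble) for c in clusters) / len(clusters)
        if mean_h <= threshold:
            kept.append(i)
    return kept


# Toy ensemble: three basic clusterings of four documents (ids 0..3).
# The first two agree with each other; the third disagrees with both.
ensemble = [
    {0: "a", 1: "a", 2: "b", 3: "b"},
    {0: "x", 1: "x", 2: "y", 3: "y"},
    {0: "p", 1: "q", 2: "p", 3: "q"},
]

kept = filter_ensemble(ensemble, threshold=1.5)
print(kept)
```

In this toy ensemble the third clustering splits every cluster that the other two agree on, so its clusters carry higher entropy and it is the one dropped. A consensus function would then be run only on the clusterings that remain.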