Data mining for text categorization with semi‐supervised agglomerative hierarchical clustering
Author(s) -
Skarmeta Antonio Gómez,
Bensaid Amine,
Tazi Nadia
Publication year - 2000
Publication title -
International Journal of Intelligent Systems
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.291
H-Index - 87
eISSN - 1098-111X
pISSN - 0884-8173
DOI - 10.1002/(sici)1098-111x(200007)15:7<633::aid-int4>3.0.co;2-8
Subject(s) - computer science , hierarchical clustering , cluster analysis , artificial intelligence , single linkage clustering , pattern recognition (psychology) , classifier (uml) , brown clustering , curse of dimensionality , naive bayes classifier , categorization , data mining , feature selection , machine learning , canopy clustering algorithm , correlation clustering , support vector machine
In this paper we study the use of a semi‐supervised agglomerative hierarchical clustering (ssAHC) algorithm for text categorization, which consists of assigning text documents to predefined categories. ssAHC is (i) a clustering algorithm that (ii) uses a finite design set of labeled data to (iii) help agglomerative hierarchical clustering (AHC) algorithms partition a finite set of unlabeled data and then (iv) terminates without the capability to label other objects. We first describe the text representation method used in this work; we then present a feature selection method used to reduce the dimensionality of the feature space. Finally, we apply the ssAHC algorithm to the Reuters database of documents and show that its performance is superior to that of the Bayes classifier and of the Expectation‐Maximization algorithm combined with the Bayes classifier. We also show that ssAHC helps AHC techniques improve their performance. © 2000 John Wiley & Sons, Inc.
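The abstract does not spell out how the labeled design set steers the merging, so the following is only an illustrative sketch of the general idea, not the authors' ssAHC procedure. It assumes single linkage with Euclidean distance, forbids merging two clusters that carry different known categories, and stops once one cluster per category remains. The function name `constrained_ahc`, its parameters, and the toy data are all hypothetical.

```python
from itertools import combinations
from math import dist


def constrained_ahc(points, seed_labels):
    """Single-linkage agglomerative clustering guided by a few labeled points.

    Assumption: this merge rule is illustrative only; it is not taken
    from the paper.

    points: list of equal-length numeric tuples (e.g. term-weight vectors).
    seed_labels: dict mapping point index -> known category (the design set).
    Returns one predicted category per point (None if a point's cluster
    never absorbed a labeled point).
    """
    # Every point starts as its own cluster: (member indices, category or None).
    clusters = [([i], seed_labels.get(i)) for i in range(len(points))]
    n_categories = len(set(seed_labels.values()))

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(dist(points[i], points[j]) for i in a[0] for j in b[0])

    while len(clusters) > n_categories:
        # Candidate merges exclude pairs that carry two different categories.
        candidates = [
            (linkage(a, b), ia, ib)
            for (ia, a), (ib, b) in combinations(enumerate(clusters), 2)
            if a[1] is None or b[1] is None or a[1] == b[1]
        ]
        if not candidates:
            break
        _, ia, ib = min(candidates)
        a, b = clusters[ia], clusters[ib]
        merged = (a[0] + b[0], a[1] if a[1] is not None else b[1])
        clusters = [c for k, c in enumerate(clusters) if k not in (ia, ib)]
        clusters.append(merged)

    # The procedure only labels the points it was given, matching item (iv).
    predicted = [None] * len(points)
    for members, category in clusters:
        for i in members:
            predicted[i] = category
    return predicted


# Toy data: two well-separated groups, one labeled point in each.
docs = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(constrained_ahc(docs, {0: "A", 3: "B"}))
# → ['A', 'A', 'A', 'B', 'B', 'B']
```

As in item (iv) of the abstract, the sketch returns labels only for the points passed in and produces no reusable classifier for new documents.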