Premium
Selective sampling for approximate clustering of very large data sets
Author(s) -
Wang Liang,
Bezdek James C.,
Leckie Christopher,
Kotagiri Ramamohanarao
Publication year - 2008
Publication title -
international journal of intelligent systems
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.291
H-Index - 87
eISSN - 1098-111X
pISSN - 0884-8173
DOI - 10.1002/int.20268
Subject(s) - cluster analysis , sampling (signal processing) , fuzzy clustering , computer science , algorithm , data mining , relational database , correlation clustering , set (abstract data type) , mathematics , pattern recognition (psychology) , artificial intelligence , filter (signal processing) , computer vision , programming language
A key challenge in pattern recognition is how to scale the computational efficiency of clustering algorithms on large data sets. The extension of non‐Euclidean relational fuzzy c‐means (NERF) clustering to very large (VL = unloadable) relational data is called the extended NERF (eNERF) clustering algorithm, which comprises four phases: (i) finding distinguished features that monitor progressive sampling; (ii) progressively sampling from a N × N relational matrix R N to obtain a n × n sample matrix R n ; (iii) clustering R n with literal NERF; and (iv) extending the clusters in R n to the remainder of the relational data. Previously published examples on several fairly small data sets suggest that eNERF is feasible for truly large data sets. However, it seems that phases (i) and (ii), i.e., finding R n , are not very practical because the sample size n often turns out to be roughly 50% of n , and this over‐sampling defeats the whole purpose of eNERF. In this paper, we examine the performance of the sampling scheme of eNERF with respect to different parameters. We propose a modified sampling scheme for use with eNERF that combines simple random sampling with (parts of) the sampling procedures used by eNERF and a related algorithm sVAT (scalable visual assessment of clustering tendency). We demonstrate that our modified sampling scheme can eliminate over‐sampling of the original progressive sampling scheme, thus enabling the processing of truly VL data. Numerical experiments on a distance matrix of a set of 3,000,000 vectors drawn from a mixture of 5 bivariate normal distributions demonstrate the feasibility and effectiveness of the proposed sampling method. We also find that actually running eNERF on a data set of this size is very costly in terms of computation time. Thus, our results demonstrate that further modification of eNERF, especially the extension stage, will be needed before it is truly practical for VL data. © 2008 Wiley Periodicals, Inc.