Selective sampling for approximate clustering of very large data sets | Zendy

Wang Liang | Zendy; Bezdek James C. | Zendy; Leckie Christopher | Zendy; Kotagiri Ramamohanarao | Zendy

AI Assistant Blog Pricing

Home ZAIA Blog

Premium

Selective sampling for approximate clustering of very large data sets

Author(s) -

Wang Liang,

Bezdek James C.,

Leckie Christopher,

Kotagiri Ramamohanarao

Publication year - 2008

Publication title -

international journal of intelligent systems

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 1.291

H-Index - 87

eISSN - 1098-111X

pISSN - 0884-8173

DOI - 10.1002/int.20268

Subject(s) - cluster analysis , sampling (signal processing) , fuzzy clustering , computer science , algorithm , data mining , relational database , correlation clustering , set (abstract data type) , mathematics , pattern recognition (psychology) , artificial intelligence , filter (signal processing) , computer vision , programming language

A key challenge in pattern recognition is how to scale the computational efficiency of clustering algorithms on large data sets. The extension of non‐Euclidean relational fuzzy c‐means (NERF) clustering to very large (VL = unloadable) relational data is called the extended NERF (eNERF) clustering algorithm, which comprises four phases: (i) finding distinguished features that monitor progressive sampling; (ii) progressively sampling from a N × N relational matrix R N to obtain a n × n sample matrix R n ; (iii) clustering R n with literal NERF; and (iv) extending the clusters in R n to the remainder of the relational data. Previously published examples on several fairly small data sets suggest that eNERF is feasible for truly large data sets. However, it seems that phases (i) and (ii), i.e., finding R n , are not very practical because the sample size n often turns out to be roughly 50% of n , and this over‐sampling defeats the whole purpose of eNERF. In this paper, we examine the performance of the sampling scheme of eNERF with respect to different parameters. We propose a modified sampling scheme for use with eNERF that combines simple random sampling with (parts of) the sampling procedures used by eNERF and a related algorithm sVAT (scalable visual assessment of clustering tendency). We demonstrate that our modified sampling scheme can eliminate over‐sampling of the original progressive sampling scheme, thus enabling the processing of truly VL data. Numerical experiments on a distance matrix of a set of 3,000,000 vectors drawn from a mixture of 5 bivariate normal distributions demonstrate the feasibility and effectiveness of the proposed sampling method. We also find that actually running eNERF on a data set of this size is very costly in terms of computation time. Thus, our results demonstrate that further modification of eNERF, especially the extension stage, will be needed before it is truly practical for VL data. © 2008 Wiley Periodicals, Inc.

This content is not available in your region!

Continue researching here.

Having issues? You can contact us here

Accelerating Research