A general framework for efficient clustering of large datasets based on activity detection

Author(s) - Jin Xin, Kim Sangkyum, Han Jiawei, Cao Liangliang, Yin Zhijun

Publication year - 2011

Publication title - Statistical Analysis and Data Mining: The ASA Data Science Journal

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.381

H-Index - 33

eISSN - 1932-1872

pISSN - 1932-1864

DOI - 10.1002/sam.10097

Subject(s) - cluster analysis, computer science, data mining, cure data clustering algorithm, canopy clustering algorithm, correlation clustering, data stream clustering, exploit, clustering high dimensional data, algorithm, machine learning, computer security

Data clustering is one of the most popular data mining techniques, with broad applications. K‐Means is one of the most popular clustering algorithms because of its efficiency, its effectiveness, and its wide implementation in commercial and noncommercial software. Efficient clustering of large datasets is especially useful; however, K‐Means on large data carries a heavy computational burden, which originates from the numerous distance calculations between the patterns and the centers. This paper proposes a framework, General Activity Detection (GAD), for fast clustering of large‐scale data based on center activity detection. Within this framework, we design a set of algorithms for different scenarios: (i) an exact GAD algorithm, E‐GAD, which is much faster than K‐Means and produces the same clustering result; (ii) approximate GAD algorithms with different assumptions, which are faster than E‐GAD while achieving different degrees of approximation; and (iii) GAD‐based algorithms that handle the large clusters problem, which appears in many large‐scale clustering applications. The framework provides a general way to exploit activity detection for fast clustering in both exact and approximate scenarios, and the proposed algorithms within the framework achieve very high speed. We have conducted extensive experiments on several datasets from various real‐world applications, including data compression, image clustering, and bioinformatics. Measured by clustering quality and CPU time, the experimental results show the effectiveness and high efficiency of the proposed algorithms. Copyright © 2010 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 4: 11–29, 2011
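The abstract does not spell out how center activity is detected, so the following is only a minimal sketch of one way the idea could work, not the paper's E‐GAD algorithm. It assumes a center is "active" if it moved in the last update and "static" otherwise. A pattern whose assigned center is static then only needs to be compared against the active centers, because its distances to the other static centers are unchanged. All names here (`kmeans_plain`, `kmeans_active`, `demo_points`, and the helpers) are hypothetical and chosen for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers, candidates):
    """Index of the closest candidate center; ties go to the lower index."""
    return min(candidates, key=lambda j: (sq_dist(point, centers[j]), j))


def update_centers(points, labels, centers):
    """Mean of each cluster; an empty cluster keeps its previous center."""
    k, dim = len(centers), len(centers[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, j in zip(points, labels):
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return [
        tuple(s / counts[j] for s in sums[j]) if counts[j] else centers[j]
        for j in range(k)
    ]


def kmeans_plain(points, init_centers, max_iter=100):
    """Baseline K-Means: every pattern is compared with every center."""
    centers = [tuple(c) for c in init_centers]
    ids = list(range(len(centers)))
    labels, n_dist = None, 0
    for _ in range(max_iter):
        new_labels = [nearest(p, centers, ids) for p in points]
        n_dist += len(points) * len(ids)
        if new_labels == labels:
            break
        labels = new_labels
        centers = update_centers(points, labels, centers)
    return labels, centers, n_dist


def kmeans_active(points, init_centers, max_iter=100):
    """Sketch (assumed rule, not the paper's): skip static centers."""
    centers = [tuple(c) for c in init_centers]
    ids = list(range(len(centers)))
    labels = [nearest(p, centers, ids) for p in points]
    n_dist = len(points) * len(ids)
    for _ in range(max_iter):
        new_centers = update_centers(points, labels, centers)
        # Assumption: a center is "active" if the last update moved it.
        active = [j for j in ids if new_centers[j] != centers[j]]
        centers = new_centers
        if not active:
            break
        active_set = set(active)
        changed = False
        for i, p in enumerate(points):
            cur = labels[i]
            # If the pattern's own center is static, only active centers
            # can have become closer; otherwise check all centers.
            cand = ids if cur in active_set else [cur] + active
            n_dist += len(cand)
            best = nearest(p, centers, cand)
            if best != cur:
                labels[i] = best
                changed = True
        if not changed:
            break
    return labels, centers, n_dist


def demo_points(seed=0, per_blob=50):
    """Hypothetical test data: four 2-D Gaussian blobs."""
    rng = random.Random(seed)
    blobs = [(0, 0), (8, 0), (0, 8), (8, 8)]
    return [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
            for cx, cy in blobs for _ in range(per_blob)]
```

Calling both functions with the same points and the same initial centers should give identical labels and centers, while the third return value (the number of distance evaluations) is never larger for `kmeans_active`. How much is saved depends on how quickly centers stop moving, which this sketch does not attempt to quantify.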
