Efficient mining of distance‐based subspace clusters | Zendy

Liu Guimei | Zendy; Sim Kelvin | Zendy; Li Jinyan | Zendy; Wong Limsoon | Zendy

AI Assistant Blog Pricing

Home ZAIA Blog

Premium

Efficient mining of distance‐based subspace clusters

Author(s) -

Liu Guimei,

Sim Kelvin,

Li Jinyan,

Wong Limsoon

Publication year - 2009

Publication title -

statistical analysis and data mining: the asa data science journal

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.381

H-Index - 33

eISSN - 1932-1872

pISSN - 1932-1864

DOI - 10.1002/sam.10062

Subject(s) - computer science , partition (number theory) , data mining , cluster analysis , subspace topology , sliding window protocol , grid , rectangle , linear subspace , key (lock) , space partitioning , cluster (spacecraft) , similarity (geometry) , algorithm , window (computing) , pattern recognition (psychology) , artificial intelligence , mathematics , image (mathematics) , geometry , computer security , combinatorics , programming language , operating system

Traditional similarity measurements often become meaningless when dimensions of datasets increase. Subspace clustering has been proposed to find clusters embedded in subspaces of high‐dimensional datasets. Many existing algorithms use a grid‐based approach to partition the data space into nonoverlapping rectangle cells, and then identify connected dense cells as clusters. The rigid boundaries of the grid‐based approach may cause a real cluster to be divided into several small clusters. In this paper, we propose to use a sliding‐window approach to partition the dimensions to preserve significant clusters. We call this model nCluster model. The sliding‐window approach generates more bins than the grid‐based approach, thus it incurs higher mining cost. We develop a deterministic algorithm, called MaxnCluster, to mine nClusters efficiently. MaxnCluster uses several techniques to speed up the mining, and it produces only maximal nClusters to reduce result size. Non‐maximal nClusters are pruned without the need of storing the discovered nClusters in the memory, which is key to the efficiency of MaxnCluster. Our experiment results show that (i) the nCluster model can indeed preserve clusters that are shattered by the grid‐based approach on synthetic datasets; (ii) the nCluster model produces more significant clusters than the grid‐based approach on two real gene expression datasets and (iii) MaxnCluster is efficient in mining maximal nClusters. Copyright © 2009 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 2: 427‐444, 2009

This content is not available in your region!

Continue researching here.

Having issues? You can contact us here

Accelerating Research