The Relevant‐Set Correlation Model for Data Clustering
Author(s) - Houle, Michael E.
Publication year - 2008
Publication title - Statistical Analysis and Data Mining: The ASA Data Science Journal
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.381
H-Index - 33
eISSN - 1932-1872
pISSN - 1932-1864
DOI - 10.1002/sam.10013
Subject(s) - cluster analysis , computer science , data mining , oracle , set (abstract data type) , heuristic , similarity (geometry) , data set , representation (politics) , cluster (spacecraft) , scalability , correlation , information retrieval , artificial intelligence , mathematics , database , image (mathematics) , software engineering , politics , political science , law , programming language , geometry
This paper introduces a model for clustering, the Relevant‐Set Correlation (RSC) model, that requires no direct knowledge of the nature or representation of the data. Instead, the RSC model relies solely on the existence of an oracle that accepts a query in the form of a reference to a data item, and returns a ranked set of references to items that are most relevant to the query. The quality of cluster candidates, the degree of association between pairs of cluster candidates, and the degree of association between clusters and data items are all assessed according to the statistical significance of a form of correlation among pairs of relevant sets and/or candidate cluster sets. The RSC significance measures can be used to evaluate the relative importance of cluster candidates of various sizes, avoiding the problems of bias found with other shared‐neighbor methods that use fixed neighborhood sizes. A scalable clustering heuristic based on the RSC model is also presented and demonstrated for large, high‐dimensional datasets using a fast approximate similarity search structure as the oracle. © 2008 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 1: 000‐000, 2008
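The oracle-plus-correlation idea in the abstract can be illustrated with a small sketch. The abstract does not state the correlation formula, so the code below assumes one natural reading: the Pearson correlation between the membership indicator vectors of two sets drawn from a dataset of n items. The names `make_brute_force_oracle` and `set_correlation` are hypothetical, and the brute-force Euclidean ranking merely stands in for the fast approximate similarity search structure mentioned in the abstract. The statistical significance testing and the clustering heuristic are not reproduced here.

```python
import math
from typing import Callable, List, Sequence, Set, Tuple

Point = Tuple[float, ...]


def make_brute_force_oracle(points: Sequence[Point]) -> Callable[[int, int], List[int]]:
    """Build an oracle over item indices (illustrative stand-in, not the paper's structure).

    The returned function takes a reference to an item (its index) and a
    size k, and returns the k item indices ranked by increasing Euclidean
    distance to the query. Ties are broken by index.
    """
    def oracle(query: int, k: int) -> List[int]:
        ranked = sorted(
            range(len(points)),
            key=lambda j: (math.dist(points[query], points[j]), j),
        )
        return ranked[:k]

    return oracle


def set_correlation(a: Set[int], b: Set[int], n: int) -> float:
    """Pearson correlation of the indicator vectors of sets a and b.

    Assumption: this is one plausible 'form of correlation' between sets;
    it is undefined when either set is empty or covers all n items.
    """
    if not a or not b or len(a) == n or len(b) == n:
        raise ValueError("correlation undefined for empty or full sets")
    expected_overlap = len(a) * len(b) / n
    scale = math.sqrt(len(a) * len(b) * (1 - len(a) / n) * (1 - len(b) / n))
    return (len(a & b) - expected_overlap) / scale


# Two well-separated groups of three points each (made-up toy data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
oracle = make_brute_force_oracle(points)

rel_0 = set(oracle(0, 3))  # relevant set of item 0
rel_1 = set(oracle(1, 3))  # relevant set of item 1 (same group)
rel_3 = set(oracle(3, 3))  # relevant set of item 3 (other group)

same_group = set_correlation(rel_0, rel_1, len(points))
cross_group = set_correlation(rel_0, rel_3, len(points))
```

In this toy data, items from the same group return identical relevant sets and so correlate strongly, while relevant sets from different groups are disjoint and correlate negatively. Because the measure normalizes by set size, sets of different sizes can be compared on one scale, which is in the spirit of the abstract's point about avoiding fixed neighborhood sizes.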