The Relevant‐Set Correlation Model for Data Clustering

Author(s) - Houle Michael E.

Publication year - 2008

Publication title - Statistical Analysis and Data Mining: The ASA Data Science Journal

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.381

H-Index - 33

eISSN - 1932-1872

pISSN - 1932-1864

DOI - 10.1002/sam.10013

Subject(s) - cluster analysis, computer science, data mining, oracle, set (abstract data type), heuristic, similarity (geometry), data set, representation (politics), cluster (spacecraft), scalability, correlation, information retrieval, artificial intelligence, mathematics, database, image (mathematics), software engineering, politics, political science, law, programming language, geometry

This paper introduces a model for clustering, the Relevant‐Set Correlation (RSC) model, that requires no direct knowledge of the nature or representation of the data. Instead, the RSC model relies solely on the existence of an oracle that accepts a query in the form of a reference to a data item, and returns a ranked set of references to items that are most relevant to the query. The quality of cluster candidates, the degree of association between pairs of cluster candidates, and the degree of association between clusters and data items are all assessed according to the statistical significance of a form of correlation among pairs of relevant sets and/or candidate cluster sets. The RSC significance measures can be used to evaluate the relative importance of cluster candidates of various sizes, avoiding the problems of bias found with other shared‐neighbor methods that use fixed neighborhood sizes. A scalable clustering heuristic based on the RSC model is also presented and demonstrated for large, high‐dimensional datasets using a fast approximate similarity search structure as the oracle. © 2008 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 1: 000‐000, 2008
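The oracle-only idea in the abstract can be illustrated with a small sketch. The abstract does not state the correlation formula or the significance tests, so everything below is an assumption made for illustration: the "form of correlation" is taken to be the Pearson correlation between the 0/1 membership vectors of two sets, the oracle is a brute-force nearest-neighbour ranking standing in for the approximate similarity search structure, and the names `make_brute_force_oracle`, `set_correlation` and `self_correlation` are hypothetical. Note that the scoring functions only ever see item references and ranked result sets, never the underlying data.

```python
import math
from typing import Callable, List, Sequence, Set

# An oracle maps (query item reference, k) to a ranked list of k item references.
Oracle = Callable[[int, int], List[int]]


def make_brute_force_oracle(points: Sequence[Sequence[float]]) -> Oracle:
    """Hypothetical stand-in oracle: exact Euclidean ranking over all points."""

    def oracle(query: int, k: int) -> List[int]:
        q = points[query]
        # Sort by distance to the query; ties are broken by index for determinism.
        ranked = sorted(range(len(points)),
                        key=lambda i: (math.dist(q, points[i]), i))
        return ranked[:k]

    return oracle


def set_correlation(a: Set[int], b: Set[int], n: int) -> float:
    """Pearson correlation of the membership indicator vectors of a and b,
    both subsets of a data set with n items (assumed form, see lead-in)."""
    if not (0 < len(a) < n and 0 < len(b) < n):
        raise ValueError("both sets must be non-empty proper subsets")
    numerator = n * len(a & b) - len(a) * len(b)
    denominator = math.sqrt(len(a) * (n - len(a)) * len(b) * (n - len(b)))
    return numerator / denominator


def self_correlation(candidate: Set[int], oracle: Oracle, n: int) -> float:
    """Average correlation between a candidate cluster and the relevant sets
    (of the same size) returned by the oracle for each of its members."""
    k = len(candidate)
    scores = [set_correlation(candidate, set(oracle(v, k)), n)
              for v in candidate]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Two well-separated groups of three points each (toy data).
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    oracle = make_brute_force_oracle(pts)
    print(self_correlation({0, 1, 2}, oracle, len(pts)))  # coherent group
    print(self_correlation({0, 1, 3}, oracle, len(pts)))  # mixed group
```

In this toy setup, a candidate whose members all retrieve one another scores higher than a candidate that mixes the two groups. Because the relevant-set size is matched to the candidate size rather than fixed in advance, candidates of different sizes can be scored with the same function; the paper's actual significance measures are not reproduced here.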
