Clustering large data sets described with discrete distributions and its application on TIMSS data set

Author(s) - Korenjak-Černe Simona, Batagelj Vladimir, Japelj Pavešić Barbara

Publication year - 2011

Publication title - Statistical Analysis and Data Mining: The ASA Data Science Journal

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.381

H-Index - 33

eISSN - 1932-1872

pISSN - 1932-1864

DOI - 10.1002/sam.10105

Subject(s) - cluster analysis, hierarchical clustering, computer science, data mining, set (abstract data type), representation (politics), data set, consensus clustering, correlation clustering, theoretical computer science, mathematics, cure data clustering algorithm, artificial intelligence, politics, political science, law, programming language

Symbolic data analysis is based on special descriptions of data called symbolic objects. Such descriptions preserve more detailed information about the data than standard representations with mean values. Representation with distributions is one special kind of symbolic object. In the clustering process, this representation allows variables of all types to be considered at the same time. We present two clustering methods based on data descriptions with discrete distributions: the adapted leaders method and the adapted agglomerative hierarchical clustering method of Ward. The two methods are compatible: they can be viewed as two approaches to solving the same clustering optimization problem. In the obtained clustering, a leader is assigned to each cluster. The descriptions of the leaders offer a simple interpretation of the clusters' characteristics. The leaders method efficiently solves clustering problems with a large number of units, while the agglomerative method is applied to the obtained leaders and allows the number of clusters to be chosen on the basis of the corresponding dendrogram. Both methods were successfully applied in analyses of different data sets. In this paper, an application to the Trends in International Mathematics and Science Study (TIMSS) data set is presented. The descriptions with distributions enable us to combine two data sets, the answers of teachers and the answers of their students, into one data set. The descriptions of the obtained clusters enable us to interpret the results in a more understandable way. © 2011 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 4: 199–215, 2011
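The two-stage idea in the abstract (leaders method on many units, then agglomerative clustering of the resulting leaders) can be illustrated with a minimal Python sketch. It rests on assumptions the abstract does not confirm: each unit is a single discrete distribution, dissimilarity is the squared Euclidean distance, and a leader is the mean of its cluster's distributions. The function names (`sq_dist`, `mean_dist`, `leaders`, `ward_merge`) and the toy data are hypothetical; the paper's adapted methods may use a different dissimilarity and weighting.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two discrete distributions (assumed dissimilarity)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_dist(rows):
    """Component-wise mean of several distributions; the result is again a distribution."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def leaders(units, k, iters=100, seed=0):
    """Leaders method sketch: assign each unit to its nearest leader, then recompute leaders."""
    rng = random.Random(seed)
    lead = [list(u) for u in rng.sample(units, k)]  # start from k distinct units
    assign = None
    for _ in range(iters):
        new_assign = [min(range(k), key=lambda j: sq_dist(u, lead[j])) for u in units]
        if new_assign == assign:  # partition unchanged: local optimum reached
            break
        assign = new_assign
        for j in range(k):
            members = [u for u, a in zip(units, assign) if a == j]
            if members:  # an emptied cluster keeps its previous leader
                lead[j] = mean_dist(members)
    sizes = [assign.count(j) for j in range(k)]
    return lead, sizes, assign

def ward_merge(lead, sizes):
    """Agglomerate size-weighted leaders with Ward's criterion; return the merge heights."""
    clusters = [(list(l), n) for l, n in zip(lead, sizes) if n > 0]
    heights = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                (ci, ni), (cj, nj) = clusters[i], clusters[j]
                cost = ni * nj / (ni + nj) * sq_dist(ci, cj)  # Ward increase in error
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        cost, i, j = best
        (ci, ni), (cj, nj) = clusters[i], clusters[j]
        merged = ([(ni * a + nj * b) / (ni + nj) for a, b in zip(ci, cj)], ni + nj)
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)] + [merged]
        heights.append(cost)
    return heights

# Toy data (made up): six units, each a distribution over three answer categories.
units = [
    [0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.9, 0.05, 0.05],
    [0.1, 0.1, 0.8], [0.2, 0.1, 0.7], [0.1, 0.2, 0.7],
]
lead, sizes, assign = leaders(units, k=2)
heights = ward_merge(lead, sizes)
print(lead, sizes, heights)
```

In this sketch the leaders stay valid distributions because a mean of distributions sums to one, which is what makes them directly readable as cluster descriptions. The merge heights returned by `ward_merge` are the values one would plot in a dendrogram to choose the number of clusters.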
