Clustering large data sets described with discrete distributions and its application on TIMSS data set
Author(s) -
Korenjak-Černe Simona,
Batagelj Vladimir,
Japelj Pavešić Barbara
Publication year - 2011
Publication title -
Statistical Analysis and Data Mining: The ASA Data Science Journal
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.381
H-Index - 33
eISSN - 1932-1872
pISSN - 1932-1864
DOI - 10.1002/sam.10105
Subject(s) - cluster analysis , hierarchical clustering , computer science , data mining , set (abstract data type) , representation (politics) , data set , consensus clustering , correlation clustering , theoretical computer science , mathematics , cure data clustering algorithm , artificial intelligence , politics , political science , law , programming language
Symbolic data analysis is based on special descriptions of data called symbolic objects. Such descriptions preserve more detailed information about the data than standard representations with mean values. Representation with distributions is one special kind of symbolic object. In the clustering process, this representation lets us consider variables of all types at the same time. We present two clustering methods based on data descriptions with discrete distributions: the adapted leaders method and the adapted agglomerative hierarchical clustering method (Ward's method). The two methods are compatible: they can be viewed as two approaches to solving the same clustering optimization problem. In the obtained clustering, a leader is assigned to each cluster, and the descriptions of the leaders offer a simple interpretation of the clusters' characteristics. The leaders method efficiently solves clustering problems with a large number of units, while the agglomerative method is applied to the obtained leaders and lets us decide on the right number of clusters on the basis of the corresponding dendrogram. Both methods have been successfully applied in analyses of different data sets. In this paper, an application to the Trends in International Mathematics and Science Study (TIMSS) data set is presented. The descriptions with distributions let us combine two data sets, the answers of teachers and the answers of their students, into one data set. The descriptions of the obtained clusters let us interpret the results in a more understandable way. © 2011 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 4: 199–215, 2011
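The two-stage idea in the abstract (leaders method on many units, then agglomerative merging of the resulting leaders) can be illustrated with a minimal sketch. It is not the authors' adapted algorithm: it assumes each unit is a single discrete distribution, uses squared Euclidean distance between distributions, takes the leader to be the mean distribution of its cluster, and uses the classical Ward criterion for squared Euclidean distance. All function names are hypothetical.

```python
import random

def to_distribution(counts):
    """Turn category counts into relative frequencies (a discrete distribution)."""
    total = sum(counts)
    return [c / total for c in counts]

def sq_dist(p, q):
    """Squared Euclidean distance between two distributions (an assumed dissimilarity)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def leaders_method(units, k, iterations=20, seed=0):
    """k-means-style leaders method: alternate assignment and leader update."""
    rng = random.Random(seed)
    leaders = [list(u) for u in rng.sample(units, k)]
    assignment = [0] * len(units)
    for _ in range(iterations):
        # Assign every unit to its nearest leader.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(u, leaders[j])) for u in units
        ]
        # Recompute each leader as the mean distribution of its members.
        for j in range(k):
            members = [u for u, a in zip(units, new_assignment) if a == j]
            if members:
                leaders[j] = [sum(col) / len(members) for col in zip(*members)]
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return leaders, assignment

def ward_dissimilarity(leader_u, n_u, leader_v, n_v):
    """Classical Ward merge cost for two clusters of sizes n_u and n_v."""
    return (n_u * n_v) / (n_u + n_v) * sq_dist(leader_u, leader_v)

# Four units described by answer counts over three categories (made-up data).
units = [to_distribution(c) for c in [[9, 1, 0], [8, 2, 0], [0, 1, 9], [0, 2, 8]]]
leaders, assignment = leaders_method(units, k=2)
sizes = [assignment.count(j) for j in range(2)]
merge_cost = ward_dissimilarity(leaders[0], sizes[0], leaders[1], sizes[1])
print(assignment, leaders, merge_cost)
```

Because each leader here is a mean of distributions, it is itself a distribution and can be read directly as the typical answer profile of its cluster. In a full hierarchical step, the pair of leaders with the smallest merge cost would be merged repeatedly to build the dendrogram.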
