Open Access
A PRELIMINARY STUDY OF THE EFFECTS OF WITHIN‐GROUP COVARIANCE STRUCTURE ON RECOVERY IN CLUSTER ANALYSIS
Author(s) - John R. Donoghue
Publication year - 1994
Publication title - ETS Research Report Series
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.235
H-Index - 5
ISSN - 2330-8516
DOI - 10.1002/j.2333-8504.1994.tb01619.x
Subject(s) - mathematics, centroid, correlation, statistics, cluster analysis, similarity (geometry), covariance, hierarchical clustering, bivariate analysis, group (periodic table), analysis of covariance, artificial intelligence, computer science, chemistry, geometry, organic chemistry, image (mathematics)
ABSTRACT Two Monte Carlo studies investigated the effects of within‐group covariance structure on subgroup recovery by several widely used hierarchical clustering methods. Each data set consisted of 100 bivariate observations from two subgroups, generated according to a finite normal mixture model. In Study 1, subgroup size, within‐group correlation, within‐group variance, and distance between subgroup centroids were manipulated. All clustering methods were strongly affected by within‐group correlation; negative correlation yielded much poorer recovery. Smaller effects were found for the interaction of clustering method with within‐group variance. Study 2 separated the effects of the direction of correlation from the direction of differences in the subgroup centroids. Subgroup size, within‐group correlation, direction of the vector separating subgroup centroids, and distance between subgroup centroids were manipulated. Superior recovery was associated with within‐group correlation that matched the direction of subgroup separation. Overall, the EML algorithm of SAS yielded the best recovery, followed closely by Ward's method, average linkage, and a version of the beta‐flexible algorithm, although several interactions were noted. The results are interpreted in terms of the weakness of the (squared) Euclidean distance as a measure of (dis)similarity for cluster analysis. Several alternative measures are discussed, and promising alternatives are identified for future investigation.
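The design described in the abstract can be illustrated with a minimal Python sketch. It is not the report's code and makes several assumptions that the abstract does not state: both subgroups share one within-group covariance matrix, subgroup sizes are fixed rather than sampled, only a naive average linkage on squared Euclidean distance is run, and recovery is scored as simple label agreement. All function and parameter names (`sample_mixture`, `average_linkage`, `agreement`, `prop`, `rho`, `sd`, `dist`, `angle_deg`) are hypothetical.

```python
import math
import random


def sample_mixture(n=100, prop=0.5, rho=0.0, sd=1.0, dist=3.0,
                   angle_deg=45.0, seed=0):
    """Draw n bivariate points from two normal subgroups.

    Assumptions (not taken from the report): fixed subgroup sizes and a
    common within-group covariance sd^2 * [[1, rho], [rho, 1]].
    """
    rng = random.Random(seed)
    theta = math.radians(angle_deg)
    # Subgroup 0 sits at the origin; subgroup 1 lies `dist` away along
    # the direction given by `angle_deg`.
    centroids = [(0.0, 0.0),
                 (dist * math.cos(theta), dist * math.sin(theta))]
    n0 = round(n * prop)
    points, labels = [], []
    for i in range(n):
        g = 0 if i < n0 else 1
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # Cholesky factor of the correlation matrix induces correlation rho.
        x = sd * z1
        y = sd * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
        points.append((centroids[g][0] + x, centroids[g][1] + y))
        labels.append(g)
    return points, labels


def sq_euclid(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def average_linkage(points, k=2):
    """Naive agglomerative average linkage, stopped at k clusters."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(sq_euclid(points[i], points[j])
                            for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]  # b > a, so index a is unaffected
    assigned = [0] * len(points)
    for c, members in enumerate(clusters):
        for i in members:
            assigned[i] = c
    return assigned


def agreement(true, found):
    """Proportion of matching labels, allowing the two labels to swap."""
    same = sum(t == f for t, f in zip(true, found)) / len(true)
    return max(same, 1.0 - same)


if __name__ == "__main__":
    # Compare correlation aligned with vs. opposed to the 45-degree separation.
    for rho in (0.8, -0.8):
        pts, labs = sample_mixture(rho=rho, dist=3.0, angle_deg=45.0, seed=1)
        print(rho, agreement(labs, average_linkage(pts)))
```

A single replication like this says little on its own; a Monte Carlo study would repeat the draw over many seeds and design cells and would likely use a chance-corrected recovery index rather than raw agreement.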