On two‐way Bayesian agglomerative clustering of gene expression data
Author(s) - Fowler Anna, Heard Nicholas A.
Publication year - 2012
Publication title - Statistical Analysis and Data Mining: The ASA Data Science Journal
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.381
H-Index - 33
eISSN - 1932-1872
pISSN - 1932-1864
DOI - 10.1002/sam.11162
Subject(s) - hierarchical clustering , cluster analysis , curse of dimensionality , computer science , single linkage clustering , similarity (geometry) , data mining , multiplicative function , algorithm , pattern recognition (psychology) , mathematics , expression (computer science) , cure data clustering algorithm , artificial intelligence , correlation clustering , mathematical analysis , image (mathematics) , programming language
This article introduces an agglomerative Bayesian model‐based clustering algorithm which outputs a nested sequence of two‐way cluster configurations for an input matrix of data. Each two‐way cluster configuration in the output hierarchy is specified by a row configuration and a column configuration whose Cartesian product partitions the data matrix. Variable selection is incorporated into the algorithm through a mixture model, which identifies the row clusters that form distinct groups across the column clusters. A primitive similarity measure between two clusters is the multiplicative change in model posterior probability implied by their merger, and the hierarchy is formed by iteratively merging the cluster pair which maximizes some fixed monotonic function of this quantity. A naive implementation of the algorithm would choose this function to be the identity function. However, when this naive algorithm is applied to gene expression data, where the number of genes being studied typically far exceeds the number of experimental samples available, the imbalanced dimensionality of the data results in an algorithmic bias toward merging samples. To counteract this bias, alternative functions of the similarity measure are considered which prevent degenerative behavior of the algorithm. The resulting improvements in the output cluster configurations are demonstrated on simulated data and the method is then applied to real gene expression data. © 2012 Wiley Periodicals, Inc. Statistical Analysis and Data Mining, 2012
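The merging loop described in the abstract can be illustrated with a minimal Python sketch of the naive variant, in which the merge criterion is the identity function of the (log) multiplicative change. Everything model-specific here is an assumption made for illustration and is not taken from the article: each block of the row-by-column partition is given a simple conjugate normal model (entries iid N(mu, sigma2) with mu ~ N(0, tau2)), the prior over configurations is taken as uniform so that the change in posterior reduces to a change in marginal likelihood, and the variable-selection mixture model is omitted. The function names `log_marginal`, `total_log_marginal`, and `two_way_agglomerate` are hypothetical.

```python
import itertools
import numpy as np


def log_marginal(block, sigma2=1.0, tau2=1.0):
    """Log marginal likelihood of one block under an ASSUMED toy model:
    entries iid N(mu, sigma2) given mu, with prior mu ~ N(0, tau2)."""
    n = block.size
    s = block.sum()
    q = (block ** 2).sum()
    return (-0.5 * n * np.log(2 * np.pi * sigma2)
            - 0.5 * np.log1p(n * tau2 / sigma2)
            - q / (2 * sigma2)
            + tau2 * s ** 2 / (2 * sigma2 * (sigma2 + n * tau2)))


def total_log_marginal(X, row_clusters, col_clusters):
    """Sum of block log marginals over the Cartesian product of
    row clusters and column clusters."""
    return sum(log_marginal(X[np.ix_(r, c)])
               for r in row_clusters for c in col_clusters)


def snapshot(rows, cols):
    """Immutable copy of the current two-way configuration."""
    return (tuple(tuple(sorted(r)) for r in rows),
            tuple(tuple(sorted(c)) for c in cols))


def two_way_agglomerate(X):
    """Greedy two-way agglomeration (naive, identity criterion).

    Starts from singleton rows and columns and repeatedly applies the
    row-pair or column-pair merger with the largest log multiplicative
    change in the (assumed) marginal likelihood. Returns the nested
    sequence of configurations."""
    rows = [[i] for i in range(X.shape[0])]
    cols = [[j] for j in range(X.shape[1])]
    current = total_log_marginal(X, rows, cols)
    history = [snapshot(rows, cols)]
    while len(rows) > 1 or len(cols) > 1:
        best = None
        for axis, clusters in (("row", rows), ("col", cols)):
            for a, b in itertools.combinations(range(len(clusters)), 2):
                merged = [c for k, c in enumerate(clusters) if k not in (a, b)]
                merged.append(clusters[a] + clusters[b])
                if axis == "row":
                    new = total_log_marginal(X, merged, cols)
                else:
                    new = total_log_marginal(X, rows, merged)
                score = new - current  # log of the multiplicative change
                if best is None or score > best[0]:
                    best = (score, axis, merged, new)
        _, axis, merged, current = best
        if axis == "row":
            rows = merged
        else:
            cols = merged
        history.append(snapshot(rows, cols))
    return history


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 3))  # toy data: 6 "genes" by 3 "samples"
    for row_config, col_config in two_way_agglomerate(X):
        print(len(row_config), "row clusters,", len(col_config), "column clusters")
```

Because row and column mergers compete in the same candidate pool, a sketch of this kind makes it easy to inspect the order in which the two dimensions collapse. The alternative criterion functions that the article proposes to counteract the bias toward merging samples are not specified in the abstract and are therefore not reproduced here.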