A Bayesian mixture model for clustering and selection of feature occurrence rates under mean constraints

Author(s) - Li Qiwei, Guindani Michele, Reich Brian J., Bondell Howard D., Vannucci Marina

Publication year - 2017

Publication title - Statistical Analysis and Data Mining: The ASA Data Science Journal

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.381

H-Index - 33

eISSN - 1932-1872

pISSN - 1932-1864

DOI - 10.1002/sam.11350

Subject(s) - overdispersion, cluster analysis, normalization (sociology), feature selection, model selection, computer science, mixture model, pattern recognition (psychology), bayesian probability, count data, data set, poisson distribution, data mining, artificial intelligence, mathematics, statistics, sociology, anthropology

In this paper, we consider the problem of modeling a matrix of count data, where multiple features are observed as counts over a number of samples. Due to the nature of the data-generating mechanism, such data are often characterized by a high number of zeros and overdispersion. To account for the skewness and heterogeneity of the data, some type of normalization and regularization is necessary for conducting inference on the occurrences of features across samples. We propose a zero‐inflated Poisson mixture modeling framework that incorporates a model‐based normalization through prior distributions with mean constraints, as well as a feature selection mechanism, which allows us to identify a parsimonious set of discriminatory features and simultaneously cluster the samples into homogeneous groups. By means of a simulation study and an application to a bag‐of‐words benchmark data set, where the features are the frequencies of occurrence of each word, we show how our approach improves clustering accuracy relative to more standard approaches for the analysis of count data.
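
To make the zero-inflated Poisson building block mentioned in the abstract concrete, here is a minimal sketch of a single zero-inflated Poisson (ZIP) component: with probability pi a count is a structural zero, otherwise it is drawn from a Poisson with rate lam. It is not the authors' model, which adds the mixture over sample clusters, the mean-constrained priors used for normalization, and the feature selection step. The function names and the parameters `lam` and `pi` are assumptions chosen for illustration only.

```python
import math
import random


def zip_pmf(y, lam, pi):
    """P(Y = y) for a zero-inflated Poisson.

    Assumed parameterization (illustrative, not taken from the paper):
    with probability pi the count is an extra zero, otherwise Poisson(lam).
    """
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    return pi * (y == 0) + (1.0 - pi) * poisson


def zip_mean_var(lam, pi):
    """Closed-form mean and variance of the ZIP distribution."""
    mean = (1.0 - pi) * lam
    var = (1.0 - pi) * lam * (1.0 + pi * lam)
    return mean, var


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's method (fine for small lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def sample_zip(lam, pi, rng):
    """Draw one ZIP variate: a structural zero with probability pi."""
    return 0 if rng.random() < pi else sample_poisson(lam, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    lam, pi = 3.0, 0.3  # hypothetical rate and zero-inflation probability
    draws = [sample_zip(lam, pi, rng) for _ in range(20000)]
    mean, var = zip_mean_var(lam, pi)
    print("theoretical mean and variance:", mean, var)
    print("empirical mean:", sum(draws) / len(draws))
    print("fraction of zeros:", draws.count(0) / len(draws))
```

Whenever pi > 0 and lam > 0, the variance (1 - pi) * lam * (1 + pi * lam) exceeds the mean (1 - pi) * lam, which is one way the excess zeros and the overdispersion described in the abstract arise together.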
