Mixture‐based clustering for count data using approximated Fisher Scoring and Minorization–Maximization approaches
Author(s) -
Bregu Ornela,
Zamzami Nuha,
Bouguila Nizar
Publication year - 2021
Publication title -
Computational Intelligence
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.353
H-Index - 52
eISSN - 1467-8640
pISSN - 0824-7935
DOI - 10.1111/coin.12429
Subject(s) - cluster analysis, mixture model, count data, dirichlet distribution, multinomial distribution, computer science, hyperparameter, burstiness, overdispersion, mathematics, algorithm, artificial intelligence, statistics, poisson distribution, mathematical analysis, computer network, network packet, boundary value problem
The multinomial distribution has been widely used to model count data. To increase clustering efficiency, we use an approximation to the Fisher scoring algorithm, which is more robust to the choice of initial parameter values. Then, we use a novel approach to estimate the optimal number of components, based on the minimum message length criterion. Moreover, we consider a generalization of the multinomial model obtained by introducing the Dirichlet as a prior, yielding the Dirichlet Compound Multinomial (DCM). Even though the DCM can address the burstiness phenomenon of count data, the presence of the Gamma function in its density function usually leads to undesired complications. In this article, we use two alternative representations of the DCM distribution to perform clustering based on finite mixture models, where the mixture parameters are estimated using the minorization–maximization framework. To evaluate and compare the performance of our proposed models, we consider three challenging real‐world applications that involve high‐dimensional count vectors, namely, sentiment analysis, facial expression recognition, and human action recognition. The results show that the proposed algorithms remarkably increase the clustering efficiency of their respective models, and that the best results are achieved by the second parametrization of the DCM, which can accommodate over‐dispersed count data.
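To make the role of the Gamma function concrete, the following is a minimal sketch of the standard DCM log-probability next to the plain multinomial one. It uses the textbook form of the DCM density, not the two alternative representations the article proposes, and the function names `dcm_log_pmf` and `multinomial_log_pmf` are hypothetical names chosen for illustration.

```python
from math import lgamma, log, exp

def multinomial_log_pmf(counts, probs):
    """Log-probability of a count vector under a multinomial with given probabilities."""
    n = sum(counts)
    # log of the multinomial coefficient n! / prod(x_d!)
    log_coef = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    # skip zero counts so that 0 * log(p) never has to be evaluated
    return log_coef + sum(x * log(p) for x, p in zip(counts, probs) if x > 0)

def dcm_log_pmf(counts, alpha):
    """Log-probability of a count vector under the standard DCM density.

    Assumed (textbook) form:
      n!/prod(x_d!) * Gamma(A)/Gamma(n + A) * prod(Gamma(x_d + a_d)/Gamma(a_d)),
    with A = sum(a_d) and n = sum(x_d). Log-Gamma is used throughout because
    the raw Gamma values overflow quickly for high-dimensional counts.
    """
    n = sum(counts)
    a_total = sum(alpha)
    log_coef = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    log_ratio = lgamma(a_total) - lgamma(n + a_total)
    log_terms = sum(lgamma(x + a) - lgamma(a) for x, a in zip(counts, alpha))
    return log_coef + log_ratio + log_terms

# A "bursty" vector: all four occurrences fall on the first item.
bursty = [4, 0]
p_dcm = exp(dcm_log_pmf(bursty, [1.0, 1.0]))
p_mult = exp(multinomial_log_pmf(bursty, [0.5, 0.5]))
print(p_dcm, p_mult)
```

With the symmetric parameters used here, the DCM assigns the bursty vector a probability of about 0.2, whereas the multinomial with equal probabilities assigns it 0.0625, which illustrates why the compound model is better suited to burstiness.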
