
SiGMoiD: A super-statistical generative model for binary data
Author(s) -
Xiaochuan Zhao,
Germán Plata,
Purushottam D. Dixit
Publication year - 2021
Publication title -
PLOS Computational Biology
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 2.628
H-Index - 182
eISSN - 1553-7358
pISSN - 1553-734X
DOI - 10.1371/journal.pcbi.1009275
Subject(s) - sigmoid function, binary number, computer science, probabilistic logic, statistical model, generative model, principle of maximum entropy, binary data, data point, sample size determination, identification (biology), sample (material), entropy (arrow of time), algorithm, mathematics, artificial intelligence, generative grammar, statistics, botany, chemistry, arithmetic, physics, chromatography, quantum mechanics, artificial neural network, biology
In modern computational biology, there is great interest in building probabilistic models to describe collections of a large number of co-varying binary variables. However, current approaches to building generative models rely on the modeler’s identification of constraints and are computationally expensive to infer when the number of variables is large (N ~ 100). Here, we address both of these issues with the Super-statistical Generative Model for binary Data (SiGMoiD). SiGMoiD is a maximum entropy-based framework in which we imagine the data as arising from a super-statistical system: individual binary variables in a given sample are coupled to the same ‘bath’, whose intensive variables vary from sample to sample. Importantly, unlike standard maximum entropy approaches, where the modeler specifies the constraints, the SiGMoiD algorithm infers them directly from the data. Due to this optimal choice of constraints, SiGMoiD allows us to model collections of a very large number (N > 1000) of binary variables. Finally, SiGMoiD offers a reduced-dimensional description of the data, allowing us to identify clusters of similar data points as well as of binary variables. We illustrate the versatility of SiGMoiD using multiple datasets spanning several time- and length-scales.
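
The abstract does not state the model's functional form, so the following is only a minimal sketch of the idea under stated assumptions, not the authors' implementation. It assumes each sample s carries K latent bath variables beta[s] (playing the role of intensive variables), each binary variable i carries K latent "energies" energy[i], and the probability that variable i is on in sample s is a sigmoid of their negative dot product. The names beta, energy, K, sample_binary, log_likelihood, and fit are all hypothetical, and plain gradient ascent is swapped in as the simplest possible inference scheme.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sample_binary(beta, energy, rng):
    """Draw X[s, i] ~ Bernoulli(p[s, i]) with p = sigmoid(-beta[s] . energy[i]).

    ASSUMPTION: this sigmoid form is illustrative; it is not given in the abstract.
    """
    p = sigmoid(-beta @ energy.T)
    X = (rng.random(p.shape) < p).astype(int)
    return X, p


def log_likelihood(X, beta, energy):
    """Bernoulli log-likelihood of binary matrix X (samples x variables)."""
    z = -beta @ energy.T
    # log p = -log(1 + exp(-z)) and log(1 - p) = -log(1 + exp(z)),
    # written with logaddexp for numerical stability.
    return float(np.sum(-X * np.logaddexp(0.0, -z) - (1 - X) * np.logaddexp(0.0, z)))


def fit(X, K, steps=300, lr=0.2, seed=0):
    """Jointly infer per-sample beta and per-variable energy by gradient ascent."""
    rng = np.random.default_rng(seed)
    S, N = X.shape
    beta = 0.1 * rng.standard_normal((S, K))
    energy = 0.1 * rng.standard_normal((N, K))
    for _ in range(steps):
        resid = X - sigmoid(-beta @ energy.T)  # derivative of log-likelihood w.r.t. z
        # z = -beta . energy, hence the minus signs; gradients are averaged
        # over the summed dimension to keep the step size scale-free.
        grad_beta = -(resid @ energy) / N
        grad_energy = -(resid.T @ beta) / S
        beta += lr * grad_beta
        energy += lr * grad_energy
    return beta, energy


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    true_beta = rng.standard_normal((200, 2))
    true_energy = rng.standard_normal((30, 2))
    X, _ = sample_binary(true_beta, true_energy, rng)
    beta_hat, energy_hat = fit(X, K=2)
    print("log-likelihood after fitting:", log_likelihood(X, beta_hat, energy_hat))
```

In this sketch the rows of beta_hat give a K-dimensional description of each sample and the rows of energy_hat give one of each binary variable, which is the kind of reduced-dimensional embedding one could pass to a clustering routine. Because both sets of latent variables are learned from the data, no constraints are specified by hand, which mirrors the motivation stated in the abstract.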