
Sequence count data are poorly fit by the negative binomial distribution
Author(s) -
Stijn Hawinkel,
J. C. W. Rayner,
Luc Bijnens,
Olivier Thas
Publication year - 2020
Publication title -
PLOS ONE
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.99
H-Index - 332
ISSN - 1932-6203
DOI - 10.1371/journal.pone.0224909
Subject(s) - negative binomial distribution, count data, false discovery rate, statistics, goodness of fit, nonparametric statistics, binomial distribution, parametric statistics, mathematics, multinomial distribution, binomial test, statistical hypothesis testing, computer science, biology, poisson distribution, genetics, gene
Sequence count data are commonly modelled using the negative binomial (NB) distribution. Several empirical studies, however, have demonstrated that methods based on the NB assumption do not always succeed in controlling the false discovery rate (FDR) at its nominal level. In this paper, we propose a dedicated statistical goodness-of-fit test for the NB distribution in regression models and demonstrate that the NB assumption is violated in many publicly available RNA-Seq and 16S rRNA microbiome datasets. The zero-inflated NB distribution was not found to give a substantially better fit. We also show that NB-based tests perform worse on the features for which the NB assumption was violated than on the features for which no significant deviation was detected. This offers an explanation for the poor behaviour of NB-based tests in many published evaluation studies. We conclude that nonparametric tests should be preferred over parametric methods.
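To make the idea of checking an NB fit concrete, the following is a minimal sketch and not the test proposed in the paper, whose construction is not described in this record. It assumes a single feature with no covariates (so no regression model), estimates the NB mean and dispersion by the method of moments, and computes a Pearson-type discrepancy between observed and expected count frequencies. The function names, the binning cutoff `max_bin`, and the example counts are all hypothetical, chosen only for illustration.

```python
import math
from statistics import mean, variance


def nb_pmf(k, mu, size):
    """NB probability of count k, parameterised by mean mu and
    dispersion size, so that the variance is mu + mu**2 / size."""
    log_p = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mu))
        + k * math.log(mu / (size + mu))
    )
    return math.exp(log_p)


def fit_nb_moments(counts):
    """Method-of-moments estimates (mu, size); requires overdispersion."""
    m = mean(counts)
    v = variance(counts)  # sample variance (n - 1 denominator)
    if v <= m:
        raise ValueError("sample variance does not exceed the mean")
    return m, m * m / (v - m)


def pearson_statistic(counts, mu, size, max_bin):
    """Sum of (observed - expected)^2 / expected over the bins
    0, 1, ..., max_bin - 1 and a final pooled bin for counts >= max_bin."""
    n = len(counts)
    observed = [0] * (max_bin + 1)
    for c in counts:
        observed[min(c, max_bin)] += 1
    expected = [n * nb_pmf(k, mu, size) for k in range(max_bin)]
    expected.append(n - sum(expected))  # pooled upper tail
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Made-up counts for one feature across ten samples (illustration only).
counts = [0, 0, 1, 1, 2, 3, 3, 5, 8, 13]
mu_hat, size_hat = fit_nb_moments(counts)
stat = pearson_statistic(counts, mu_hat, size_hat, max_bin=5)
print(f"mu={mu_hat:.3f} size={size_hat:.3f} Pearson statistic={stat:.3f}")
```

A large statistic indicates a poor fit, but the sketch deliberately stops short of a p-value: with estimated parameters and small expected counts, the usual chi-square reference distribution is only a rough approximation, which is one reason a dedicated test is of interest.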