
A sketch for the KS test for Big Data
Author(s) - Thalis D. Galeno, João Gama, Douglas O. Cardoso
Publication year - 2021
Language(s) - English
Resource type - Conference proceedings
DOI - 10.5753/kdmile.2021.17455
Subject(s) - univariate, test statistic, computer science, sketch, statistic, kolmogorov–smirnov test, goodness of fit, sample (material), parametric statistics, statistical hypothesis testing, statistics, algorithm, sample space, test (biology), data mining, mathematics, artificial intelligence, machine learning, multivariate statistics, chemistry, chromatography, paleontology, biology
Motivated by the challenges of Big Data, this paper presents an approximate algorithm for computing the Kolmogorov-Smirnov (KS) test. This goodness-of-fit statistical test is widely used because it is non-parametric. The work focuses on the one-sample test, which evaluates the hypothesis that a given univariate sample follows some reference distribution. The method makes it possible to measure how far an input stream departs from such a distribution while remaining space and time efficient. We assess the accuracy of our algorithm through experiments in several scenarios, varying the reference distribution and its parameters, the sample size, and the available memory. Its performance was compared with that of rival methods, some of which are considered the state of the art. The results show that our algorithm is superior in most cases in terms of the absolute error of the test statistic.
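For orientation, the quantity being approximated is the one-sample KS statistic, the largest absolute gap between the empirical distribution function of the sample and the reference CDF. The following is a minimal sketch of the exact, full-memory computation, which is the baseline against which an absolute error of the test statistic would be measured. It is not the sketch algorithm proposed in the paper; the function names `normal_cdf` and `ks_statistic` are hypothetical and chosen only for illustration.

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution, used here as an example reference distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_statistic(sample, cdf):
    """Exact one-sample KS statistic: sup over x of |F_n(x) - F(x)|.

    Stores and sorts the whole sample, so it needs O(n) memory and
    O(n log n) time. This is the cost a streaming sketch aims to avoid.
    """
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must be non-empty")
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# Usage: compare a small sample against the standard normal distribution.
example = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.7]
d_exact = ks_statistic(example, normal_cdf)
```

Under this framing, an approximate method that returns some estimate `d_approx` for the same data would be scored by `abs(d_approx - d_exact)`.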