
Comparison of some correlation measures for continuous and categorical data
Author(s) -
Ewa Skotarczak,
A. Dobek,
Krzysztof Moliński
Publication year - 2019
Publication title -
Biometrical Letters
Language(s) - English
Resource type - Journals
eISSN - 2199-577X
pISSN - 1896-3811
DOI - 10.2478/bile-2019-0015
Subject(s) - estimator , categorical variable , mathematics , statistics , dependency (uml) , pearson product moment correlation coefficient , correlation coefficient , linear regression , discretization , ordinal regression , correlation , ordinal data , computer science , artificial intelligence , mathematical analysis , geometry
Summary The literature offers a wide collection of correlation and association coefficients for different data structures. By convention, some of these coefficients are used for continuous data and others for categorical or ordinal observations. The aim of this paper is to verify the performance of various approaches to correlation coefficient estimation for several types of observations. Both simulated and real data were analysed. For continuous variables, Pearson’s r² and the maximal information coefficient (MIC) were determined, whereas for categorized data three approaches were compared: Cramér’s V, Joe’s estimator, and the regression-based estimator. Two methods of discretization for continuous data were used. The following conclusions were drawn: the regression-based approach yielded the best results for data with the highest assumed r² coefficient, whereas Joe’s estimator was the better approximation of the true correlation when the assumed r² was small; and the MIC estimator detected the maximal level of dependency for data having a quadratic relation. Moreover, the discretization method applied to data with a non-linear dependency can cause loss of dependency information. The calculations were supported by the R packages arules and minerva.
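To make the contrast between a linear measure and a categorical association measure concrete, the sketch below computes Pearson’s r² on continuous data and Cramér’s V on a discretized copy of the same data. It is not the authors’ code: the paper reports using the R packages arules and minerva, whereas this is a standard-library Python illustration. The equal-width binning, the number of bins, the function names, and the toy quadratic data are all assumptions chosen for the example; the summary does not state which two discretization methods were used. Joe’s estimator, the regression-based estimator, and MIC are not reproduced here.

```python
import math
from collections import Counter


def pearson_r2(x, y):
    """Squared Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def equal_width_bins(values, k):
    """Assign each value to one of k equal-width intervals (assumed method)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def cramers_v(a, b):
    """Cramér's V from the chi-square statistic of the a-by-b contingency table.

    Assumes each variable takes at least two distinct categories.
    """
    n = len(a)
    joint = Counter(zip(a, b))
    row_totals, col_totals = Counter(a), Counter(b)
    chi2 = 0.0
    for i, row_n in row_totals.items():
        for j, col_n in col_totals.items():
            expected = row_n * col_n / n
            observed = joint.get((i, j), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals))
    return math.sqrt(chi2 / (n * (k - 1)))


# Toy data (hypothetical): a noise-free quadratic relation on a symmetric grid.
x = [i / 50 - 1 for i in range(101)]   # -1.0, -0.98, ..., 1.0
y = [v ** 2 for v in x]

r2 = pearson_r2(x, y)                                        # close to 0
v = cramers_v(equal_width_bins(x, 4), equal_width_bins(y, 4))  # clearly above 0
print(f"Pearson r^2 = {r2:.4f}, Cramér's V (4 bins) = {v:.4f}")
```

On this toy example the linear measure is essentially zero although y is a deterministic function of x, while the categorical measure on the binned data still registers an association. The value of V depends on the number and placement of bins, which is one way to see how the choice of discretization can affect the dependency information that is retained.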