Open Access
Numero: a statistical framework to define multivariable subgroups in complex population-based datasets
Author(s) - Song Gao, Stefan Mutter, Aaron Casey, Ville-Petteri Mäkinen
Publication year - 2018
Publication title - International Journal of Epidemiology
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 3.406
H-Index - 208
eISSN - 1464-3685
pISSN - 0300-5771
DOI - 10.1093/ije/dyy113
Subject(s) - cluster analysis, computer science, population, data mining, hierarchical clustering, machine learning, data science, medicine, environmental health
Large-scale epidemiological and population data provide opportunities to identify subgroups of people who are at risk of disease or exposed to adverse environments. Clustering algorithms are popular data-driven tools to identify these subgroups; however, relying exclusively on algorithms may not produce the best results if the dataset does not have a clustered structure. For this reason, we propose a framework (the R-library Numero) that combines the self-organizing map algorithm, permutation analysis for statistical evidence and a final expert-driven subgrouping step. We used Numero to define subgroups in two examples without an obvious clustering structure: a biomedical dataset of kidney disease and another dataset of community-level socioeconomic indicators. We benchmarked the Numero subgroupings against popular clustering algorithms (principal components, K-means and hierarchical clustering). The Numero subgroupings were more intuitive and easier to interpret without losing mathematical quality. Therefore, we expect Numero to be useful for exploratory analyses of population-based epidemiological datasets.
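
The combination described in the abstract (a self-organizing map followed by permutation analysis) can be illustrated with a minimal Python sketch. This is not the Numero API and not the authors' exact statistic: the function names (`train_som`, `assign_units`, `regional_variation`, `permutation_pvalue`), the 3x3 map size, the learning schedule, and the choice of "variance of per-unit means" as the test statistic are all assumptions made for illustration only.

```python
import numpy as np

def train_som(data, rows=3, cols=3, epochs=20, seed=0):
    """Fit a tiny self-organizing map (hypothetical helper, not Numero's).

    Returns prototypes with shape (rows * cols, n_features).
    """
    rng = np.random.default_rng(seed)
    n_units = rows * cols
    # Grid coordinates of each map unit, used by the neighborhood function.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialize prototypes from randomly chosen samples.
    protos = data[rng.choice(len(data), n_units, replace=False)].astype(float)
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = 0.5 * (1.0 - frac)  # learning rate shrinks over time
        sigma = max(0.5, (max(rows, cols) / 2.0) * (1.0 - frac))  # neighborhood radius
        for x in data[rng.permutation(len(data))]:
            # Best-matching unit: the prototype closest to the sample.
            bmu = np.argmin(((protos - x) ** 2).sum(axis=1))
            # Gaussian neighborhood around the best-matching unit on the grid.
            d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            h = np.exp(-d2 / (2.0 * sigma ** 2))
            protos += lr * h[:, None] * (x - protos)
    return protos

def assign_units(data, protos):
    """Return the index of the closest prototype for every sample."""
    d2 = ((data[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def regional_variation(values, units, n_units):
    """Assumed test statistic: variance of the per-unit means of a variable."""
    means = [values[units == u].mean() for u in range(n_units) if np.any(units == u)]
    return float(np.var(means))

def permutation_pvalue(values, units, n_units, n_perm=500, seed=0):
    """Compare the observed regional variation against shuffled values."""
    rng = np.random.default_rng(seed)
    observed = regional_variation(values, units, n_units)
    null = np.array([
        regional_variation(rng.permutation(values), units, n_units)
        for _ in range(n_perm)
    ])
    # Add-one correction keeps the estimated P-value strictly positive.
    return (1 + np.sum(null >= observed)) / (1 + n_perm)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = rng.normal(size=(200, 3))  # synthetic data without cluster structure
    protos = train_som(data)
    units = assign_units(data, protos)
    outcome = data[:, 0] + 0.1 * rng.normal(size=200)  # tracks the first input
    print("P-value for outcome:", permutation_pvalue(outcome, units, len(protos)))
```

In this sketch the permutation step only indicates whether a variable varies across the map more than chance would suggest. The final, expert-driven step from the abstract, in which an analyst groups map units into subgroups by inspecting such patterns, is deliberately left out because it is a manual judgement rather than an algorithm.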
