On the behaviour of permutation‐based variable importance measures in random forest clustering
Author(s) - Nembrini Stefano
Publication year - 2019
Publication title - Journal of Chemometrics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.47
H-Index - 92
eISSN - 1099-128X
pISSN - 0886-9383
DOI - 10.1002/cem.3135
Subject(s) - random forest , permutation (music) , cluster analysis , variable (mathematics) , random permutation , computer science , artificial intelligence , class (philosophy) , bayesian probability , machine learning , random variable , pattern recognition (psychology) , relevance (law) , data mining , mathematics , statistics , mathematical analysis , physics , acoustics , geometry , political science , law , block (permutation group theory)
Unsupervised random forest (RF) is a popular clustering method that can be implemented by artificially creating a two-class problem. Variable importance measures (VIMs) can be used to determine which variables are relevant for defining the RF dissimilarity, but they have received less attention in this setting than in the supervised case. Here, I show that the sampling schemes used to generate the artificial data, including the original one, can influence the behaviour of the permutation importance in a way that can affect conclusions on variable relevance, and I propose a solution. Generating the artificial data using a Bayesian bootstrap keeps the desirable properties of the permutation VIM.
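The artificial two-class construction mentioned in the abstract can be illustrated with a minimal Python sketch. It rests on assumptions the abstract does not spell out: that the original scheme resamples each variable independently from its empirical marginal, and that the Bayesian bootstrap variant does the same with Dirichlet(1, ..., 1) weights over the observations. The function names are hypothetical, chosen only for this illustration, and the sketch is not the paper's implementation.

```python
import random


def marginal_sample(column, rng):
    """Assumed original-style scheme: resample one variable, with
    replacement and equal weights, from its empirical marginal."""
    return rng.choices(column, k=len(column))


def bayesian_bootstrap_sample(column, rng):
    """Assumed Bayesian bootstrap scheme: draw Dirichlet(1, ..., 1)
    weights over the observations, then resample with those weights."""
    # Normalised Exp(1) draws are distributed as Dirichlet(1, ..., 1).
    gaps = [rng.expovariate(1.0) for _ in column]
    total = sum(gaps)
    weights = [g / total for g in gaps]
    return rng.choices(column, weights=weights, k=len(column))


def make_two_class_problem(rows, sampler, seed=0):
    """Stack the observed rows (label 1) on top of synthetic rows (label 0).

    Each variable is resampled independently, which breaks the dependence
    between variables in the synthetic class.
    """
    rng = random.Random(seed)
    columns = list(zip(*rows))
    synthetic_columns = [sampler(list(col), rng) for col in columns]
    synthetic_rows = [list(r) for r in zip(*synthetic_columns)]
    X = [list(r) for r in rows] + synthetic_rows
    y = [1] * len(rows) + [0] * len(synthetic_rows)
    return X, y


rows = [[1, 10], [2, 20], [3, 30], [4, 40]]
X, y = make_two_class_problem(rows, bayesian_bootstrap_sample, seed=1)
```

A random forest classifier trained on `X` and `y` would then supply both the RF dissimilarity and the permutation importance of each variable; that step is left out here because it depends on the chosen library.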