Cost-sensitive learning strategies for high-dimensional and imbalanced data: a comparative study

Barbara Pes, Giuseppina Lai

Open Access

Author(s) - Barbara Pes, Giuseppina Lai

Publication year - 2021

Publication title - PeerJ Computer Science

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.806

H-Index - 24

ISSN - 2376-5992

DOI - 10.7717/peerj-cs.832

Subject(s) - machine learning, computer science, feature selection, artificial intelligence, leverage (statistics), curse of dimensionality, feature (linguistics), heuristics, class (philosophy), univariate, data science, multivariate statistics, philosophy, linguistics, operating system

High dimensionality and class imbalance are widely recognized as important issues in machine learning. A vast body of literature has investigated approaches to the multiple challenges that arise in high-dimensional feature spaces (where each problem instance is described by a large number of features). Likewise, several learning strategies have been devised to cope with the adverse effects of imbalanced class distributions, which may severely impair the generalization ability of the induced models. Nevertheless, although both issues have been studied extensively for several years, they have mostly been addressed separately, and their combined effects are yet to be fully understood. Indeed, little research has so far investigated which approaches are best suited to datasets that are both high-dimensional and class-imbalanced. As a contribution in this direction, our work presents a comparative study of learning strategies that combine feature selection, to cope with high dimensionality, with cost-sensitive learning methods, to cope with class imbalance. Specifically, different ways of incorporating misclassification costs into the learning process have been explored. Different feature selection heuristics, both univariate and multivariate, have also been considered, to comparatively evaluate their effectiveness on imbalanced data. The experiments were conducted on three challenging benchmarks from the genomic domain and provide insight into the beneficial impact of combining feature selection and cost-sensitive learning, especially in the presence of highly skewed data distributions.
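As a rough illustration of the two ingredients the abstract combines, the sketch below pairs a simple univariate feature ranking with one common way of incorporating misclassification costs: choosing the prediction that minimizes expected cost. It is not the authors' method or experimental setup. The scoring rule, the function names (`univariate_scores`, `select_top_k`, `cost_threshold`, `predict_min_cost`) and the cost values are all assumptions chosen for illustration, and class 1 is assumed to be the minority class.

```python
from statistics import mean, pstdev


def univariate_scores(X, y):
    """Score each feature on its own: gap between class means over overall spread.

    Assumption: a simple illustrative heuristic, not a ranking taken from the paper.
    """
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        spread = pstdev(column) or 1.0  # avoid dividing by zero on constant features
        gap = abs(mean([r[j] for r in pos]) - mean([r[j] for r in neg]))
        scores.append(gap / spread)
    return scores


def select_top_k(X, y, k):
    """Return the (sorted) indices of the k highest-scoring features."""
    scores = univariate_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def cost_threshold(cost_fp, cost_fn):
    """Probability threshold that minimizes expected cost.

    Assumes correct predictions cost nothing. Predicting the minority class
    costs (1 - p) * cost_fp in expectation, predicting the majority class
    costs p * cost_fn, so the minority class is cheaper when
    p >= cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


def predict_min_cost(p_minority, cost_fp, cost_fn):
    """Predict 1 (minority) when that choice has the lower expected cost."""
    return 1 if p_minority >= cost_threshold(cost_fp, cost_fn) else 0
```

With hypothetical costs of 1 for a false positive and 9 for a false negative, the decision threshold drops from 0.5 to 0.1, so an instance with an estimated minority-class probability of 0.2 is assigned to the minority class. In a full pipeline, `select_top_k` would first reduce the feature space and a probabilistic classifier trained on the selected features would supply `p_minority`.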
