Performance of Classification Algorithms Under Class Imbalance: Simulation and Real-World Evidence
Author(s) -
Iqra Arshad,
Muhammad Umair,
Faheem Jan,
Hasnain Iftikhar,
Paulo Canas Rodrigues,
Elias A. Torres Armas,
Javier Linkolk Lopez-Gonzales
Publication year - 2025
Publication title -
IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3620264
Subject(s) - aerospace , bioengineering , communication, networking and broadcast technologies , components, circuits, devices and systems , computing and processing , engineered materials, dielectrics and plasmas , engineering profession , fields, waves and electromagnetics , general topics for engineers , geoscience , nuclear engineering , photonics and electrooptics , power, energy and industry applications , robotics and control systems , signal processing and analysis , transportation
Class imbalance is a persistent challenge in machine learning, particularly in high-stakes applications such as medical diagnostics, bioinformatics, and fraud detection, where the minority class often represents critical cases. While prior research has examined the effect of imbalance on classifier performance, little attention has been paid to establishing practical guidelines for the minimum proportion of minority samples required to achieve reliable sensitivity. In this study, we conduct extensive simulations using synthetic datasets and evaluate five widely used classification algorithms: Logistic Regression (Logit), Support Vector Machines (SVM), Random Forest, XGBoost, and Neural Networks (NNs). Our analysis reveals that logistic regression is more effective at identifying minority-class instances under an imbalanced class distribution, in terms of both F1 score and sensitivity, whereas Neural Networks perform slightly better than logistic regression under a balanced class distribution. Importantly, we identify a practical threshold for minority-class representation: classifier sensitivity declines sharply when positive samples fall below approximately 25–30%. This finding is validated on eight real-world datasets, including large-scale applications, where Neural Networks and XGBoost demonstrate superior sensitivity. By establishing an actionable threshold, this study contributes practical guidance for dataset design and model selection in imbalanced classification problems.
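The simulation design described above can be illustrated with a minimal sketch: generate synthetic datasets while varying the minority-class proportion, fit a classifier, and record the minority-class sensitivity (recall). This is an assumption-laden illustration, not the authors' exact protocol; the dataset parameters, sample size, and helper function `sensitivity_at` are hypothetical choices for demonstration.

```python
# Illustrative sketch (not the authors' exact setup): vary the minority-class
# proportion in synthetic data and track minority-class sensitivity (recall)
# for logistic regression, one of the five classifiers compared in the study.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split


def sensitivity_at(minority_frac, n=5000, seed=0):
    """Return minority-class recall for a given minority-class proportion."""
    X, y = make_classification(
        n_samples=n, n_features=10, n_informative=5,
        weights=[1.0 - minority_frac, minority_frac],  # class proportions
        random_state=seed,
    )
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    # Sensitivity = recall on the minority (positive) class.
    return recall_score(y_te, clf.predict(X_te), pos_label=1)


# Sweep across minority-class proportions around the reported 25-30% threshold.
for frac in [0.05, 0.15, 0.25, 0.35, 0.50]:
    print(f"minority={frac:.2f}  sensitivity={sensitivity_at(frac):.3f}")
```

Running a sweep of this kind across several classifiers and repeated seeds is one way to locate the proportion at which sensitivity begins to degrade.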