
Sample Denoising and Optimization Technique Based on Noise Filtering and Evolutionary Algorithms for Imbalanced Data Classification
Author(s) -
Fhira Nhita,
Asniar,
Isman Kurniawan,
Adiwijaya
Publication year - 2025
Publication title -
IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3573786
Subject(s) - aerospace , bioengineering , communication, networking and broadcast technologies , components, circuits, devices and systems , computing and processing , engineered materials, dielectrics and plasmas , engineering profession , fields, waves and electromagnetics , general topics for engineers , geoscience , nuclear engineering , photonics and electrooptics , power, energy and industry applications , robotics and control systems , signal processing and analysis , transportation
Imbalanced data remains a challenge in classification research and significantly influences classifier performance. The strategy widely used to address this issue is the data-level approach, or sampling method, through over-sampling, under-sampling, or hybrid-sampling techniques. However, data quality problems, such as the presence of noise, disrupt the sampling process and adversely affect classifier performance, particularly in popular over-sampling methods such as the Synthetic Minority Over-sampling Technique (SMOTE). Therefore, data preprocessing both before and after the data balancing process is crucial for improving data quality before classification is conducted. This study proposes a method that improves the data balancing process by integrating two preprocessing steps with the SMOTE sampling technique. Technically, we perform sample denoising with Tomek links before applying SMOTE, followed by sample optimization with an evolutionary algorithm after SMOTE. A genetic algorithm (GA), one of the most popular evolutionary algorithms, is used for sample optimization over both the synthetic samples generated by SMOTE and the original samples from both classes. The selected training set is then used to develop classification models with five classifiers: decision tree, logistic regression, support vector machine, k-nearest neighbors, and naive Bayes. Experimental results and statistical evaluations on 24 real-world imbalanced datasets demonstrate that our proposed method, Tomek-SMOTE-GA (TSGA), is significantly better than baseline and state-of-the-art sampling methods in terms of geometric mean, particularly when using decision tree classifiers.
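To make the first two stages of the pipeline concrete, below is a minimal NumPy sketch of Tomek-link denoising followed by SMOTE over-sampling. This is an illustration of the standard techniques named in the abstract, not the authors' implementation: the function names, the choice of `k`, and the balancing heuristic are assumptions, and the GA sample-selection stage (which the abstract describes only at a high level) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for a reproducible sketch

def nearest_neighbor(X, i):
    """Index of the nearest sample to X[i], excluding itself."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf
    return int(np.argmin(d))

def remove_tomek_links(X, y, majority=0):
    """Sample denoising: a Tomek link is a pair of opposite-class samples
    that are mutual nearest neighbors; drop the majority-class member."""
    drop = set()
    for i in range(len(X)):
        j = nearest_neighbor(X, i)
        if y[i] != y[j] and nearest_neighbor(X, j) == i:
            drop.add(i if y[i] == majority else j)
    keep = [i for i in range(len(X)) if i not in drop]
    return X[keep], y[keep]

def smote(X, y, minority=1, k=3, n_new=None):
    """Over-sampling: create synthetic minority samples by interpolating
    between a minority sample and one of its k nearest minority neighbors."""
    Xm = X[y == minority]
    if n_new is None:  # by default, generate enough samples to balance classes
        n_new = int((y != minority).sum() - (y == minority).sum())
    if n_new <= 0:
        return X, y
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(Xm))
        d = np.linalg.norm(Xm - Xm[i], axis=1)
        nn = np.argsort(d)[1:k + 1]        # k nearest minority neighbors (self excluded)
        j = rng.choice(nn)
        gap = rng.random()                 # random interpolation factor in [0, 1)
        synth.append(Xm[i] + gap * (Xm[j] - Xm[i]))
    X_new = np.vstack([X] + [np.asarray(synth)])
    y_new = np.concatenate([y, np.full(n_new, minority)])
    return X_new, y_new
```

In the proposed TSGA pipeline, the output of these two steps would then be passed to a GA that searches for the subset of (original plus synthetic) training samples maximizing classifier performance, before the five classifiers are trained.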