
Random Forest and Novel Under-Sampling Strategy for Data Imbalance in Software Defect Prediction
Author(s) - Utomo Pujianto
Publication year - 2018
Publication title - International Journal of Engineering and Technology
Language(s) - English
Resource type - Journals
ISSN - 2227-524X
DOI - 10.14419/ijet.v7i4.15.21368
Subject(s) - random forest, sampling (signal processing), software, centroid, data mining, statistics, computer science, measure (data warehouse), systematic sampling, mathematics, artificial intelligence, computer vision, filter (signal processing), programming language
Data imbalance is one of the characteristics of software quality data sets that can degrade the performance of software defect prediction models. This study proposes an alternative to random under-sampling: it keeps only the subset of non-defective instances that lie farthest from the centroid of the defective data. Combined with random forest classification, the proposed method outperformed both random under-sampling and no sampling in terms of accuracy, AUC, F-measure, and true positive rate.
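
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the abstract names neither the distance metric nor the size of the retained subset, so the code below assumes Euclidean distance and keeps as many non-defective instances as there are defective ones. The function names and the toy metric values are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length feature vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def farthest_from_centroid_undersample(defective, non_defective, keep=None):
    """Keep the `keep` non-defective rows farthest from the defective centroid.

    Assumptions (not stated in the abstract): Euclidean distance, and
    keep = len(defective) by default so the two classes end up balanced.
    """
    if not defective:
        raise ValueError("need at least one defective instance")
    if keep is None:
        keep = len(defective)
    c = centroid(defective)
    # Sort non-defective rows by distance to the defective centroid, farthest first.
    ranked = sorted(non_defective, key=lambda row: math.dist(row, c), reverse=True)
    return ranked[:keep]


# Toy data: two software metrics per module (hypothetical values).
defective = [[9.0, 8.0], [10.0, 9.0]]
non_defective = [[1.0, 1.0], [8.0, 8.0], [2.0, 1.5], [9.0, 7.0], [0.5, 2.0]]

selected = farthest_from_centroid_undersample(defective, non_defective)
print(selected)
```

In a full pipeline, the selected non-defective rows would be combined with all defective rows and passed to a random forest classifier; that training step is omitted here.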