Weighted KNN using grey relational analysis for cross-project defect prediction
Author(s) - D. I. Ulumi, Daniel Siahaan
Publication year - 2019
Publication title - Journal of Physics: Conference Series
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.21
H-Index - 85
eISSN - 1742-6596
pISSN - 1742-6588
DOI - 10.1088/1742-6596/1230/1/012062
Subject(s) - computer science , random forest , feature selection , data mining , naive bayes classifier , feature (linguistics) , software , domain (mathematical analysis) , machine learning , artificial intelligence , selection (genetic algorithm) , process (computing) , focus (optics) , support vector machine , mathematics , mathematical analysis , philosophy , linguistics , physics , optics , programming language , operating system
Defect prediction plays an important role in detecting vulnerable components within a software system. Researchers have tried to improve the accuracy of software defect prediction so that it helps developers better manage resources (people, cost, and time), but they focus on building defect prediction models only for a specific domain. To our knowledge, research on cross-project domains has not been carried out before. This research developed a method to predict software defects across project domains, where the domains contain datasets with different numbers of features. To complete the shorter datasets, the method estimates the values of the absent features; weighted KNN is employed to fill in these missing values. The completed datasets are then classified using Naive Bayes and Random Forest. This research also conducted a feature selection process to select features relevant to defect detection, by means of a comparative analysis of feature selection methods. For the experiments, this research used seven NASA MDP public datasets. The results show that for imbalanced data, Naive Bayes combined with information gain (IG) or symmetric uncertainty (SU) feature selection produced the best balance, i.e. 0.4975. For balanced data, Random Forest combined with gain ratio (GR) produced the best balance, i.e. 0.7795. In general, the developed method performed comparably to the previous method, which classifies only a specific domain, i.e. 0.4975, and even outperformed the previous method on dataset PC2, i.e. 0.4033.
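
The pipeline described in the abstract (weighted-KNN imputation of the absent features, followed by feature selection and classification) can be sketched roughly as follows. This is a minimal, assumed reconstruction in Python, not the authors' actual implementation: it uses grey relational analysis as the similarity measure driving the weighted KNN, synthetic data in place of the NASA MDP projects, and scikit-learn's mutual_info_classif and GaussianNB as stand-ins for the paper's IG/SU filters and Naive Bayes classifier. All function names, parameters (e.g. ZETA, k), and data are illustrative.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import GaussianNB

ZETA = 0.5  # distinguishing coefficient commonly used in grey relational analysis


def grey_relational_grade(reference, candidates):
    """Grey relational grade of each candidate row with respect to the reference row.

    Rows are compared only over the columns passed in (the features the
    reference instance actually has).
    """
    diff = np.abs(candidates - reference)                    # point-wise deviations
    d_min, d_max = diff.min(), diff.max()
    coeff = (d_min + ZETA * d_max) / (diff + ZETA * d_max)   # relational coefficients
    return coeff.mean(axis=1)                                # grade = mean coefficient per row


def impute_weighted_knn(target, source, k=5):
    """Fill NaN columns of `target` rows from the k most grey-related rows of
    `source`, weighting neighbour values by their relational grades."""
    filled = target.copy()
    for i, row in enumerate(target):
        missing = np.isnan(row)
        if not missing.any():
            continue
        shared = ~missing
        grades = grey_relational_grade(row[shared], source[:, shared])
        nn = np.argsort(grades)[-k:]                         # k neighbours with highest grade
        w = grades[nn] / grades[nn].sum()                    # normalised weights
        filled[i, missing] = w @ source[np.ix_(nn, np.where(missing)[0])]
    return filled


# Toy usage on synthetic data (a stand-in for one source and one target project).
rng = np.random.default_rng(0)
source_X = rng.random((100, 10))                             # complete source-project data
y = rng.integers(0, 2, size=100)                             # defect labels for the source
target_X = rng.random((30, 10))
target_X[rng.random(target_X.shape) < 0.2] = np.nan          # simulate absent features

target_filled = impute_weighted_knn(target_X, source_X, k=5)

# Downstream: filter-based feature selection plus Naive Bayes, mirroring the
# IG/SU + Naive Bayes combination reported in the abstract.
selector = SelectKBest(mutual_info_classif, k=5).fit(source_X, y)
clf = GaussianNB().fit(selector.transform(source_X), y)
predictions = clf.predict(selector.transform(target_filled))
```

In practice, features would typically be min-max normalised before computing the grey relational coefficients, and the same selection and classification would be repeated with Random Forest and the GR filter for the balanced-data setting; those steps are omitted here for brevity.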
