An empirical comparison of validation methods for software prediction models
Author(s) -
Asad Ali,
Carmine Gravino
Publication year - 2021
Publication title -
Journal of Software: Evolution and Process
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.371
H-Index - 29
eISSN - 2047-7481
pISSN - 2047-7473
DOI - 10.1002/smr.2367
Subject(s) - computer science , cross validation , data mining , model validation , software , data validation , predictive modelling , variance (accounting) , machine learning , programming language , data science , accounting , database , business
Model validation methods (e.g., k-fold cross-validation) use historical data to predict how well an estimation technique (e.g., random forest) performs on current (or future) data. Studies in the contexts of software development effort estimation (SDEE) and software fault prediction (SFP) have used and investigated different model validation methods. However, there are no conclusive indications of which model validation method has the greatest impact on the prediction accuracy and stability of estimation techniques. Some studies have investigated model validation methods using data from either SDEE or SFP, but to the best of our knowledge, no study in the literature has employed different validation methods with both SDEE and SFP data. The aim of this paper is to consider 10 different methods from the families of cross-validation (CV) and bootstrap validation to identify which one yields better prediction accuracy for both types of data. We also evaluate which model validation methods allow the estimation techniques to provide stable performances (i.e., with lower variance). To this aim, we present an empirical study involving six datasets from the domain of SDEE and six datasets from the SFP domain. The results reveal that repeated 10-fold CV with SDEE data and the optimistic bootstrap with SFP data are the model validation methods that provide better prediction accuracy in a greater number of experiments than the other model validation methods. Furthermore, a model validation method can improve the prediction accuracy by up to 60% with SDEE data and by up to 36% with SFP data. The analysis also reveals that repeated 5-fold CV produces more stable performances when the experiments are repeated on the same data.
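To make the two families of validation methods compared in the abstract concrete, the sketch below implements repeated k-fold CV and out-of-bag bootstrap validation over a toy dataset, then reports the mean and variance of the error across splits (the variance being a simple proxy for the "stability" the study measures). This is a minimal stdlib-only illustration, not the paper's experimental setup: the trivial mean-predictor, the synthetic effort values, and all parameter choices (k=10, repetition counts, seeds) are assumptions for demonstration only.

```python
import random
import statistics

def repeated_kfold_indices(n, k=10, repeats=3, seed=0):
    """Yield (train, test) index lists for repeated k-fold CV:
    each repeat reshuffles the data and partitions it into k folds."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]  # k disjoint folds
        for i in range(k):
            test = folds[i]
            train = [j for f in folds[:i] + folds[i + 1:] for j in f]
            yield train, test

def bootstrap_indices(n, repeats=30, seed=0):
    """Yield (train, test) index lists for bootstrap validation:
    train on a resample drawn with replacement, test on the
    out-of-bag points that were never drawn."""
    rng = random.Random(seed)
    for _ in range(repeats):
        train = [rng.randrange(n) for _ in range(n)]
        oob = [i for i in range(n) if i not in set(train)]
        if oob:  # skip the rare resample that covers every point
            yield train, oob

def mae_of_mean_predictor(y, splits):
    """Evaluate a trivial mean-predictor under a validation scheme;
    return (mean MAE, variance of MAE) across splits."""
    errors = []
    for train, test in splits:
        pred = statistics.mean(y[i] for i in train)
        errors.append(statistics.mean(abs(y[i] - pred) for i in test))
    return statistics.mean(errors), statistics.pvariance(errors)

# Hypothetical effort values (person-hours) standing in for an SDEE dataset.
rng = random.Random(1)
y = [rng.gauss(100, 20) for _ in range(60)]

cv_mean, cv_var = mae_of_mean_predictor(y, repeated_kfold_indices(len(y)))
bs_mean, bs_var = mae_of_mean_predictor(y, bootstrap_indices(len(y)))
print(f"repeated 10-fold CV: MAE={cv_mean:.2f}, var={cv_var:.3f}")
print(f"bootstrap (OOB):     MAE={bs_mean:.2f}, var={bs_var:.3f}")
```

Comparing the two variance figures on the same data mirrors, in miniature, the stability analysis the paper carries out across its 12 datasets.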
