Evaluating Imputation Methods to Improve Data Availability in a Software Estimation Dataset

Open Access

Publication year - 2019

Publication title - International Journal of Recent Technology and Engineering

Language(s) - English

Resource type - Journals

ISSN - 2277-3878

DOI - 10.35940/ijrte.b1025.0982s1119

Subject(s) - imputation (statistics) , random forest , missing data , computer science , cart , usability , regression , gradient boosting , data mining , decision tree , boosting (machine learning) , software , regression analysis , statistics , machine learning , artificial intelligence , mathematics , mechanical engineering , human–computer interaction , engineering , programming language

Partially missing data is a problem prevalent in most datasets used for statistical analysis. In this study, we analyzed the missing values in the ISBSG R1 2018 dataset, available from the International Software Benchmarking Standards Group, and addressed the problem through imputation, a machine learning technique that can increase the availability of data. We compare the performance of four imputation methods: Classification and Regression Trees (CART), Polynomial Regression (PR), Predictive Mean Matching (PMM), and Random Forest (RF). Through imputation, we were able to increase data availability fourfold. We also evaluated these methods against the original dataset without imputation, using an ensemble of Linear Regression, Gradient Boosting, Random Forest, and an artificial neural network (ANN). Imputation using CART increases the availability of the overall dataset, but at the cost of some of the model's predictive capability. Nevertheless, CART remains the option of choice for extending the usability of the data, because it retains rows that traditional methods would remove from the dataset. In our experiments, this approach increased the usability of the original dataset to 63%, with a 2 to 3% decrease in overall predictive performance.
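To make the idea of imputation concrete, the following is a minimal sketch of Predictive Mean Matching, one of the methods named in the abstract. It is not the authors' implementation: it assumes a single, fully observed predictor, uses ordinary least squares without the parameter draw that full PMM implementations perform, and runs on invented toy rows. The column names `size_fp` and `effort_hours` are hypothetical and are not taken from the ISBSG dataset.

```python
import random
from statistics import mean


def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def pmm_impute(rows, target, predictor, k=3, seed=0):
    """Return copies of `rows` with missing `target` values filled by PMM.

    Simplified sketch: the predictor column must have no missing values.
    """
    rng = random.Random(seed)
    # Donors are the rows where the target was actually observed.
    donors = [r for r in rows if r[target] is not None]
    a, b = fit_simple_ols([r[predictor] for r in donors],
                          [r[target] for r in donors])

    def predict(row):
        return a + b * row[predictor]

    filled = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left untouched
        if row[target] is None:
            # Pick the k donors whose predicted value is closest to this
            # row's prediction, then borrow one donor's OBSERVED value.
            nearest = sorted(
                donors, key=lambda d: abs(predict(d) - predict(row))
            )[:k]
            row[target] = rng.choice(nearest)[target]
        filled.append(row)
    return filled


def complete_fraction(rows):
    """Share of rows that have no missing value in any column."""
    complete = sum(all(v is not None for v in r.values()) for r in rows)
    return complete / len(rows)


# Invented toy data; None marks a missing value.
rows = [
    {"size_fp": 100, "effort_hours": 800},
    {"size_fp": 150, "effort_hours": 1300},
    {"size_fp": 200, "effort_hours": None},
    {"size_fp": 250, "effort_hours": 2100},
    {"size_fp": 300, "effort_hours": None},
    {"size_fp": 400, "effort_hours": 3300},
]

filled = pmm_impute(rows, target="effort_hours", predictor="size_fp")
print("complete rows before:", complete_fraction(rows))
print("complete rows after: ", complete_fraction(filled))
```

Because PMM copies values from observed donors, every imputed value is one that really occurs in the data. Dropping incomplete rows instead would discard two of the six toy rows, which is the loss of availability that imputation is meant to avoid.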
