Open Access

Comparative analysis of methods for prediction continuous numerical features on big datasets

Author(s) - Eduard Kinshakov, Yuliia Parfenenko, Vira Shendryk

Publication year - 2021

Publication title - Technology Audit and Production Reserves

Language(s) - English

Resource type - Journals

eISSN - 2706-5448

pISSN - 2664-9969

DOI - 10.15587/2706-5448.2021.244003

Subject(s) - computer science, big data, random forest, python (programming language), data mining, reliability (semiconductor), decision tree, visualization, linear regression, machine learning, artificial intelligence, power (physics), physics, quantum mechanics, operating system

The object of research is the process of choosing a method for predicting continuous numerical features on big datasets. The study is important because many subject areas now require predicting performance indicators from data that are collected from different sources and presented in different formats, which is a big data analysis task. To solve the problem, three statistical analysis methods were considered: multiple linear regression, decision trees, and random forest. A large dataset was built without reference to a specific subject area, then preprocessed and analyzed to establish the correlation between the features. The big dataset was processed with parallel computing by means of the Dask library for Python. Although working with big data normally requires significant computing resources, this approach makes it possible to do without powerful computing hardware. Prediction models were built using multiple linear regression, decision trees, and random forest; the prediction results were visualized and the reliability of the constructed models was analyzed. Based on the calculated prediction error, the random forest method was found to give the highest prediction accuracy among the considered methods. With this method, the prediction accuracy for a dataset of numerical features was approximately 97 %, which indicates high reliability of the constructed model. Thus, it is possible to conclude that the random forest method is suitable for prediction problems on large datasets: it can be used for datasets with a large number of features and is not sensitive to data scaling. The developed Python software application can be used to predict numerical features from different subject areas; the prediction results are exported to a text file.
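The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' application: it assumes scikit-learn and NumPy, uses a small synthetic dataset in place of the study's big dataset, omits the Dask-based parallel processing, and scores models with the coefficient of determination (R^2) as a stand-in for the unspecified accuracy measure. The function name `compare_regressors`, the target formula, and all parameter values are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def compare_regressors(n_samples=2000, n_features=5, seed=0):
    """Fit three regressors on synthetic data and return test-set R^2 scores."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, n_features))
    # Hypothetical target (an assumption, not the paper's data):
    # a linear part, one nonlinear interaction, and small Gaussian noise.
    y = (
        3.0 * X[:, 0]
        - 2.0 * X[:, 1]
        + X[:, 2] * X[:, 3]
        + rng.normal(scale=0.1, size=n_samples)
    )

    # Hold out 25 % of the rows to estimate prediction quality.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )

    # The three methods named in the abstract, with illustrative settings.
    models = {
        "multiple_linear_regression": LinearRegression(),
        "decision_tree": DecisionTreeRegressor(random_state=seed),
        "random_forest": RandomForestRegressor(n_estimators=50, random_state=seed),
    }

    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = r2_score(y_test, model.predict(X_test))
    return scores


if __name__ == "__main__":
    for name, score in compare_regressors().items():
        print(f"{name}: R^2 = {score:.3f}")
```

The ranking of the three methods on this synthetic target says nothing about the ranking reported in the study, which depends on the authors' dataset and error measure.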
