
COMPARATIVE ANALYSIS OF MACHINE LEARNING MODELS AND REGRESSIONS FOR CAR PRICE PREDICTION
Author(s) -
Elvira Kovpak,
Fedir Orlov
Publication year - 2019
Publication title -
vìsnik harkìvsʹkogo nacìonalʹnogo unìversitetu ìmenì v.n. karazìna. serìâ ekonomìka
Language(s) - English
Resource type - Journals
ISSN - 2311-2379
DOI - 10.26565/2311-2379-2019-97-04
Subject(s) - random forest , gradient boosting , artificial neural network , computer science , machine learning , decision tree , artificial intelligence , multivariate adaptive regression splines , regression analysis , predictive modelling , linear regression , ensemble learning , python (programming language) , data mining , nonparametric regression , operating system
The purpose of the research described in this article is a comparative analysis of the predictive qualities of some models of machine learning and regression. The factors for models are the consumer characteristics of a used car: brand, transmission type, drive type, engine type, mileage, body type, year of manufacture, seller's region in Ukraine, condition of the car, information about accident, average price for analogue in Ukraine, engine volume, quantity of doors, availability of extra equipment, quantity of passenger’s seats, the first registration of a car, car was driven from abroad or not. Qualitative variables has been encoded as binary variables or by mean target encoding. The information about more than 200 thousand cars have been used for modeling. All models have been evaluated in the Python Software using Sklearn, Catboost, StatModels and Keras libraries. The following regression models and machine learning models were considered in the course of the study: linear regression; polynomial regression; decision tree; neural network; models based on "k-nearest neighbors", "random forest", "gradient boosting" algorithms; ensemble of models. The article presents the best in terms of quality (according to the criteria R2, MAE, MAD, MAPE) options from each class of models. It has been found that the best way to predict the price of a passenger car is through non-linear models. The results of the modeling show that the dependence between the price of a car and its characteristics is best described by the ensemble of models, which includes a neural network, models using "random forest" and "gradient boosting" algorithms. The ensemble of models showed an average relative approximation error of 11.2% and an average relative forecast error of 14.34%. All nonlinear models for car price have approximately the same predictive qualities (the difference between the MAPE within 2%) in this research.