
COMPARATIVE ANALYSIS OF INFORMATIVE FEATURES QUANTITY AND COMPOSITION SELECTION METHODS FOR THE COMPUTER ATTACKS CLASSIFICATION USING THE UNSW-NB15 DATASET
Author(s) -
Oleg I. Sheluhin,
Valentina P. Ivannikova
Publication year - 2020
Publication title -
T-Comm
Language(s) - English
Resource type - Journals
eISSN - 2072-8743
pISSN - 2072-8735
DOI - 10.36724/2072-8735-2020-14-10-53-60
Subject(s) - overfitting, computer science, feature selection, machine learning, artificial intelligence, python (programming language), set (abstract data type), selection (genetic algorithm), feature (linguistics), data mining, model selection, data set, artificial neural network, linguistics, philosophy, programming language, operating system
A comparative analysis of statistical and model-based methods for selecting the quantity and composition of informative features was performed on the UNSW-NB15 dataset, used to train machine learning models for attack detection. Feature selection is one of the most important steps in preparing data for machine learning: it improves model quality by reducing the size of the fitted models, the training time, and the probability of overfitting. The research was conducted with Python libraries: scikit-learn, which provides machine learning models along with functions for data preparation and model evaluation, and FeatureSelector, which contains functions for statistical data analysis. Numerical results of the experimental study are provided for both statistical feature selection methods and methods based on machine learning models. As a result, a reduced feature set is obtained that improves classification quality by removing noise features with little effect on the final result, cutting the number of informative features in the dataset from 41 to 17. It is shown that the most effective of the analyzed feature selection methods is the statistical method SelectKBest with the chi2 scoring function, which yields a reduced feature set providing a classification accuracy of 90%, compared with 74% on the full feature set.
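The selection step highlighted in the abstract can be sketched with scikit-learn's SelectKBest and the chi2 score function. This is a minimal illustration, not the authors' actual pipeline: synthetic non-negative data stands in for the UNSW-NB15 dataset (chi2 requires non-negative features), and only the feature counts 41 and k=17 follow the paper's reported result.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in for UNSW-NB15: 500 samples, 41 non-negative features
# (chi2 scoring is only defined for non-negative feature values).
rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(500, 41)).astype(float)
y = rng.integers(0, 2, size=500)  # binary label: attack / normal traffic

# Keep the 17 features with the highest chi2 scores, as in the paper's result.
selector = SelectKBest(score_func=chi2, k=17)
X_reduced = selector.fit_transform(X, y)

print(X_reduced.shape)                      # (500, 17)
print(selector.get_support(indices=True))   # indices of the retained features
```

In a real workflow the reduced matrix would then be passed to a classifier, and the chosen k could itself be tuned by comparing classification accuracy across candidate values.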