
COMPARATIVE ANALYSIS OF INFORMATIVE FEATURES QUANTITY AND COMPOSITION SELECTION METHODS FOR THE COMPUTER ATTACKS CLASSIFICATION USING THE UNSW-NB15 DATASET
Author(s) -
Oleg I. Sheluhin,
Valentina P. Ivannikova
Publication year - 2020
Publication title -
T-Comm
Language(s) - English
Resource type - Journals
eISSN - 2072-8743
pISSN - 2072-8735
DOI - 10.36724/2072-8735-2020-14-10-53-60
Subject(s) - overfitting, computer science, feature selection, machine learning, artificial intelligence, python (programming language), set (abstract data type), selection (genetic algorithm), feature (linguistics), data mining, model selection, data set, artificial neural network, linguistics, philosophy, programming language, operating system
A comparative analysis of statistical and model-based methods for selecting the quantity and composition of informative features was performed on the UNSW-NB15 dataset, used to train machine learning models for attack detection. Feature selection is one of the most important steps in preparing data for machine learning: it improves model quality by reducing the size of the fitted models, the training time, and the probability of overfitting. The research was conducted with Python libraries: scikit-learn, which provides machine learning models along with functions for data preparation and model evaluation, and FeatureSelector, which contains functions for statistical data analysis. Numerical results of the experimental study are provided for both statistical feature selection methods and methods based on machine learning models. As a result, a reduced feature set is obtained that improves classification quality by removing noise features with little effect on the final result, cutting the number of informative features in the dataset from 41 to 17. It is shown that the most effective of the analyzed feature selection methods is the statistical method SelectKBest with the chi2 scoring function, which yields a reduced feature set providing a classification accuracy of 90%, compared with 74% on the full feature set.
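The selection step highlighted in the abstract can be sketched with scikit-learn's SelectKBest and the chi2 score function. This is a minimal illustration, not the authors' actual pipeline: synthetic non-negative data stands in for the UNSW-NB15 dataset (chi2 requires non-negative features), and only the feature counts 41 and k=17 follow the paper's reported result.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in for UNSW-NB15: 500 samples, 41 non-negative features
# (chi2 scoring is only defined for non-negative feature values).
rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(500, 41)).astype(float)
y = rng.integers(0, 2, size=500)  # binary label: attack / normal traffic

# Keep the 17 features with the highest chi2 scores, as in the paper's result.
selector = SelectKBest(score_func=chi2, k=17)
X_reduced = selector.fit_transform(X, y)

print(X_reduced.shape)                      # (500, 17)
print(selector.get_support(indices=True))   # indices of the retained features
```

In a real workflow the reduced matrix would then be passed to a classifier, and the chosen k could itself be tuned by comparing classification accuracy across candidate values.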