The impact of feature types, classifiers, and data balancing techniques on software vulnerability prediction models
Author(s) - Kaya Aydin, Keceli Ali Seydi, Catal Cagatay, Tekinerdogan Bedir
Publication year - 2019
Publication title - Journal of Software: Evolution and Process
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.371
H-Index - 29
eISSN - 2047-7481
pISSN - 2047-7473
DOI - 10.1002/smr.2164
Subject(s) - computer science, data mining, software, machine learning, random forest, vulnerability (computing), software bug, software security assurance, categorization, feature (linguistics), process (computing), artificial intelligence, information security, computer security, linguistics, philosophy, security service, programming language, operating system
Software vulnerabilities pose an increasing security risk to software systems, as they can be exploited to attack and harm the system. Some security vulnerabilities can be detected by static analysis tools and penetration testing, but these approaches usually suffer from relatively high false positive rates. Software vulnerability prediction (SVP) models can be used to categorize software components into vulnerable and neutral components before the software testing phase and thereby increase the efficiency and effectiveness of the overall verification process. The performance of a vulnerability prediction model is usually affected by the adopted classification algorithm, the selected features, and the data balancing approach. In this study, we empirically investigate the effect of these factors on the performance of SVP models. Our experiments cover four data balancing methods, seven classification algorithms, and three feature types. The experimental results show that data balancing methods are effective for highly unbalanced datasets, text-based features are more useful, and ensemble-based classifiers mostly provide better results. For smaller datasets, the Random Forest algorithm provides the best performance, whereas for larger datasets, RusboostTree achieves better performance.
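
To illustrate the kind of SVP pipeline the study compares, the following minimal Python sketch combines text-based features extracted from source code, a data balancing step, and an ensemble classifier. The toy code snippets, labels, library choices (scikit-learn and imbalanced-learn), and parameter values are illustrative assumptions rather than the authors' datasets or exact experimental setup.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# Hypothetical, highly unbalanced corpus: far more neutral components than
# vulnerable ones (label 1 = vulnerable, 0 = neutral).
neutral = ["int add(int a, int b) { return a + b; }"] * 30
vulnerable = ["void copy(char *in) { char buf[8]; strcpy(buf, in); }"] * 5
X_text = neutral + vulnerable
y = [0] * len(neutral) + [1] * len(vulnerable)

# Pipeline: text-based features (tf-idf over code tokens), a balancing step
# that runs only during fitting, and a Random Forest classifier, which the
# study found strongest on smaller datasets.
svp_model = Pipeline([
    ("features", TfidfVectorizer(token_pattern=r"[A-Za-z_][A-Za-z0-9_]*")),
    ("balance", RandomUnderSampler(random_state=0)),
    ("classify", RandomForestClassifier(n_estimators=100, random_state=0)),
])

svp_model.fit(X_text, y)
print(svp_model.predict(["void f(char *s) { char b[4]; strcpy(b, s); }"]))

Because the sampler is applied only at fit time, prediction and evaluation still see the original, unbalanced class distribution; swapping in an oversampling method or a boosting-based ensemble would mirror the kind of comparison the paper reports.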
