Open Access
Security Bug Report Classification via Cross-Project Similarity-Based Data Augmentation and Deep Learning Models
Author(s) -
Murun Ganzorig,
Jinfeng Ji,
Geunseok Yang
Publication year - 2025
Publication title -
IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3614638
Subject(s) - aerospace , bioengineering , communication, networking and broadcast technologies , components, circuits, devices and systems , computing and processing , engineered materials, dielectrics and plasmas , engineering profession , fields, waves and electromagnetics , general topics for engineers , geoscience , nuclear engineering , photonics and electrooptics , power, energy and industry applications , robotics and control systems , signal processing and analysis , transportation
Accurate classification of security bug reports is a persistent challenge in software engineering because individual projects rarely have enough labeled security data. This study investigates whether semantically similar bug reports from external projects can be used to augment training data and improve classification performance in cross-project scenarios. Twelve text similarity techniques are evaluated: four lexical methods (TF-IDF, Jaccard, Levenshtein, and BM25) and eight embedding-based approaches (Word2Vec, GloVe, FastText, Doc2Vec, BERT, SBERT, USE, and BERTScore). These similarity methods are applied to identify relevant bug reports from four open-source projects. The selected reports are combined with the security-labeled reports from each target project to create augmented training datasets. Five deep learning models are trained and evaluated on these datasets: CNN, LSTM, GRU, Transformer, and BERT. The experimental results show that CNN, LSTM, and GRU consistently achieve strong performance, with F1-scores frequently exceeding 0.92 across multiple projects and similarity methods. Both lexical and embedding-based similarity techniques contribute positively to performance, although their effectiveness varies with the model and the project. No single similarity method performs best in all settings, but specific combinations of similarity technique and model architecture lead to significantly improved classification outcomes. These findings highlight the practical value of cross-project data augmentation for security bug report classification. They also underscore the importance of selecting similarity methods and model architectures that match the characteristics of the target project and the available training data.
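
The augmentation step described in the abstract can be illustrated with a minimal sketch that uses Jaccard similarity, one of the four lexical methods listed. The abstract does not state how reports are tokenized, how external reports are scored against the target project, or how many are kept, so the tokenizer, the best-match scoring rule, the `top_k` cutoff, and the helper names (`tokenize`, `jaccard`, `select_similar`) are all assumptions made for illustration. The toy reports are invented and are not data from the study.

```python
import re


def tokenize(text):
    """Lowercase a report and return its set of alphanumeric tokens (assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_similar(target_reports, external_reports, top_k=2):
    """Rank external (text, label) pairs by their best Jaccard match against
    any target report and return the top_k pairs.

    Best-match scoring and a fixed top_k are illustrative assumptions,
    not the selection rule used in the paper.
    """
    target_tokens = [tokenize(text) for text, _ in target_reports]
    scored = []
    for text, label in external_reports:
        tokens = tokenize(text)
        best = max((jaccard(tokens, t) for t in target_tokens), default=0.0)
        scored.append((best, text, label))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(text, label) for _, text, label in scored[:top_k]]


# Toy data, invented for illustration only (label 1 = security, 0 = non-security).
target = [
    ("buffer overflow in image parser allows remote code execution", 1),
    ("sql injection in login form", 1),
]
external = [
    ("heap buffer overflow in png parser", 1),
    ("button misaligned on settings page", 0),
    ("sql injection through search form parameter", 1),
    ("typo in documentation", 0),
]

# Augmented training set: target-project reports plus the most similar external ones.
selected = select_similar(target, external, top_k=2)
augmented = target + selected
```

Swapping `jaccard` for another scoring function (for example, cosine similarity over TF-IDF vectors or sentence embeddings) would leave the selection and merging logic unchanged, which is one way to compare similarity methods under the same pipeline.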
