Heterogeneous Data Fusion: A Scalable Approach to Intrusion Detection
Author(s) -
Seonghyeon Gong,
Jake Cho,
Kyuwon Ken Choi
Publication year - 2025
Publication title -
IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3620722
Subject(s) - aerospace , bioengineering , communication, networking and broadcast technologies , components, circuits, devices and systems , computing and processing , engineered materials, dielectrics and plasmas , engineering profession , fields, waves and electromagnetics , general topics for engineers , geoscience , nuclear engineering , photonics and electrooptics , power, energy and industry applications , robotics and control systems , signal processing and analysis , transportation
Machine Learning-based Intrusion Detection Systems (ML-IDS) are a core capability for responding to today’s cyber-attacks, as they learn, detect, and classify various attack patterns. However, despite achieving high overall accuracy, existing ML-IDS approaches suffer from high false positive and false negative rates for certain attack patterns because of limited generalization performance. This research proposes a novel dataset construction method that enhances the performance of ML-IDS by integrating heterogeneous security data to expand feature representations. Our approach integrates data collected from heterogeneous domains based on timestamps and evaluates the expanded feature space in terms of information gain and entropy difference. The proposed method dynamically adjusts the time window for data fusion based on this evaluation of the feature space, thereby generating an optimal dataset. In this way, our approach leverages multiple security data sources to enhance dataset quality and improve the classification performance of ML-IDS models. Experimental results demonstrate that the proposed dataset fusion mechanism enhances learning and generalization performance: multiple baseline models trained on the reconstructed dataset show improved performance on the CIC-IDS-2018 dataset, particularly in detecting attack patterns with previously high false positive rates. Notably, base models trained on the reconstructed dataset achieved a macro F1-score of 0.9968, surpassing state-of-the-art baselines. These results demonstrate that our approach to improving dataset quality can effectively enhance the performance of existing ML-IDS.
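The fusion idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm, which this record does not specify. It assumes a simplified setting: labeled network flows and a second source of timestamped host events are joined by a symmetric time window, the fused feature is a binary "event nearby" flag, and the window is chosen from a fixed candidate list by maximizing information gain. All names (`fuse`, `select_window`, the toy timestamps) are hypothetical, and the entropy-difference criterion mentioned in the abstract is not modeled here.

```python
import math
from bisect import bisect_left, bisect_right
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature, labels):
    """H(labels) - H(labels | feature) for a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def fuse(flow_times, event_times, window):
    """Flag each flow that has at least one event within +/- window of it."""
    events = sorted(event_times)
    fused = []
    for t in flow_times:
        # Number of events with timestamp in [t - window, t + window].
        count = bisect_right(events, t + window) - bisect_left(events, t - window)
        fused.append(1 if count > 0 else 0)
    return fused


def select_window(flow_times, labels, event_times, candidates):
    """Return (window, gain) for the candidate with the highest information gain."""
    best = None
    for window in candidates:
        gain = information_gain(fuse(flow_times, event_times, window), labels)
        if best is None or gain > best[1]:
            best = (window, gain)
    return best


# Toy data (assumed for illustration): timestamps in seconds, label 1 = attack.
flow_times = [0, 10, 20, 30, 40, 50]
labels = [0, 1, 0, 1, 0, 0]
event_times = [12, 33]

window, gain = select_window(flow_times, labels, event_times, candidates=[1, 3, 15])
print(window, round(gain, 4))
```

In this toy data, a window that is too narrow misses the host events entirely, and one that is too wide attaches them to benign flows as well, so the fused feature is most informative at an intermediate window. This trade-off is the motivation for adjusting the window based on an evaluation of the expanded feature space.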