
A Big Data Framework for Scalable and Cross-Dataset Capable Machine Learning in Network Intrusion Detection Systems
Author(s) -
Vinicius M. De Oliveira,
Henrique M. De Oliveira,
Gabriel M. Santos,
Jhonatan Geremias,
Eduardo K. Viegas
Publication year - 2025
Publication title -
IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3589872
Subject(s) - aerospace; bioengineering; communication, networking and broadcast technologies; components, circuits, devices and systems; computing and processing; engineered materials, dielectrics and plasmas; engineering profession; fields, waves and electromagnetics; general topics for engineers; geoscience; nuclear engineering; photonics and electrooptics; power, energy and industry applications; robotics and control systems; signal processing and analysis; transportation
Network Intrusion Detection Systems (NIDS) are widely used to secure modern networks, but deploying accurate and scalable Machine Learning (ML)-based detection in high-speed environments remains challenging. Traditional approaches often fail to generalize across different network environments, leading to significant performance degradation in cross-dataset evaluations. Additionally, ensuring near real-time inference while ingesting large volumes of network events requires efficient processing pipelines. In this work, we propose a distributed ensemble-based NIDS designed to improve both accuracy and scalability in large-scale network environments. Our approach leverages a Big Data framework to decouple event ingestion from inference, ensuring high-speed processing without sacrificing detection performance. We implement our system using Apache Spark and Apache Kafka, enabling real-time event ingestion, efficient model inference, and periodic model updates through distributed storage. The ensemble classification scheme enhances generalization capabilities by combining multiple classifiers, reducing accuracy loss in cross-dataset scenarios. Experimental evaluations conducted on three benchmark datasets—UNSW-NB15, CS-CIC-IDS, and BoT-IoT—demonstrate that our proposed approach consistently outperforms traditional techniques. Our model achieves an F-Measure improvement of up to 0.46 in cross-dataset evaluations, addressing the generalization limitations of individual classifiers. Additionally, it achieves near real-time inference throughput comparable to traditional classifiers, processing up to 1.07M events per second with three workers, while our distributed training pipeline scales efficiently, reducing model training time by up to 62% in the same setup.
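The abstract does not state how the ensemble combines its members' outputs. As a rough illustration only, the sketch below assumes simple majority voting over per-event labels, which is one common way to combine classifiers. The three rule-based classifiers, the feature names (`pkts_per_s`, `mean_pkt_bytes`, `duration_s`), the thresholds, and the tie-breaking rule are all hypothetical and are not taken from the paper or its datasets.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence

# An event is a flat mapping of (hypothetical) flow features to numeric values.
Event = Dict[str, float]
Classifier = Callable[[Event], str]


def by_rate(event: Event) -> str:
    # Hypothetical rule: very high packet rates are flagged as attacks.
    return "attack" if event["pkts_per_s"] > 1000 else "normal"


def by_size(event: Event) -> str:
    # Hypothetical rule: very small mean packet sizes are flagged as attacks.
    return "attack" if event["mean_pkt_bytes"] < 80 else "normal"


def by_duration(event: Event) -> str:
    # Hypothetical rule: very short flows are flagged as attacks.
    return "attack" if event["duration_s"] < 0.05 else "normal"


def majority_vote(classifiers: Sequence[Classifier], event: Event) -> str:
    """Return the label chosen by most classifiers for a single event.

    Assumption: ties are resolved in favour of "attack", on the view that a
    missed intrusion costs more than a false alarm.
    """
    votes = Counter(clf(event) for clf in classifiers)
    top = max(votes.values())
    winners = [label for label, count in votes.items() if count == top]
    return "attack" if "attack" in winners else winners[0]


def classify_batch(classifiers: Sequence[Classifier],
                   events: Sequence[Event]) -> List[str]:
    """Label a batch of events, as an inference worker might do per micro-batch."""
    return [majority_vote(classifiers, event) for event in events]


ENSEMBLE = [by_rate, by_size, by_duration]
```

A usage example under the same assumptions: calling `classify_batch(ENSEMBLE, events)` on a batch read from a stream returns one label per event. In a deployment like the one the abstract describes, such a function would run inside the inference stage, separate from the ingestion stage that receives events.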