Open Access
DATA ANALYSIS PLATFORM FOR STREAM AND BATCH DATA PROCESSING ON HYBRID COMPUTING RESOURCES
Author(s) -
Sergey Belov,
Ivan Kadochnikov,
V. Korenkov,
Andrey Reshetnikov,
R. Semenov,
P. Zrelov
Publication year - 2021
Publication title -
9th international conference "distributed computing and grid technologies in science and education"
Language(s) - English
Resource type - Conference proceedings
DOI - 10.54546/mlit.2021.31.67.001
Subject(s) - computer science, big data, spark (programming language), provisioning, stream processing, distributed computing, data stream mining, software, database, operating system, data mining, programming language
The modern Big Data ecosystem provides tools for building a flexible platform that processes both data streams and batch datasets. Supporting the operation of large modern particle physics experiments, together with the services that many individual physics researchers rely on, generates and transfers large amounts of semi-structured data. It is therefore promising to apply cutting-edge technologies to study these data flows and make service provisioning more effective. In this work, we describe the structure and implementation of our data analysis platform, built on an Apache Spark cluster. With official GPU computing support now available in Spark version 3, we propose an architectural change that utilizes these more performant resources while preserving the platform functionality provided by mainstream Big Data software. The need for GPU support in turn entails a change in the computing resource management infrastructure from Apache Mesos to Kubernetes. Finally, to demonstrate the features and operation of the system, we use the task of network packet analysis for security monitoring and anomaly detection in both batch and stream modes.
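As a rough illustration of the move to GPU-aware scheduling on Kubernetes described in the abstract, the following minimal sketch assembles the Spark 3 resource properties that such a deployment might pass to `spark-submit`. It is not the authors' configuration: the API server URL, container image name, discovery script path, and the helper names `gpu_spark_conf` and `to_submit_args` are all hypothetical placeholders, and only the `spark.*` property keys are taken from the Spark 3 configuration documentation.

```python
# A minimal sketch, assuming a Spark 3 cluster scheduled by Kubernetes.
# All values marked "hypothetical" are placeholders, not the platform's settings.

def gpu_spark_conf(gpus_per_executor=1, tasks_per_gpu=4):
    """Return Spark 3 properties that request GPUs for executors on Kubernetes."""
    return {
        # Kubernetes API server used as the Spark master (hypothetical URL).
        "spark.master": "k8s://https://kubernetes.example:6443",
        # Executor container image with GPU drivers available (hypothetical name).
        "spark.kubernetes.container.image": "example/spark-gpu:3",
        # Number of GPUs each executor asks the resource manager for.
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        # Vendor domain used to build the Kubernetes device-plugin resource name.
        "spark.executor.resource.gpu.vendor": "nvidia.com",
        # Script the executor runs to report its GPU addresses (hypothetical path).
        "spark.executor.resource.gpu.discoveryScript": "/opt/spark/scripts/getGpusResources.sh",
        # Fraction of a GPU per task, so several tasks can share one device.
        "spark.task.resource.gpu.amount": str(1 / tasks_per_gpu),
    }


def to_submit_args(conf):
    """Flatten a property dict into repeated `--conf key=value` arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    print(" ".join(["spark-submit"] + to_submit_args(gpu_spark_conf())))
```

Keeping the properties in a plain dictionary makes it easy to compare a CPU-only and a GPU-enabled submission side by side; the same keys could equally be set in `spark-defaults.conf`.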