
Enhancing Cloud Job Failure Prediction with a Novel Multilayer Voting-Based Framework
Author(s) -
Ahmed Elkaradwy,
Ayman Elshenawy,
Hany Harb
Publication year - 2025
Publication title -
IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3593808
Subject(s) - aerospace, bioengineering, communication, networking and broadcast technologies, components, circuits, devices and systems, computing and processing, engineered materials, dielectrics and plasmas, engineering profession, fields, waves and electromagnetics, general topics for engineers, geoscience, nuclear engineering, photonics and electrooptics, power, energy and industry applications, robotics and control systems, signal processing and analysis, transportation
In modern cloud data centers, accurately predicting job failures before they occur is essential for system reliability, availability, and efficiency. To address this challenge, researchers have developed machine learning and deep learning techniques that analyze cloud logs to identify patterns linked to such failures. To improve on the prediction accuracy of earlier models, this paper presents the Multilayer Multi-Prediction Framework (MMPF), a novel approach organized into two layers. The first layer detects job failures using a fine-tuned voting mechanism, and the second layer identifies the specific failure type. The first layer integrates five base classifiers: Decision Trees, K-Nearest Neighbors, Extreme Gradient Boosting, Adaptive Boosting, and Artificial Neural Networks. It aggregates their predictions and makes the final decision from a weighted average of the predicted probabilities, which yields a more robust model than conventional methods that rely on a single classifier. Once failed jobs are detected, the second layer determines the failure type using the Random Forest algorithm. In experiments on the Google Cluster 2019 trace dataset, the framework achieved 99.83% accuracy in detecting failed jobs and 99.97% accuracy in identifying the failure type.
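The weighted-probability voting and the two-layer hand-off described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the probabilities, the weights, the 0.5 decision threshold, and the stand-in for the second-layer Random Forest are all assumptions chosen for illustration.

```python
from typing import Callable, List, Sequence


def weighted_soft_vote(probas: Sequence[Sequence[float]],
                       weights: Sequence[float]) -> List[float]:
    """Return the weighted average of per-classifier class probabilities.

    probas[i] is classifier i's probability vector, here [p_ok, p_failed].
    """
    if len(probas) != len(weights):
        raise ValueError("need exactly one weight per classifier")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_classes = len(probas[0])
    return [
        sum(w * p[c] for p, w in zip(probas, weights)) / total
        for c in range(n_classes)
    ]


def two_layer_predict(probas: Sequence[Sequence[float]],
                      weights: Sequence[float],
                      failure_type_model: Callable[[Sequence[float]], str],
                      features: Sequence[float]) -> str:
    """Layer 1 flags a failure; layer 2 runs only on flagged jobs."""
    avg = weighted_soft_vote(probas, weights)
    if avg[1] <= 0.5:  # assumed threshold, not taken from the paper
        return "no_failure"
    return failure_type_model(features)


# Hypothetical outputs of the five base classifiers for one job,
# in the order DT, KNN, XGBoost, AdaBoost, ANN.
example_probas = [[0.2, 0.8], [0.6, 0.4], [0.1, 0.9], [0.3, 0.7], [0.4, 0.6]]
example_weights = [1.0, 1.0, 2.0, 1.0, 1.0]  # assumed weights

averaged = weighted_soft_vote(example_probas, example_weights)

# Placeholder for the second-layer classifier (hypothetical label).
result = two_layer_predict(example_probas, example_weights,
                           lambda features: "failure_type_A", [0.0])
```

With these made-up numbers the second classifier (KNN) votes against failure, yet the weighted average still favors the failed class, so the job is passed on to the second layer. In practice the weights would be tuned on validation data rather than set by hand.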