
Scalable Approach to Failure Analysis of High‐Performance Computing Systems
Author(s) - Doaa Shawky
Publication year - 2014
Publication title - ETRI Journal
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.295
H-Index - 46
eISSN - 2233-7326
pISSN - 1225-6463
DOI - 10.4218/etrij.14.0113.1133
Subject(s) - scalability, computer science, workload, root cause, reliability (semiconductor), task (project management), root cause analysis, node (physics), reliability engineering, set (abstract data type), data mining, distributed computing, engineering, database, operating system, power (physics), physics, systems engineering, structural engineering, quantum mechanics, programming language
Failure analysis is necessary to clarify the root cause of a failure, predict the next time a failure may occur, and improve the performance and reliability of a system. However, it is not an easy task to analyze and interpret failure data, especially for complex systems. Usually, these data are represented using many attributes, and sometimes they are inconsistent and ambiguous. In this paper, we present a scalable approach for the analysis and interpretation of failure data of high‐performance computing systems. The approach employs rough sets theory (RST) for this task. The application of RST to a large publicly available set of failure data highlights the main attributes responsible for the root cause of a failure. In addition, it is used to analyze other failure characteristics, such as time between failures, repair times, workload running on a failed node, and failure category. Experimental results show the scalability of the presented approach and its ability to reveal dependencies among different failure characteristics.
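The core RST operation the abstract alludes to can be illustrated with a small sketch: given records described by condition attributes, rough set theory partitions them into indiscernibility classes and bounds a decision class (e.g. a failure category) by lower and upper approximations. The toy records, attribute names, and function names below are hypothetical, for illustration only; they are not the paper's actual dataset or implementation.

```python
# Minimal rough-set sketch (assumed toy data, not the paper's failure logs).
from collections import defaultdict

def indiscernibility(records, attrs):
    """Group record indices into equivalence classes by their values on attrs."""
    classes = defaultdict(set)
    for i, rec in enumerate(records):
        classes[tuple(rec[a] for a in attrs)].add(i)
    return list(classes.values())

def approximations(records, attrs, target):
    """Lower/upper approximations of the index set `target` under the
    indiscernibility relation induced by `attrs`."""
    lower, upper = set(), set()
    for eq in indiscernibility(records, attrs):
        if eq <= target:
            lower |= eq   # class entirely inside target: certain members
        if eq & target:
            upper |= eq   # class overlaps target: possible members
    return lower, upper

# Hypothetical failure records: condition attributes plus a failure cause.
records = [
    {"node_type": "compute", "workload": "high", "cause": "hardware"},
    {"node_type": "compute", "workload": "high", "cause": "software"},
    {"node_type": "io",      "workload": "low",  "cause": "hardware"},
    {"node_type": "compute", "workload": "low",  "cause": "software"},
]
hardware = {i for i, r in enumerate(records) if r["cause"] == "hardware"}
low, up = approximations(records, ["node_type", "workload"], hardware)
print(low, up)  # prints {2} {0, 1, 2}
```

Records 0 and 1 are indiscernible on the chosen attributes yet differ in cause, so they fall in the upper but not the lower approximation; the gap between the two sets is exactly the ambiguity in the data that the abstract mentions.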