
Scalable Approach to Failure Analysis of High‐Performance Computing Systems
Author(s) - Doaa Shawky
Publication year - 2014
Publication title - ETRI Journal
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.295
H-Index - 46
eISSN - 2233-7326
pISSN - 1225-6463
DOI - 10.4218/etrij.14.0113.1133
Subject(s) - scalability, computer science, workload, root cause, reliability (semiconductor), task (project management), root cause analysis, node (physics), reliability engineering, set (abstract data type), data mining, distributed computing, engineering, database, operating system, power (physics), physics, systems engineering, structural engineering, quantum mechanics, programming language
Failure analysis is necessary to clarify the root cause of a failure, predict the next time a failure may occur, and improve the performance and reliability of a system. However, it is not an easy task to analyze and interpret failure data, especially for complex systems. Usually, these data are represented using many attributes, and sometimes they are inconsistent and ambiguous. In this paper, we present a scalable approach for the analysis and interpretation of failure data of high‐performance computing systems. The approach employs rough sets theory (RST) for this task. The application of RST to a large publicly available set of failure data highlights the main attributes responsible for the root cause of a failure. In addition, it is used to analyze other failure characteristics, such as time between failures, repair times, workload running on a failed node, and failure category. Experimental results show the scalability of the presented approach and its ability to reveal dependencies among different failure characteristics.
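The core RST operation the abstract alludes to can be illustrated with a small sketch: given records described by condition attributes, rough set theory partitions them into indiscernibility classes and bounds a decision class (e.g. a failure category) by lower and upper approximations. The toy records, attribute names, and function names below are hypothetical, for illustration only; they are not the paper's actual dataset or implementation.

```python
# Minimal rough-set sketch (assumed toy data, not the paper's failure logs).
from collections import defaultdict

def indiscernibility(records, attrs):
    """Group record indices into equivalence classes by their values on attrs."""
    classes = defaultdict(set)
    for i, rec in enumerate(records):
        classes[tuple(rec[a] for a in attrs)].add(i)
    return list(classes.values())

def approximations(records, attrs, target):
    """Lower/upper approximations of the index set `target` under the
    indiscernibility relation induced by `attrs`."""
    lower, upper = set(), set()
    for eq in indiscernibility(records, attrs):
        if eq <= target:
            lower |= eq   # class entirely inside target: certain members
        if eq & target:
            upper |= eq   # class overlaps target: possible members
    return lower, upper

# Hypothetical failure records: condition attributes plus a failure cause.
records = [
    {"node_type": "compute", "workload": "high", "cause": "hardware"},
    {"node_type": "compute", "workload": "high", "cause": "software"},
    {"node_type": "io",      "workload": "low",  "cause": "hardware"},
    {"node_type": "compute", "workload": "low",  "cause": "software"},
]
hardware = {i for i, r in enumerate(records) if r["cause"] == "hardware"}
low, up = approximations(records, ["node_type", "workload"], hardware)
print(low, up)  # prints {2} {0, 1, 2}
```

Records 0 and 1 are indiscernible on the chosen attributes yet differ in cause, so they fall in the upper but not the lower approximation; the gap between the two sets is exactly the ambiguity in the data that the abstract mentions.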