Dependability analysis for characterizing Google cluster reliability




Author(s) - Mesbahi Mohammad Reza, Rahmani Amir Masoud, Hosseinzadeh Mehdi

Publication year - 2019

Publication title - International Journal of Communication Systems

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.344

H-Index - 49

eISSN - 1099-1131

pISSN - 1074-5351

DOI - 10.1002/dac.4127

Subject(s) - computer science, unavailability, dependability, cloud computing, server, reliability (semiconductor), cluster (spacecraft), failure rate, high availability, mean time between failures, operating system, distributed computing, reliability engineering, software engineering, power (physics), physics, quantum mechanics, engineering

Summary - Cloud solutions are emerging as a way to transform traditional IT data centers into highly available and reliable computing resources for hosting critical applications and data. However, software and hardware failures are common in cloud datacenters and can cause serious damage. In this paper, we analyze physical server failures in a Google cloud datacenter. We study the Google cluster's properties to investigate the relationship between the failure rate of physical servers and job failure events, considering the failure rates of executed jobs and servers over a 29-day period. Based on these observations, we present a reliability model for the cluster's physical machines using continuous-time Markov chains. We analyze the resulting model with the SHARPE software package to improve the understanding of failure events in the Google cloud cluster. We also explore cluster availability in terms of steady-state availability, steady-state unavailability, mean time to failure, and mean time to repair.
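
The availability measures named in the summary can be illustrated with the simplest continuous-time Markov chain: a single machine that alternates between an "up" and a "down" state. The sketch below is not the paper's model, which is not reproduced on this page. The class name `TwoStateMachine` and the failure and repair rates are hypothetical values chosen only to show how the four quantities relate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoStateMachine:
    """Up/down CTMC for one server. Rates are per hour and purely illustrative."""

    failure_rate: float  # lambda: rate of up -> down transitions
    repair_rate: float   # mu: rate of down -> up transitions

    @property
    def mttf(self) -> float:
        # Mean time to failure: expected sojourn time in the up state.
        return 1.0 / self.failure_rate

    @property
    def mttr(self) -> float:
        # Mean time to repair: expected sojourn time in the down state.
        return 1.0 / self.repair_rate

    @property
    def availability(self) -> float:
        # Steady-state probability of the up state. Solving pi * Q = 0 with
        # pi_up + pi_down = 1 gives mu / (lambda + mu) = MTTF / (MTTF + MTTR).
        return self.repair_rate / (self.failure_rate + self.repair_rate)

    @property
    def unavailability(self) -> float:
        # Steady-state probability of the down state.
        return self.failure_rate / (self.failure_rate + self.repair_rate)


# Assumed values: one failure per 500 hours, two hours to repair on average.
machine = TwoStateMachine(failure_rate=1 / 500, repair_rate=1 / 2)

if __name__ == "__main__":
    print(f"MTTF = {machine.mttf:.1f} h, MTTR = {machine.mttr:.1f} h")
    print(f"availability = {machine.availability:.6f}")
    print(f"unavailability = {machine.unavailability:.6f}")
```

A model with more states (for example, separate hardware and software failure states) would no longer have this closed form and would need the balance equations solved numerically, which is the kind of task a tool such as SHARPE automates.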
