Dependability analysis for characterizing Google cluster reliability
Author(s) - Mesbahi Mohammad Reza, Rahmani Amir Masoud, Hosseinzadeh Mehdi
Publication year - 2019
Publication title - International Journal of Communication Systems
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.344
H-Index - 49
eISSN - 1099-1131
pISSN - 1074-5351
DOI - 10.1002/dac.4127
Subject(s) - computer science , unavailability , dependability , cloud computing , server , reliability (semiconductor) , cluster (spacecraft) , failure rate , high availability , mean time between failures , operating system , distributed computing , reliability engineering , software engineering , power (physics) , physics , quantum mechanics , engineering
Summary Cloud solutions are emerging as a way to transform traditional IT data centers into highly available and reliable computing resources for hosting critical applications and data. However, software and hardware failures are common in cloud datacenters and can cause serious damage. In this paper, we analyze physical server failures in the Google cloud datacenter. We study the properties of the Google cluster to investigate the relationship between the failure rate of physical servers and job failure events. The failure rates of the jobs executed on the Google cluster and of its servers are considered over a 29‐day period. Based on these observations, we present a reliability model for the Google cluster's physical machines using continuous‐time Markov chains. We analyze the resulting model with the SHARPE software package to improve the understanding of failure events in the Google cloud cluster. We also explore cluster availability in terms of steady‐state availability, steady‐state unavailability, mean time to failure, and mean time to repair.
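The availability measures named in the summary can be illustrated with the simplest possible continuous-time Markov chain: a single machine that alternates between an "up" and a "down" state. The sketch below is not the model from the paper, whose states and rates are not given in this record. The function name and the input values are hypothetical, and exponentially distributed failure and repair times are assumed.

```python
def two_state_availability(mttf_hours: float, mttr_hours: float) -> dict:
    """Steady-state measures for a two-state (up/down) CTMC.

    Assumption: failure and repair times are exponential, so the
    failure rate is 1/MTTF and the repair rate is 1/MTTR.
    """
    if mttf_hours <= 0 or mttr_hours <= 0:
        raise ValueError("MTTF and MTTR must be positive")

    failure_rate = 1.0 / mttf_hours  # transitions up -> down per hour
    repair_rate = 1.0 / mttr_hours   # transitions down -> up per hour

    # Balance equation: pi_up * failure_rate = pi_down * repair_rate,
    # with pi_up + pi_down = 1.
    availability = repair_rate / (failure_rate + repair_rate)

    return {
        "failure_rate": failure_rate,
        "repair_rate": repair_rate,
        "availability": availability,
        "unavailability": 1.0 - availability,
    }


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
result = two_state_availability(mttf_hours=99.0, mttr_hours=1.0)
print(result["availability"], result["unavailability"])
```

In this two-state case the steady-state availability reduces to MTTF / (MTTF + MTTR). A model with more states would need the full set of balance equations solved, which is the kind of task a tool such as SHARPE is used for.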