A lightweight software fault‐tolerance system in the cloud environment
Author(s) - Chen Gang, Jin Hai, Zou Deqing, Zhou Bing Bing, Qiang Weizhong
Publication year - 2013
Publication title - Concurrency and Computation: Practice and Experience
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.309
H-Index - 67
eISSN - 1532-0634
pISSN - 1532-0626
DOI - 10.1002/cpe.3190
Subject(s) - computer science, cloud computing, operating system, virtual machine, virtualization, fault tolerance, software, redundancy (engineering), embedded system, cache, overhead (engineering), high availability, software fault tolerance, hypervisor, distributed computing
Summary - With the development of cloud computing, the demand for highly available services is growing. Unfortunately, software failures greatly reduce system availability. This paper presents a lightweight software fault‐tolerance system, called SHelp, which can effectively recover programs from many types of software bugs in the cloud environment. Building on error virtualization techniques, it proposes a ‘weighted’ rescue point technique that survives software failures by bypassing the faulty path. For multiple application instances running on different virtual machines, a three‐level storage hierarchy with several comprehensive cache updating algorithms for rescue point management is adopted to share error‐handling information. As a result, SHelp reduces redundancy across multiple application instances and recovers more effectively and quickly from faults caused by the same bugs. A Linux prototype is implemented on an open‐source virtual machine monitor platform, Xen, and evaluated using four Web server applications that contain various types of bugs. The experimental results show that SHelp can recover server applications from these bugs in just a few seconds with modest performance overhead. Copyright © 2013 John Wiley & Sons, Ltd.
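The idea of sharing weighted rescue points through a three‐level store can be illustrated with a minimal Python sketch. This is not the paper's implementation: the summary does not describe the actual levels or cache updating algorithms. The class `RescueStore`, the bug and rescue point names, the rule "add 1 to the weight on a successful recovery, subtract 1 otherwise", and the write‐to‐slowest‐level‐then‐invalidate policy are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# One level maps a bug identifier to {rescue point name: weight}.
Level = Dict[str, Dict[str, int]]


class RescueStore:
    """Hypothetical three-level store; index 0 is fastest, index 2 is shared."""

    def __init__(self) -> None:
        self.levels: List[Level] = [{}, {}, {}]

    def lookup(self, bug_id: str) -> Optional[str]:
        """Return the highest-weighted rescue point known for this bug."""
        for depth, level in enumerate(self.levels):
            weights = level.get(bug_id)
            if weights:
                # Copy the hit into every faster level so that later
                # lookups for the same bug are served sooner.
                for faster in self.levels[:depth]:
                    faster[bug_id] = dict(weights)
                return max(weights, key=lambda name: weights[name])
        return None

    def record(self, bug_id: str, point: str, recovered: bool) -> None:
        """Adjust a rescue point's weight after a recovery attempt."""
        delta = 1 if recovered else -1  # assumed weighting rule
        shared = self.levels[-1].setdefault(bug_id, {})
        shared[point] = shared.get(point, 0) + delta
        # Invalidate cached copies so faster levels never serve stale weights.
        for faster in self.levels[:-1]:
            faster.pop(bug_id, None)


store = RescueStore()
store.record("bug-1", "parse_request", recovered=True)
store.record("bug-1", "read_header", recovered=False)
best = store.lookup("bug-1")  # picks the rescue point with the larger weight
```

In this sketch a second application instance that hits the same bug would call `lookup` and reuse the weights recorded by the first, which corresponds to the redundancy reduction the summary describes.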