The STAR fault manager for distributed operating environments. Design, implementation and performance
Author(s) - Sens Pierre, Folliot Bertil
Publication year - 1998
Publication title - Software: Practice and Experience
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.437
H-Index - 70
eISSN - 1097-024X
pISSN - 0038-0644
DOI - 10.1002/(sici)1097-024x(199808)28:10<1079::aid-spe199>3.0.co;2-d
Subject(s) - computer science, unix, fault tolerance, redundancy (engineering), backup, operating system, software, star (game theory), software fault tolerance, embedded system, distributed computing, mathematical analysis, mathematics
This paper presents the design, implementation and performance evaluation of a software fault manager for distributed applications. Dubbed Star, it uses the natural redundancy existing in networks of workstations to offer a high level of fault tolerance. Fault management is transparent to the supported parallel applications. To improve the response time of fault‐tolerant applications, Star implements non‐blocking and incremental checkpointing to perform an efficient backup of process state. Moreover, Star is application independent and highly configurable. Star currently runs on top of SunOS and is easily portable to UNIX™‐like operating systems. The current implementation is based on independent checkpointing and message logging. Measurements show the efficiency and the limits of this implementation. The challenge is to show that a software approach to fault tolerance can be implemented efficiently in a standard networked environment. © 1998 John Wiley & Sons, Ltd.