Using two-level stable storage for efficient checkpointing
Author(s) - L.M. Silva, João Gabriel Silva
Publication year - 1998
Publication title - IEE Proceedings - Software
Language(s) - English
Resource type - Journals
eISSN - 1463-9831
pISSN - 1462-5970
DOI - 10.1049/ip-sen:19982440
Subject(s) - computer science, overhead (engineering), reliability (semiconductor), fault tolerance, parallel computing, rollback, distributed computing, embedded system, operating system, database, database transaction, power (physics), physics, quantum mechanics
Checkpointing and rollback recovery is a very effective technique for tolerating failures. Usually, checkpoint data is saved on disk; however, in some situations the time taken to write the data to disk can represent a considerable performance overhead. Alternative solutions use main memory to hold the checkpoint data. The paper starts by presenting two main-memory checkpointing schemes: neighbour-based and parity checkpointing. Both schemes have been implemented and evaluated on a commercial parallel machine. The results show that neighbour-based checkpointing incurs a very low performance overhead and ensures fast recovery from partial failures. However, it cannot tolerate multiple or total failures of the system. To overcome this shortcoming, the authors propose a two-level stable storage that integrates neighbour-based with disk-based checkpointing. This approach combines the advantages of the two schemes: the efficiency of diskless checkpointing and the high reliability of disk-based checkpointing.
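The abstract names the schemes but does not give their mechanics, so the following is only a minimal sketch of the general ideas, not the authors' implementation. It assumes nodes arranged in a logical ring, checkpoints of equal length held as byte strings, and a byte-wise XOR parity; every function name here is hypothetical.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def neighbour_checkpoint(states: list[bytes]) -> list[bytes]:
    """Assumed ring layout: node i holds, in its own memory,
    a copy of the checkpoint of node (i - 1) mod n."""
    n = len(states)
    return [states[(i - 1) % n] for i in range(n)]


def recover_from_neighbour(backups: list[bytes], failed: int) -> bytes:
    """The failed node's checkpoint is held by node (failed + 1) mod n."""
    return backups[(failed + 1) % len(backups)]


def parity_checkpoint(states: list[bytes]) -> bytes:
    """Parity block: XOR of all node checkpoints."""
    return reduce(xor_bytes, states)


def recover_from_parity(states: list[bytes], parity: bytes, failed: int) -> bytes:
    """Rebuild one lost checkpoint by XOR-ing the parity block
    with the checkpoints of all surviving nodes."""
    survivors = [s for i, s in enumerate(states) if i != failed]
    return reduce(xor_bytes, survivors, parity)


def recovery_level(failed: set[int], n: int) -> str:
    """Two-level choice (an assumption of this sketch): recover from
    neighbour memory only if every failed node's backup holder is
    still alive; otherwise fall back to the disk checkpoint."""
    if failed and all((i + 1) % n not in failed for i in failed):
        return "memory"
    return "disk"


states = [b"aaaa", b"bbbb", b"cccc"]
backups = neighbour_checkpoint(states)
parity = parity_checkpoint(states)
restored_neighbour = recover_from_neighbour(backups, 1)
restored_parity = recover_from_parity(states, parity, 0)
```

In this sketch, a single failure is served from a neighbour's memory, while a failure that also takes out the backup holder, or a total failure, forces a fallback to disk. That mirrors the trade-off the abstract describes between fast diskless recovery and the higher reliability of disk-based checkpoints.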