Premium
Trading off logging overhead and coordinating overhead to achieve efficient rollback recovery
Author(s) -
Yang JinMin,
Li Kin Fun,
Li WenWei,
Zhang DaFang
Publication year - 2009
Publication title -
concurrency and computation: practice and experience
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.309
H-Index - 67
eISSN - 1532-0634
pISSN - 1532-0626
DOI - 10.1002/cpe.1364
Subject(s) - scalability , overhead (engineering) , computer science , rollback , logging , fault tolerance , partition (number theory) , heuristics , distributed computing , embedded system , operating system , database transaction , database , ecology , mathematics , combinatorics , biology
In the rollback recovery of large‐scale long‐running applications in a distributed environment, pessimistic message logging protocols enable failed processes to recover independently, though at the expense of logging every message synchronously during fault‐free execution. In contrast, coordinated checkpointing protocols avoid message logging, but they are poor in scalability with a sharply increased coordinating overhead as the system grows. With the aim of achieving efficient rollback recovery by trading off logging overhead and coordinating overhead, this paper suggests a partitioning of the system into clusters, and then presents a scheme to implement the conversion between these overheads. Using the proposed conversion, coordination can be introduced to reduce the unbearable logging overhead found in some systems, whereas proper logging can be employed to alleviate the unacceptable coordinating overhead in others. Furthermore, heuristics are introduced to address the issue of how to partition the system into clusters in order to speed up the recovery process and to improve recovery efficiency. Performance evaluation results indicate that our scheme can lower the overall system overhead effectively. Copyright © 2008 John Wiley & Sons, Ltd.