Fault Tolerance Schemes for Global Load Balancing in X10
Author(s) -
Claudia Fohry,
Marco Bungart,
Jonas Posner
Publication year - 2015
Publication title -
scalable computing practice and experience
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.192
H-Index - 18
ISSN - 1895-1767
DOI - 10.12694/scpe.v16i2.1088
Subject(s) - computer science , asynchronous communication , scalability , fault tolerance , distributed computing , load balancing (electrical power) , overhead (engineering) , parallel computing , computer network , operating system , grid , geometry , mathematics
Scalability postulates fault tolerance to be efficient. One approach handles permanent node failures at user level. It is supported by Resilient X10, a Partitioned Global Address Space language that throws an exception when a place fails. We consider task pools, which are a widely used pattern for load balancing of irregular applications, and refer to the variant that is implemented in the Global Load Balancing framework GLB of X10. Here, each worker maintains a private pool and supports cooperative work stealing. Victim selection and termination detection follow the lifeline scheme. Tasks may generate new tasks dynamically, are free of side-effects, and their results are combined by reduction. We consider a single worker per node, and assume that failures are rare and uncorrelated. The paper introduces two fault tolerance schemes. Both are based on regular backups of the local task pool contents, which are written to the main memory of another worker and updated in the event of stealing. The first scheme mainly relies on synchronous communication. The second scheme deploys asynchronous communication, and significantly improves on the first scheme efficiency and robustness. Both schemes have been implemented by extending the GLB source code. Experiments were run with the Unbalanced Tree Search (UTS) and Betweenness Centrality benchmarks. For UTS on 128 nodes, for instance, we observed an overhead of about 81% with the synchronous scheme and about 7% with the asynchronous scheme. The protocol overhead for a place failure was negligible.
Accelerating Research
Robert Robinson Avenue,
Oxford Science Park, Oxford
OX4 4GP, United Kingdom
Address
John Eccles HouseRobert Robinson Avenue,
Oxford Science Park, Oxford
OX4 4GP, United Kingdom