z-logo
open-access-imgOpen Access
Strategies for Fault Tolerance in Multicomponent Applications
Author(s) -
Aniruddha G. Shet,
Wael Elwasif,
Samantha S. Foley,
Byung H. Park,
David E. Bernholdt,
Randall Bramley
Publication year - 2011
Publication title -
procedia computer science
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.334
H-Index - 76
ISSN - 1877-0509
DOI - 10.1016/j.procs.2011.04.249
Subject(s) - computer science , backplane , fault tolerance , distributed computing , supercomputer , dissemination , context (archaeology) , reliability (semiconductor) , publication , event (particle physics) , reliability engineering , operating system , telecommunications , paleontology , power (physics) , physics , quantum mechanics , advertising , engineering , business , biology
This paper discusses on-going work with the Integrated Plasma Simulator (IPS), a framework for coupled multiphysics simulations of plasmas, to allow simulations to run through the loss of nodes on which the simulation is executing.While many different techniques are available to improve the fault tolerance of computational science applications on high-performance computer systems, checkpoint/restart (C/R) remains virtually the only one that see widespread use in practice. Our focus here is to augment the traditional C/R approach with additional techniques that can provide a more localized and tailored response to faults based on the ability to restart failed tasks on an individual basis, and the use of information external to the application itself in order to guide decision-making, in many cases avoiding the need to stop and restart the entire simulation. This capability involves several features within the IPS framework, and leverages the Fault Tolerance Backplane, a publish/subscribe event service to disseminate fault-related information throughout HPC systems, to obtain information from the Reliability, Availability and Serviceability (RAS) subsystem of the HPC system.This work is described in the context of Cray XT-series computer systems for concreteness, but is applicable to other environments as well. As part of the analysis of this work, we discuss the requirements to generalize this approach to other complex simulation applications beyond the Integrated Plasma Simulator

The content you want is available to Zendy users.

Already have an account? Click here to sign in.
Having issues? You can contact us here
Accelerating Research

Address

John Eccles House
Robert Robinson Avenue,
Oxford Science Park, Oxford
OX4 4GP, United Kingdom