Open Access
Redundant computing for exascale systems.
Author(s) - Jon Stearley, Rolf Riesen, James H. Laros, Kurt Brian Ferreira, Kevin Pedretti, Ron A. Oldfield, Ronald B. Brightwell
Publication year - 2010
Language(s) - English
Resource type - Reports
DOI - 10.2172/1011662
Subject(s) - redundancy (engineering) , computer science , interrupt , exascale computing , fault tolerance , resilience (materials science) , distributed computing , embedded system , reliability engineering , parallel computing , supercomputer , operating system , engineering , physics , thermodynamics , microcontroller
Exascale systems will have hundreds of thousands of compute nodes and millions of components, which increases the likelihood of faults. Today, applications use checkpoint/restart to recover from these faults. Even under ideal conditions, applications running on more than 50,000 nodes will spend more than half of their total running time saving checkpoints, restarting, and redoing lost work. Redundant computing is a method that allows an application to continue working even when failures occur. Instead of each failure causing an application interrupt, multiple failures can be absorbed by the application until redundancy is exhausted. In this paper we present a method to analyze the benefits of redundant computing, present simulation results of the cost, and compare it to other proposed methods for fault resilience.
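
The trade-off described in the abstract can be illustrated with a rough sketch. This is not the report's analysis method or simulator. It substitutes Young's classic first-order approximation for the checkpoint interval and a simple random-failure model for dual redundancy. All function names and the sample parameters (a 5-year node MTBF, a 6-minute checkpoint) are hypothetical and chosen only for illustration.

```python
import math
import random


def system_mtbf(node_mtbf_hours: float, nodes: int) -> float:
    """System MTBF, assuming independent, exponentially distributed node failures."""
    return node_mtbf_hours / nodes


def young_interval(checkpoint_hours: float, mtbf_hours: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_hours * mtbf_hours)


def wasted_fraction(checkpoint_hours: float, mtbf_hours: float) -> float:
    """First-order estimate of time lost to checkpointing plus rework.

    Assumption: restart time is ignored and the checkpoint time is small
    relative to the MTBF, so this tends to underestimate the true waste.
    """
    tau = young_interval(checkpoint_hours, mtbf_hours)
    waste = checkpoint_hours / tau + tau / (2.0 * mtbf_hours)
    return min(waste, 1.0)


def failures_until_interrupt(logical_nodes: int, rng: random.Random) -> int:
    """Count node failures absorbed before dual redundancy is exhausted.

    Assumption: each logical node has two replicas, failures strike
    surviving physical nodes uniformly at random, and nothing is repaired.
    The application is interrupted when both replicas of one logical node
    have failed.
    """
    order = list(range(2 * logical_nodes))
    rng.shuffle(order)  # random order in which physical nodes fail
    hit = set()
    for count, physical in enumerate(order, start=1):
        logical = physical // 2  # replicas 2k and 2k+1 back logical node k
        if logical in hit:
            return count
        hit.add(logical)
    raise AssertionError("unreachable: some pair fails within n + 1 failures")


if __name__ == "__main__":
    mtbf = system_mtbf(node_mtbf_hours=43800.0, nodes=50000)
    print("system MTBF (hours):", mtbf)
    print("estimated wasted fraction:", wasted_fraction(0.1, mtbf))
    rng = random.Random(0)
    runs = [failures_until_interrupt(50000, rng) for _ in range(20)]
    print("mean failures absorbed before interrupt:", sum(runs) / len(runs))
```

Under these assumptions, a non-redundant application is interrupted by every failure, while the redundant one is interrupted only after a randomly chosen failure happens to hit the partner of a node that has already failed. The price is that the redundant run occupies twice as many physical nodes.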
