Redundant computing for exascale systems.

Author(s) - Jon Stearley, Rolf Riesen, James Laros, Kurt Brian Ferreira, Kevin Pedretti, Ron A. Oldfield, Ronald B. Brightwell

Open Access

Publication year - 2010

Language(s) - English

Resource type - Reports

DOI - 10.2172/1011662

Subject(s) - redundancy (engineering), computer science, interrupt, exascale computing, fault tolerance, resilience (materials science), distributed computing, embedded system, reliability engineering, parallel computing, supercomputer, operating system, engineering, physics, thermodynamics, microcontroller

Exascale systems will have hundreds of thousands of compute nodes and millions of components, which increases the likelihood of faults. Today, applications use checkpoint/restart to recover from these faults. Even under ideal conditions, applications running on more than 50,000 nodes will spend more than half of their total running time saving checkpoints, restarting, and redoing work that was lost. Redundant computing is a method that allows an application to continue working even when failures occur. Instead of each failure causing an application interrupt, multiple failures can be absorbed by the application until redundancy is exhausted. In this paper we present a method to analyze the benefits of redundant computing, present simulation results of its cost, and compare it to other proposed methods for fault resilience.
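The idea that failures are absorbed "until redundancy is exhausted" can be illustrated with a small calculation. The sketch below is not the report's analysis method; it is a minimal model under stated assumptions: every logical node is run as a pair of replicas (dual redundancy), failed nodes are never repaired or replaced, each failure strikes a uniformly random surviving node, and the application is interrupted only when both replicas of the same pair have failed. The function name `expected_failures_to_interrupt` is hypothetical.

```python
def expected_failures_to_interrupt(pairs: int) -> float:
    """Expected number of node failures until some replica pair is fully lost.

    Assumptions (illustrative only, not taken from the report):
      - 2 * pairs physical nodes, grouped into `pairs` replica pairs
      - failed nodes are not repaired or replaced
      - each failure hits a uniformly random surviving node
    """
    if pairs < 1:
        raise ValueError("pairs must be at least 1")

    total_nodes = 2 * pairs
    survive = 1.0   # P(application still running after k failures), k = 0
    expected = 0.0

    # E[T] = sum over k >= 0 of P(T > k), where T is the index of the
    # failure that interrupts the application. After more than `pairs`
    # failures some pair must be fully lost, so the sum stops there.
    for k in range(pairs + 1):
        expected += survive
        # After k failures in k distinct pairs, total_nodes - k nodes
        # survive and k of them are the last replica of their pair.
        # The next failure is harmless only if it avoids those k nodes.
        survive *= (total_nodes - 2 * k) / (total_nodes - k)

    return expected


if __name__ == "__main__":
    # Without redundancy the first failure interrupts the application.
    for pairs in (1, 2, 1_000, 50_000):
        absorbed = expected_failures_to_interrupt(pairs)
        print(f"{pairs:>6} pairs: about {absorbed:.1f} failures per interrupt")
```

Under these assumptions a single pair always survives exactly one failure, and larger systems absorb more failures per interrupt on average, at the cost of doubling the node count. A fuller comparison with checkpoint/restart would also need failure rates and checkpoint costs, which this sketch does not model.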
