Open Access
rMPI : increasing fault resiliency in a message-passing environment.
Author(s) -
Jon Stearley,
James H. Laros,
Kurt Brian Ferreira,
Kevin Pedretti,
Ron A. Oldfield,
Rolf Riesen,
Ronald B. Brightwell
Publication year - 2011
Language(s) - English
Resource type - Reports
DOI - 10.2172/1012733
Subject(s) - scalability , computer science , replica , computation , fault tolerance , distributed computing , reliability (semiconductor) , consistency (knowledge bases) , message passing , scale (ratio) , limit (mathematics) , parallel computing , operating system , programming language , art , mathematical analysis , power (physics) , physics , mathematics , quantum mechanics , artificial intelligence , visual arts
As High-End Computing machines continue to grow in size, issues such as fault tolerance and reliability limit application scalability. Current techniques to ensure progress across faults, like checkpoint-restart, are unsuitable at these scales due to excessive overheads, which are predicted to more than double an application's time to solution. Redundant computation, long used in distributed and mission-critical systems, has been suggested as an alternative to checkpoint-restart on its own. In this paper we describe the rMPI library, which enables portable and transparent redundant computation for MPI applications. We detail the design of the library as well as two replica consistency protocols, outline the overheads of this library at scale on a number of real-world applications, outline the significant increase in an application's time to solution at extreme scale, and show the scenarios in which redundant computation makes sense.
