Toward Local Failure Local Recovery Resilience Model using MPI-ULFM
Author(s) -
Keita Teranishi,
Michael A. Heroux
Publication year - 2014
Publication title -
osti oai (u.s. department of energy office of scientific and technical information)
Language(s) - English
Resource type - Conference proceedings
DOI - 10.1145/2642769.2642774
Subject(s) - computer science , scalability , resilience (materials science) , distributed computing , process (computing) , software , parallel computing , operating system , thermodynamics , physics
The current system reaction to the loss of a single MPI process is to kill all the remaining processes and restart the application from the most recent checkpoint. This approach will become unfeasible for future extreme scale systems. We address this issue using an emerging resilient computing model called Local Failure Local Recovery (LFLR) that provides application developers with the ability to recover locally and continue application execution when a process is lost. We discuss the design of our software framework to enable the LFLR model using MPI-ULFM and demonstrate the resilient version of MiniFE that achieves a scalable recovery from process failures.
Accelerating Research
Robert Robinson Avenue,
Oxford Science Park, Oxford
OX4 4GP, United Kingdom
Address
John Eccles HouseRobert Robinson Avenue,
Oxford Science Park, Oxford
OX4 4GP, United Kingdom