Fault Tolerance Assistant (FTA): An Exception Handling Programming Model for MPI Applications
Author(s) - Aiman Fang, Ignacio Laguna, Kento Sato, Tanzima Islam, Kathryn Mohror
Publication year - 2016
Language(s) - English
Resource type - Reports
DOI - 10.2172/1258538
Subject(s) - computer science , fault tolerance , exploit , resilience (materials science) , distributed computing , implementation , process (computing) , programming paradigm , process migration , parallel computing , message passing , software fault tolerance , supercomputer , code (set theory) , embedded system , operating system , computer security , programming language , set (abstract data type) , physics , thermodynamics
Future high-performance computing systems may fail more frequently as their scale and complexity rapidly increase. Resilience to faults has therefore become a major challenge for large-scale applications running on supercomputers, and the prevalent MPI applications need fault tolerance support. Among failure scenarios, process failures are among the most severe because they usually terminate the application. However, widely used MPI implementations do not provide fault tolerance mechanisms. We propose FTA-MPI (Fault Tolerance Assistant MPI), a programming model that supports failure detection, failure notification, and recovery. Specifically, FTA-MPI uses a try/catch model that enables failure localization and transparent recovery of process failures in MPI applications. We demonstrate FTA-MPI with synthetic applications and the molecular dynamics code CoMD, and show that it offers high programmability and enables convenient, flexible recovery from process failures.
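
The try/catch idea described in the abstract can be illustrated with a small simulation. The sketch below is not FTA-MPI's API and does not use MPI at all: `ProcessFailure`, `exchange`, and `run` are hypothetical names invented for illustration, the failure is injected by hand, and the checkpoint-and-retry recovery is only one assumed strategy among those such a model might allow. It shows how an exception that carries the failed rank localizes the failure, and how the handler recovers without restarting the whole computation.

```python
class ProcessFailure(Exception):
    """Hypothetical exception signalling that a peer process has failed."""

    def __init__(self, failed_rank, step):
        super().__init__(f"rank {failed_rank} failed at step {step}")
        self.failed_rank = failed_rank  # which process failed (localization)
        self.step = step                # where in the computation it failed


def exchange(step, pending_failures):
    """Stand-in for a communication call (not a real MPI routine).

    Raises ProcessFailure once for each step listed in pending_failures.
    """
    if step in pending_failures:
        raise ProcessFailure(pending_failures.pop(step), step)


def run(num_steps, fail_at, checkpoint_every=2):
    """Run num_steps iterations, rolling back to a checkpoint on failure."""
    pending = dict(fail_at)  # copy so the caller's dict is not modified
    state = 0                # application state (a simple counter here)
    checkpoint = (0, 0)      # (step to resume from, state at that step)
    recovered = []           # ranks whose failures were handled
    step = 0
    while step < num_steps:
        try:
            exchange(step, pending)
            state += 1
            step += 1
            if step % checkpoint_every == 0:
                checkpoint = (step, state)
        except ProcessFailure as failure:
            # Recovery block: record the failed rank, then roll back.
            recovered.append(failure.failed_rank)
            step, state = checkpoint
    return state, recovered


if __name__ == "__main__":
    final_state, handled = run(6, {3: 1})
    print(final_state, handled)
```

In this toy run, the simulated failure of rank 1 at step 3 is caught, the loop resumes from the checkpoint taken after step 2, and the computation still completes all six steps.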
