Improved Failure Detection and Propagation Mechanisms for MPI
Author(s) - Pedro Henrique Di Francia Rosso, Emílio Francesquini
Publication year - 2021
Language(s) - English
Resource type - Conference proceedings
DOI - 10.5753/eradsp.2021.16702
Subject(s) - software portability, computer science, message passing interface, scalability, overhead (engineering), fault tolerance, message passing, supercomputer, distributed computing, interface (matter), parallel computing, operating system, bubble, maximum bubble pressure method
The Message Passing Interface (MPI) standard is widely used in High-Performance Computing (HPC) systems. Because such systems employ a large number of computing nodes, failures become more frequent, which makes Fault Tolerance (FT) a concern. Two essential components of FT are Failure Detection (FD) and Failure Propagation (FP). This paper proposes improvements to existing FD and FP mechanisms that aim for greater portability and scalability with low overhead. Results show that the proposed methods perform better than, or at least similarly to, existing methods while remaining portable to any MPI distribution that complies with the standard.
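The claim that more nodes lead to more frequent failures can be illustrated with a short calculation. The sketch below is not taken from the paper: it assumes node failures are independent and exponentially distributed, in which case the mean time between failures (MTBF) of the whole system is the per-node MTBF divided by the node count. The function name and the five-year per-node MTBF are hypothetical values chosen for illustration only.

```python
def system_mtbf_hours(node_mtbf_hours: float, num_nodes: int) -> float:
    """Return the system-wide MTBF in hours.

    Assumption (not from the paper): node failures are independent and
    exponentially distributed, so failure rates add up across nodes.
    """
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    return node_mtbf_hours / num_nodes


# Hypothetical per-node MTBF of five years, expressed in hours.
NODE_MTBF_HOURS = 5 * 365 * 24

if __name__ == "__main__":
    for nodes in (1, 100, 10_000):
        mtbf = system_mtbf_hours(NODE_MTBF_HOURS, nodes)
        print(f"{nodes:>6} nodes: a failure every {mtbf:.2f} hours on average")
```

Under these assumptions, a machine with thousands of nodes can expect a failure every few hours even when each individual node is reliable, which is why fast detection and propagation of failures matter at scale.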
