ULFM-MPI Implementation of a Resilient Task-Based Partial Differential Equations Preconditioner
Author(s) -
Francesco Rizzi,
Karla Morris,
Khachik Sargsyan,
Paul Mycek,
Cosmin Safta,
Bert Debusschere,
Olivier Le Maître,
Omar Knio
Publication year - 2016
Publication title -
OSTI OAI (U.S. Department of Energy Office of Scientific and Technical Information)
Language(s) - English
Resource type - Conference proceedings
DOI - 10.1145/2909428.2909429
Subject(s) - preconditioner , computer science , task (project management) , partial differential equation , domain decomposition methods , domain (mathematical analysis) , multigrid method , parallel computing , algorithm , mathematics , mathematical analysis , physics , finite element method , iterative method , management , thermodynamics , economics
We present a task-based domain-decomposition preconditioner for partial differential equations (PDEs) that is resilient to both silent data corruption (SDC) and hard faults. The algorithm reformulates the PDE as a sampling problem, followed by a regression-based solution update that is resilient to SDC. We adopt a server-client model implemented with User Level Failure Mitigation MPI (MPI-ULFM). All state information is held by the servers, while clients serve only as computational units. The task-based nature of the algorithm and the capabilities of ULFM are complemented at the algorithm level by support for missing tasks, making the application resilient to hard faults affecting the clients. Weak and strong scaling tests on up to ~115k cores show efficiencies above 90%, demonstrating that the application is suitable for large-scale runs. We demonstrate the resilience of the application on a 2D elliptic PDE by injecting SDC with a random single bit-flip model and hard faults in the form of client crashes. In all cases, the application converges to the correct solution. We analyze the overhead caused by the faults and show that, for the test problem considered, the overhead due to SDC is minimal compared with that due to hard faults.
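
The abstract mentions injecting SDC with a random single bit-flip model. The paper's injection code is not reproduced here, so the following is only a minimal sketch of what such a model could look like, assuming the corrupted quantity is an IEEE 754 double and that every one of its 64 bits is equally likely to be flipped. The helper names `flip_random_bit` and `hamming_distance` are hypothetical and chosen for illustration.

```python
import random
import struct


def flip_random_bit(value: float, rng: random.Random) -> float:
    """Return `value` with one uniformly chosen bit of its 64-bit encoding inverted.

    Illustrative assumption: all 64 bits (sign, exponent, mantissa) are
    equally likely targets. The abstract does not specify the distribution.
    """
    # Reinterpret the double as an unsigned 64-bit integer.
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    bit_index = rng.randrange(64)
    # XOR with a one-hot mask inverts exactly one bit.
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit_index)))
    return corrupted


def hamming_distance(a: float, b: float) -> int:
    """Count the bits that differ between the encodings of two doubles."""
    (x,) = struct.unpack("<Q", struct.pack("<d", a))
    (y,) = struct.unpack("<Q", struct.pack("<d", b))
    return bin(x ^ y).count("1")


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only so the demo is repeatable
    original = 1.0
    corrupted = flip_random_bit(original, rng)
    print(original, corrupted, hamming_distance(original, corrupted))
```

Depending on which bit is hit, the corrupted value may differ from the original by a negligible amount (low mantissa bits) or by many orders of magnitude (exponent bits), which is why a solution update that tolerates outliers matters under this fault model.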
