Scalable Failure Masking for Stencil Computations using Ghost Region Expansion and Cell to Rank Remapping
Author(s) - Marc Gamell, Keita Teranishi, Hemanth Kolla, Jackson R. Mayo, Michael A. Heroux, Jacqueline Chen, Manish Parashar
Publication year - 2017
Publication title - SIAM Journal on Scientific Computing
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.674
H-Index - 147
eISSN - 1095-7197
pISSN - 1064-8275
DOI - 10.1137/16m1081610
Subject(s) - computer science, scalability, stencil, parallel computing, titan (rocket family), overhead (engineering), distributed computing, resilience (materials science), fault tolerance, computation, node (physics), masking (illustration), computer engineering, algorithm, computational science, programming language, art, physics, structural engineering, database, engineering, visual arts, thermodynamics, aerospace engineering
In order to achieve exascale systems, application resilience needs to be addressed. Some programming models, such as task-DAG (directed acyclic graphs) architectures, currently embed resilience features whereas traditional SPMD (single program, multiple data) and message-passing models do not. Since a large part of the community's code base follows the latter models, it is still required to take advantage of application characteristics to minimize the overheads of fault tolerance. To that end, this paper explores how recovering from hard process/node failures in a local manner is a natural approach for certain applications to obtain resilience at lower costs in faulty environments. In particular, this paper targets enabling online, semitransparent local recovery for stencil computations on current leadership-class systems as well as presents programming support and scalable runtime mechanisms. Also described and demonstrated in this paper is the effect of failure masking, which allows the effective reduct...