Toward resilient algorithms and applications
Author(s) - Michael A. Heroux
Publication year - 2013
Publication title - OSTI OAI (U.S. Department of Energy Office of Scientific and Technical Information)
Language(s) - English
Resource type - Conference proceedings
DOI - 10.1145/2465813.2465814
Subject(s) - computer science, key (lock), reliability (semiconductor), state (computer science), scale (ratio), distributed computing, algorithm, computer security, physics, power (physics), quantum mechanics
Large-scale computing platforms have always dealt with unreliability coming from many sources. In contrast, applications for large-scale systems have generally assumed a fairly simplistic failure model: the computer is a reliable digital machine, with consistent execution time and infrequent failures that can be handled by occasionally storing a checkpoint of application state and restarting from that saved state if the system fails. Many computing experts and several key technology trends indicate that this simplistic application view of a high-end system is no longer feasible. Instead, algorithm and application developers must adopt more complex models of system reliability and adapt algorithms and implementations to be more resilient in the presence of failures and of increased failure detection and correction. In this talk we present the motivation for moving away from a checkpoint-restart-only model and discuss several new models for resilience, including latency tolerance, local recovery from local failure, and selective reliability. We also discuss strategies for designing new algorithms and applications, and some of the required system and programming environment features.
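The checkpoint-restart model described in the abstract can be illustrated with a minimal sketch. This is not code from the talk: the function name run_with_checkpoints, the SimulatedFailure exception, the fail_at parameter, and the toy workload (summing integers) are all assumptions introduced here only to show the pattern of periodically saving state and rolling back to the last saved state after a failure.

```python
import copy


class SimulatedFailure(Exception):
    """Hypothetical stand-in for a system failure (illustration only)."""


def run_with_checkpoints(n_steps, checkpoint_every, fail_at=()):
    """Sum the integers 0..n_steps-1 with periodic checkpoints.

    fail_at lists step indices at which a failure is injected once each.
    Returns (total, restarts).
    """
    state = {"step": 0, "total": 0}
    checkpoint = copy.deepcopy(state)      # initial checkpoint
    pending_failures = set(fail_at)
    restarts = 0

    while state["step"] < n_steps:
        try:
            step = state["step"]
            if step in pending_failures:
                # Inject each failure only once so the run can progress.
                pending_failures.discard(step)
                raise SimulatedFailure(step)
            state["total"] += step
            state["step"] = step + 1
            if state["step"] % checkpoint_every == 0:
                checkpoint = copy.deepcopy(state)   # save application state
        except SimulatedFailure:
            # Global rollback: discard all work since the last checkpoint.
            state = copy.deepcopy(checkpoint)
            restarts += 1

    return state["total"], restarts


if __name__ == "__main__":
    print(run_with_checkpoints(10, 3, fail_at={4, 8}))
```

In this sketch every failure discards all work done since the last checkpoint, however small the fault. That recomputation cost is one reason the abstract argues for complementary models such as local recovery from local failure.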