Containment Domains: A Scalable, Efficient and Flexible Resilience Scheme for Exascale Systems
Author(s) - Jinsuk Chung, Ikhwan Lee, Michael B. Sullivan, Jee Ho Ryoo, Dong Wan Kim, Doe Hyun Yoon, Larry Kaplan, Mattan Erez
Publication year - 2013
Publication title - Scientific Programming
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.269
H-Index - 36
eISSN - 1875-919X
pISSN - 1058-9244
DOI - 10.1155/2013/473915
Subject(s) - containment (computer programming), computer science, scalability, distributed computing, trace (psycholinguistics), resilience (materials science), construct (python library), scheme (mathematics), semantics (computer science), state (computer science), computer network, programming language, database, mathematical analysis, linguistics, philosophy, physics, mathematics, thermodynamics
This paper describes and evaluates a scalable and efficient resilience scheme based on the concept of containment domains. Containment domains are a programming construct that enables applications to express resilience needs and to interact with the system to tune and specialize error detection, state preservation and restoration, and recovery schemes. Containment domains have weak transactional semantics and are nested to take advantage of the machine and application hierarchies and to enable hierarchical state preservation, restoration, and recovery. We evaluate the scalability and efficiency of containment domains using generalized trace-driven simulation and analytical modeling, and show that containment domains are superior to both checkpoint-restart and redundant-execution approaches.
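The preserve, detect, restore, and re-execute cycle described in the abstract, together with escalation from a nested domain to its parent, can be illustrated with a minimal Python sketch. This is not the authors' API: `run_domain`, `DetectedError`, the `detect` callback, and `max_retries` are all hypothetical names introduced here, and the state is assumed to be a plain dictionary.

```python
import copy


class DetectedError(Exception):
    """Hypothetical signal: an error was detected inside a domain."""


def run_domain(state, body, detect, max_retries=2):
    """Run `body` on `state` inside one (hypothetical) containment domain.

    The state is preserved on entry. After the body runs, `detect` checks
    for errors. On a detected error the preserved state is restored and
    the body is re-executed. When retries run out, the error is raised so
    that an enclosing domain can handle it.
    """
    preserved = copy.deepcopy(state)              # state preservation
    for _ in range(max_retries + 1):
        try:
            body(state)
            if detect(state):                     # error detection
                raise DetectedError("check failed")
            return state
        except DetectedError:
            state.clear()                         # state restoration
            state.update(copy.deepcopy(preserved))
    raise DetectedError("escalating to parent domain")


# Usage: the body corrupts the state on its first attempt only.
attempts = {"n": 0}


def flaky(state):
    attempts["n"] += 1
    state["x"] += 1
    if attempts["n"] == 1:
        state["x"] = -999                         # simulated corruption


result = run_domain({"x": 0}, flaky, detect=lambda s: s["x"] < 0)
```

Nesting follows from calling `run_domain` inside another domain's `body`: an error that the inner domain cannot recover from propagates as `DetectedError` and triggers restoration and re-execution in the outer one.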