Open Access
Lightweight storage and overlay networks for fault tolerance.
Author(s) - Ron A. Oldfield
Publication year - 2010
Publication title - OSTI OAI (U.S. Department of Energy Office of Scientific and Technical Information)
Language(s) - English
Resource type - Reports
DOI - 10.2172/989384
Subject(s) - computer science, scalability, fault tolerance, distributed computing, overhead (engineering), cache, overlay, overlay network, node (physics), computer network, embedded system, operating system, engineering, the internet, structural engineering
The next generation of capability-class, massively parallel processing (MPP) systems is expected to have hundreds of thousands to millions of processors. In such environments, it is critical to have fault-tolerance mechanisms, including checkpoint/restart, that scale with the size of applications and the percentage of the system on which the applications execute. For application-driven, periodic checkpoint operations, the state of the art does not provide a scalable solution. For example, on today's massive-scale systems that execute applications which consume most of the memory of the employed compute nodes, checkpoint operations generate I/O that consumes nearly 80% of the total I/O usage. Motivated by this observation, this project aims to improve I/O performance for application-directed checkpoints through the use of lightweight storage architectures and overlay networks. Lightweight storage provides direct access to underlying storage devices. Overlay networks provide caching and processing capabilities in the compute-node fabric. The combination has the potential to significantly reduce I/O overhead for large-scale applications. This report describes our combined efforts to model and understand overheads for application-directed checkpoints, as well as the implementation and performance analysis of a checkpoint service that uses available compute nodes as a network cache for checkpoint operations.
