MCREngine: A scalable checkpointing system using data-aware aggregation and compression
Author(s) - Tanzima Zerin Islam, Kathryn Mohror, Saurabh Bagchi, Adam Moody, Bronis R. de Supinski, Rudolf Eigenmann
Publication year - 2013
Publication title - 2012 International Conference for High Performance Computing, Networking, Storage and Analysis
Language(s) - English
Resource type - Conference proceedings
SCImago Journal Rank - 0.363
H-Index - 56
eISSN - 2167-4337
pISSN - 2167-4329
ISBN - 978-1-4673-0806-9
DOI - 10.1109/sc.2012.77
Subject(s) - computing and processing
High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their state in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. These overheads force large-scale applications to checkpoint less frequently, which means more compute time is lost when a failure occurs. We alleviate this problem through a scalable checkpoint-restart system, MCRENGINE. MCRENGINE aggregates checkpoints from multiple application processes using knowledge of the data semantics available through widely used I/O libraries, e.g., HDF5 and netCDF, and then compresses them. Our novel scheme improves the compressibility of checkpoints by up to 115% over simple concatenation and compression. Our evaluation with large-scale application checkpoints shows that MCRENGINE reduces checkpointing overhead by up to 87% and restart overhead by up to 62% over a baseline with no aggregation or compression.
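
The contrast between simple concatenation and data-aware aggregation can be illustrated with a minimal sketch. This is not the MCRENGINE implementation: it assumes each process's checkpoint is a plain dictionary mapping a variable name to raw bytes (a stand-in for the variable metadata that HDF5 or netCDF would expose), it assumes every process writes the same set of variables, and it uses zlib as a stand-in compressor. The function names are hypothetical.

```python
import zlib
from collections import defaultdict


def concat_then_compress(checkpoints):
    """Baseline: append whole checkpoints one after another, then compress."""
    blob = b"".join(data for ckpt in checkpoints for data in ckpt.values())
    return zlib.compress(blob)


def data_aware_compress(checkpoints):
    """Group same-named variables across processes, then compress each group.

    Returns {variable name: (per-process sizes, compressed bytes)}.
    Assumption: every checkpoint contains every variable, so the position
    of a chunk within a group identifies the process it came from.
    """
    groups = defaultdict(list)
    for ckpt in checkpoints:
        for name, data in ckpt.items():
            groups[name].append(data)
    return {
        name: ([len(chunk) for chunk in chunks], zlib.compress(b"".join(chunks)))
        for name, chunks in groups.items()
    }


def restore(merged, num_procs):
    """Rebuild the per-process checkpoints from the merged, compressed groups."""
    checkpoints = [{} for _ in range(num_procs)]
    for name, (sizes, payload) in merged.items():
        raw = zlib.decompress(payload)
        offset = 0
        for rank, size in enumerate(sizes):
            checkpoints[rank][name] = raw[offset:offset + size]
            offset += size
    return checkpoints
```

Placing similar data from different processes next to each other gives the compressor longer runs of related values to work with, which is the intuition behind the compressibility gain reported above. How large the gain is in practice depends on the application data and the compressor, so this sketch makes no claim about ratios.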
