Orchestrating Fault Prediction with Live Migration and Checkpointing


Open Access

Author(s) - Subhendu Behera, Lipeng Wan, Frank Mueller, Matthew Wolf, Scott Klasky

Publication year - 2020

Publication title - OSTI OAI (U.S. Department of Energy Office of Scientific and Technical Information)

Language(s) - English

Resource type - Conference proceedings

ISBN - 978-1-4503-7052-3

DOI - 10.1145/3369583.3392672

Subject(s) - overhead (engineering), computer science, supercomputer, fault tolerance, summit, parallel computing, distributed computing, embedded system, operating system, physical geography, geography

Checkpoint/Restart (C/R) is widely used to provide fault tolerance on High-Performance Computing (HPC) systems. However, Parallel File System (PFS) overhead and failure uncertainty cause significant application overhead. This paper develops an adaptive multi-level C/R model that incorporates a failure prediction and analysis model, which orchestrates failure prediction, checkpointing, checkpoint frequency, and proactive live migration along with the additional benefit of Burst Buffers (BB). It effectively reduces the overheads due to failures, checkpointing, and recovery. Simulation results for the Summit supercomputer yield a reduction of ~20%-86% in application overhead due to BBs, orchestrated failure prediction, and migration. We also observe a ~29% decrease in checkpoint writes to BBs, which can increase the longevity of the BB storage devices.
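
The abstract does not give the paper's model, so the following is only a rough sketch of why checkpoint cost and failure prediction both affect checkpoint frequency. It uses Young's classic first-order approximation for the checkpoint interval, which is a stand-in and not the adaptive multi-level model the authors develop. The function names, the example numbers, and the assumption that predicted failures are fully handled by live migration (so that only unpredicted failures count toward the effective mean time between failures) are all hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval:
    sqrt(2 * checkpoint cost * MTBF). Not the paper's model."""
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def effective_mtbf(mtbf_s: float, recall: float) -> float:
    """Assumption: a fraction `recall` of failures is predicted and avoided
    by proactive migration, so only the remaining (1 - recall) fraction
    forces a rollback to the last checkpoint."""
    if not 0.0 <= recall < 1.0:
        raise ValueError("recall must be in [0, 1)")
    return mtbf_s / (1.0 - recall)


if __name__ == "__main__":
    mtbf = 10_000.0        # hypothetical system MTBF in seconds
    pfs_cost = 200.0       # hypothetical cost of a checkpoint to the PFS
    bb_cost = 50.0         # hypothetical cost of a checkpoint to a burst buffer

    print("PFS interval (s):", young_interval(pfs_cost, mtbf))
    print("BB interval (s):", young_interval(bb_cost, mtbf))
    # With 75% of failures predicted and migrated away, checkpoints to the
    # burst buffer can be spaced further apart, which means fewer writes.
    print("BB interval with prediction (s):",
          young_interval(bb_cost, effective_mtbf(mtbf, 0.75)))
```

Under these assumptions, a cheaper checkpoint target shortens the interval (checkpointing more often costs less), while better failure prediction lengthens it, which is consistent in direction with the reduction in burst buffer writes reported above.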
