Cooperative application/OS DRAM fault recovery.
Author(s) - Kurt Brian Ferreira, Patrick G. Bridges, Michael A. Heroux, Mark Frederick Hoemmen, Ronald B. Brightwell
Publication year - 2012
Publication title - OSTI OAI (U.S. Department of Energy Office of Scientific and Technical Information)
Language(s) - English
Resource type - Reports
DOI - 10.2172/1044954
Subject(s) - dram , computer science , fault tolerance , resilience (materials science) , embedded system , rollback , software , reliability engineering , distributed computing , operating system , parallel computing , computer hardware , database , engineering , physics , database transaction , thermodynamics
Exascale systems will present considerable fault-tolerance challenges to applications and system software. These systems are expected to suffer several hard and soft errors per day. Unfortunately, many fault-tolerance methods in use, such as rollback recovery, are unsuitable for many expected errors, for example DRAM failures. As a result, applications will need to address these resilience challenges to more effectively utilize future systems. In this paper, we describe work on a cross-layer application/OS framework to handle uncorrected memory errors. We illustrate the use of this framework through its integration with a new fault-tolerant iterative solver within the Trilinos library, and present initial convergence results.
