Open Access
Evaluating operating system vulnerability to memory errors.
Author(s) - Kurt Brian Ferreira, Patrick G. Bridges, Kevin Pedretti, Frank Mueller, David Fiala, Ronald B. Brightwell
Publication year - 2012
Language(s) - English
Resource type - Reports
DOI - 10.2172/1044952
Subject(s) - computer science, memory footprint, scalability, memory errors, node (physics), virtual memory, reliability engineering, vulnerability (computing), reliability (semiconductor), software, operating system, memory management, embedded system, overlay, engineering, computer security, philosophy, recall, linguistics, power (physics), physics, structural engineering, quantum mechanics
Reliability is of great concern to the scalability of extreme-scale systems. Of particular concern are soft errors in main memory, which are a leading cause of failures on current systems and are predicted to be the leading cause on future systems. While great effort has gone into designing algorithms and applications that can continue to make progress in the presence of these errors without restarting, the most critical software running on a node, the operating system (OS), is currently left relatively unprotected. OS resiliency is of particular importance because, although this software typically occupies only a small portion of a compute node's physical memory, recent studies show more memory errors in this region than in the remainder of the system. In this paper, we investigate the soft error vulnerability of two operating systems used in current and future high-performance computing systems: Kitten, the lightweight kernel developed at Sandia National Laboratories, and CLE, a high-performance Linux-based operating system developed by Cray. For each of these platforms, we outline major structures and subsystems that are vulnerable to soft errors and describe methods that could be used to reconstruct damaged state. Our results show that the Kitten lightweight operating system may be an easier target to harden against memory errors because of its smaller memory footprint, largely deterministic state, and simpler system structure.
