Open Access
A Lightweight and Flexible Tool for Distinguishing Between Hardware Malfunctions and Program Bugs in Debugging Large-Scale Programs
Author(s) -
Guozhen Zhang,
Yi Liu,
Hailong Yang,
Depei Qian
Publication year - 2018
Publication title -
IEEE Access
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.587
H-Index - 127
ISSN - 2169-3536
DOI - 10.1109/access.2018.2882394
Subject(s) - aerospace, bioengineering, communication, networking and broadcast technologies, components, circuits, devices and systems, computing and processing, engineered materials, dielectrics and plasmas, engineering profession, fields, waves and electromagnetics, general topics for engineers, geoscience, nuclear engineering, photonics and electrooptics, power, energy and industry applications, robotics and control systems, signal processing and analysis, transportation
In this paper, we propose a new technique for determining whether a program failure is caused by a hardware malfunction or a program bug. The technique mitigates the impact that a shorter mean time between failures will have on debugging on future exascale supercomputers, and it improves the productivity of debugging large-scale parallel programs. It detects program failures by observing abnormal message-passing behavior with distributed monitors, and it leverages an event-driven mechanism to trigger global status checking among different node groups concurrently. Both coarse-grained execution snapshots and fine-grained failure events can be provided for further failure diagnosis and bug analysis. We implement this technique as a user-space library named Failure Cause Resolver (FCR). Experimental results on the Tianhe-2 supercomputer demonstrate that FCR's failure-detection latency is acceptable and that its overhead is negligible. In addition, FCR does not require administrative privileges and can be easily integrated into existing large-scale parallel programs.
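
The general idea of a monitor that watches message-passing activity and then triggers a status check can be illustrated with a minimal sketch. This is not FCR's implementation: the names `Monitor`, `classify`, and `node_is_healthy`, the timeout-based notion of "abnormal" behavior, and the classification rule (a silent peer on an unhealthy node suggests hardware, a silent peer on a healthy node suggests a program bug) are all assumptions made for illustration, since the abstract does not specify them.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List, Tuple


class Cause(Enum):
    NONE = "no failure detected"
    HARDWARE = "suspected hardware malfunction"
    PROGRAM = "suspected program bug"


@dataclass
class Monitor:
    """Hypothetical per-group monitor tracking when each peer last sent a message."""
    timeout: float
    last_seen: Dict[int, float] = field(default_factory=dict)

    def record_message(self, peer: int, now: float) -> None:
        # Called whenever a message from `peer` is observed.
        self.last_seen[peer] = now

    def silent_peers(self, now: float) -> List[int]:
        # Peers with no observed message for longer than the timeout.
        return sorted(p for p, t in self.last_seen.items()
                      if now - t > self.timeout)


def classify(monitor: Monitor, now: float,
             node_is_healthy: Callable[[int], bool]) -> Tuple[Cause, List[int]]:
    """Assumed rule: the status check runs only when silence is detected."""
    silent = monitor.silent_peers(now)
    if not silent:
        return Cause.NONE, []
    # Status check (assumed): ask whether each silent peer's node still responds.
    unhealthy = [p for p in silent if not node_is_healthy(p)]
    if unhealthy:
        return Cause.HARDWARE, unhealthy
    return Cause.PROGRAM, silent


# Usage: three peers report at t=0; only peers 0 and 1 report again at t=8.
monitor = Monitor(timeout=5.0)
for peer in (0, 1, 2):
    monitor.record_message(peer, now=0.0)
monitor.record_message(0, now=8.0)
monitor.record_message(1, now=8.0)

result_hw = classify(monitor, now=10.0, node_is_healthy=lambda p: p != 2)
result_bug = classify(monitor, now=10.0, node_is_healthy=lambda p: True)
```

In this sketch the status check is event-driven in the sense that it is invoked only after a monitor observes silence, rather than by polling every node continuously.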
