Exploiting Data-Flow for Fault-Tolerance in a Wide-Area Parallel System
Author(s) - Anh Nguyen-Tuong, Andrew S. Grimshaw, Mark Hyett
Publication year - 1996
Language(s) - English
DOI - 10.1109/srds.1996.10001
Wide-area parallel processing systems will soon be available to researchers to solve a range of problems. In these systems, it is certain that host failures and other faults will be a common occurrence. Unfortunately, most parallel processing systems have not been designed with fault-tolerance in mind. Mentat is a high-performance object-oriented parallel processing system that is based on an extension of the data-flow model. The functional nature of data-flow enables both parallelism and fault-tolerance. In this paper, we exploit the data-flow underpinning of Mentat to provide easy-to-use and transparent fault-tolerance. We present results on both a small-scale network and a wide-area heterogeneous environment that consists of three sites: the National Center for Supercomputing Applications, the University of Virginia and the NASA Langley Research Center.