z-logo
Premium
A fault‐tolerant routing strategy for k ‐ary n ‐direct s ‐indirect topologies based on intermediate nodes
Author(s) -
Peñaranda Roberto,
Gómez María Engracia,
López Pedro,
Gunnar Gran Ernst,
Skeie Tor
Publication year - 2017
Publication title -
concurrency and computation: practice and experience
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.309
H-Index - 67
eISSN - 1532-0634
pISSN - 1532-0626
DOI - 10.1002/cpe.4065
Subject(s) - fault tolerance , computer science , network topology , node (physics) , interconnection , network packet , distributed computing , routing (electronic design automation) , computer network , topology (electrical circuits) , engineering , electrical engineering , structural engineering
Summary Exascale computing systems are being built with thousands of nodes. The high number of components of these systems significantly increases the probability of failure. A key component for them is the interconnection network. If failures occur in the interconnection network, they may isolate a large fraction of the machine. For this reason, an efficient fault‐tolerant mechanism is needed to keep the system interconnected, even in the presence of faults. A recently proposed topology for these large systems is the hybrid k ‐ary n ‐direct s ‐indirect family that provides optimal performance and connectivity at a reduced hardware cost. This paper presents a fault‐tolerant routing methodology for the k ‐ary n ‐direct s ‐indirect topology that degrades performance gracefully in presence of faults and tolerates a large number of faults without disabling any healthy computing node. In order to tolerate network failures, the methodology uses a simple mechanism. For any source‐destination pair, if necessary, packets are forwarded to the destination node through a set of intermediate nodes (without being ejected from the network) with the aim of circumventing faults. The evaluation results shows that the proposed methodology tolerates a large number of faults. For instance, it is able to tolerate more than 99.5% of fault combinations when there are 10 faults in a 3‐D network with 1000 nodes using only 1 intermediate node and more than 99.98% if 2 intermediate nodes are used. Furthermore, the methodology offers a gracious performance degradation. As an example, performance degrades only by 1% for a 2‐D network with 1024 nodes and 1% faulty links.

This content is not available in your region!

Continue researching here.

Having issues? You can contact us here
Accelerating Research

Address

John Eccles House
Robert Robinson Avenue,
Oxford Science Park, Oxford
OX4 4GP, United Kingdom