Resilient gossip algorithms for collecting online management information in exascale clusters
Author(s) -
Barak Am,
Drezner Zvi,
Levy Ely,
Lieber Matthias,
Shiloh Am
Publication year - 2015
Publication title -
Concurrency and Computation: Practice and Experience
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.309
H-Index - 67
eISSN - 1532-0634
pISSN - 1532-0626
DOI - 10.1002/cpe.3465
Subject(s) - gossip, computer science, scalability, node (physics), resilience (materials science), distributed computing, computer network, protocol (science), theoretical computer science, algorithm, database, psychology, social psychology, physics, alternative medicine, structural engineering, pathology, engineering, thermodynamics, medicine
Summary - Management of forthcoming exascale clusters requires frequent collection of run‐time information about the nodes and the running applications. This paper presents a new paradigm for providing online information to the management system of scalable clusters, consisting of a large number of nodes and one or more masters that manage these nodes. We describe the details of resilient gossip algorithms for sharing local information within subsets of nodes and for sending global information to a master, which holds information on all the nodes. The presented algorithms are decentralized, scalable and resilient, working well even when some nodes fail, without needing any recovery protocol. The paper gives formal expressions for approximating the average ages of the local information at each node and the information collected by the master. It then shows that these results closely match the results of simulations and measurements on a real cluster. The paper also investigates the resilience of the algorithms and the impact on the average age when nodes or masters fail. The main outcome of this paper is that partitioning of large clusters can improve the quality of information available to the management system without increasing the number of messages per node. Copyright © 2015 John Wiley & Sons, Ltd.
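
Illustration (not from the paper) - The notion of "average age" under gossip can be made concrete with a toy simulation. The sketch below is a generic push-gossip model, not the algorithms the paper describes: it assumes synchronous rounds, no failures, one message per node per round sent to a uniformly random peer, and messages that carry the sender's full age vector. The function name `simulate_gossip` and its parameters are hypothetical.

```python
import random


def simulate_gossip(n_nodes, n_rounds, warmup=50, seed=0):
    """Toy push gossip: return the mean age (in rounds) of the information
    each node holds about the other nodes, averaged over n_rounds.

    Assumptions (illustrative only): synchronous rounds, no failures,
    one message per node per round, full age vector in every message.
    Requires n_nodes >= 2.
    """
    rng = random.Random(seed)
    # ages[i][j] = age of node i's newest information about node j
    ages = [[0] * n_nodes for _ in range(n_nodes)]
    total, samples = 0.0, 0

    for r in range(warmup + n_rounds):
        # Everything a node holds gets one round older,
        # except its own entry, which is always fresh.
        for i in range(n_nodes):
            ages[i] = [a + 1 for a in ages[i]]
            ages[i][i] = 0

        # Snapshot so that all sends in a round happen "simultaneously".
        snapshot = [row[:] for row in ages]

        for sender in range(n_nodes):
            # Pick a uniformly random peer other than the sender.
            receiver = rng.randrange(n_nodes - 1)
            if receiver >= sender:
                receiver += 1
            # The receiver keeps the fresher entry for every node.
            ages[receiver] = [
                min(a, b) for a, b in zip(ages[receiver], snapshot[sender])
            ]

        if r >= warmup:
            # Own entries are 0, so they add nothing to the sum;
            # divide by the number of (i, j) pairs with i != j.
            total += sum(sum(row) for row in ages) / (n_nodes * (n_nodes - 1))
            samples += 1

    return total / samples


if __name__ == "__main__":
    for n in (8, 64):
        print(n, "nodes: mean age", round(simulate_gossip(n, 200), 2))
```

In this toy model every node sends exactly one message per round regardless of group size, so comparing a small group with a large one gives a rough feel for how partitioning affects information age at a fixed per-node message rate. It does not reproduce the paper's formal expressions or its measurements.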