Open Access
Reliability assessment of cluster supercomputer configuration
Author(s) -
L. I. Kulbak,
O. P. Tchij,
N. N. Paramonov,
A. G. Rymarchuk,
T. S. Martinovich
Publication year - 2019
Publication title -
vescì nacyânalʹnaj akadèmìì navuk belarusì. seryâ fìzìka-tèhnìčnyh navuk
Language(s) - English
Resource type - Journals
eISSN - 2524-244X
pISSN - 1561-8358
DOI - 10.29235/1561-8358-2019-64-3-347-358
Subject(s) - supercomputer, cluster (spacecraft), reliability (semiconductor), computer science, component (thermodynamics), reliability engineering, parallel computing, operating system, engineering, physics, power (physics), quantum mechanics, thermodynamics
Reliability indicators were studied using the example of the “SKIF-GEO” cluster supercomputer configuration (hereafter, the cluster), developed within the scientific and technical program “SKIF-Nedra” (2015–2018, a Program of the Union State of Russia and Belarus). The cluster is a stationary supercomputer configuration designed to run resource-intensive applications in data processing centers (DPC). The computing platforms and other cluster modules are housed in a single 19″ rack 42U high. The theoretical peak performance of the cluster is 100 Tflop/s. The basic architectural principles implemented in the cluster, its composition, and its structural-functional scheme are given. Methodological support for calculating the reliability of the cluster, based on the authors' previous studies, is proposed. Drawing on these studies, the structural scheme of reliability (SSR) of the cluster is substantiated; it consists of two parts: the cluster core and the combination of computing facilities (nodes) (CCF). The cluster core includes the component parts (CP) of the cluster whose failure reduces performance to zero. The CCF includes the CP of the cluster whose failures reduce cluster performance. The choice of the main reliability indicators for the cluster core and the CCF is justified, and formulas for calculating these indicators are given. The consequences of failures of the cluster's component parts are analyzed. Based on this analysis, the SSR of the cluster core is determined, which makes it possible to derive a formula for calculating the reliability indicators of the cluster core. A mathematical model of reliability (a state graph) of the cluster's CCF is proposed, which makes it possible to derive formulas for calculating the mean time to failure and the mean time for a failure of the CCF.
The reliability of those cluster CP for which no trustworthy reliability data are available is estimated from the SSR of these CP. The reliability of the cluster as a whole is assessed by calculating reliability indicators from reference data on the reliability of components, as well as from operational data on supercomputers of the “SKIF” family. Using this assessment and the calculation formulas obtained, the cluster reliability indicators were calculated for two options: with and without a reserve of computing nodes. High values of the cluster reliability indicators were achieved thanks to the architectural and structural solutions adopted during its development, which are aimed at increasing its survivability.
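
The abstract does not reproduce the authors' formulas, so the following is only a minimal illustrative sketch of the two kinds of model it names, under assumptions that are not taken from the paper: exponentially distributed times to failure, a cluster core treated as a purely series structure, and a node pool modeled as a state graph without repair in which all nodes (including the reserve) are powered and fail at the same rate. The function names and the failure criterion for the node pool (fewer than the required number of operable nodes) are hypothetical.

```python
from math import exp


def series_failure_rate(rates):
    """Failure rate of a series structure: any component failure fails the whole.

    Assumes independent components with constant failure rates (1/hour).
    """
    return sum(rates)


def series_mttf(rates):
    """Mean time to failure of the series structure, in hours."""
    return 1.0 / series_failure_rate(rates)


def series_reliability(rates, t):
    """Probability that the series structure survives t hours without failure."""
    return exp(-series_failure_rate(rates) * t)


def node_pool_mttf(n_required, n_reserve, node_rate):
    """Mean time until fewer than n_required nodes remain operable.

    Assumed state graph (no repair): the pool starts with
    n_required + n_reserve working nodes; with i working nodes the
    next failure arrives at rate i * node_rate, so the mean time
    spent in that state is 1 / (i * node_rate).
    """
    n_total = n_required + n_reserve
    return sum(1.0 / (i * node_rate) for i in range(n_required, n_total + 1))


if __name__ == "__main__":
    # Hypothetical failure rates, chosen only to exercise the functions.
    core_rates = [1e-5, 2e-5, 2e-5]
    print("core MTTF, h:", series_mttf(core_rates))
    print("core R(1000 h):", series_reliability(core_rates, 1000.0))
    print("pool MTTF without reserve, h:", node_pool_mttf(30, 0, 1e-5))
    print("pool MTTF with 2 reserve nodes, h:", node_pool_mttf(30, 2, 1e-5))
```

Comparing `node_pool_mttf` with and without reserve nodes mirrors, in a very simplified way, the two options mentioned in the abstract; a model with repair would need a state graph with recovery transitions, which this sketch leaves out.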