Deep Analysis of Job State Statistics on Lomonosov-2 Supercomputer
Author(s) -
Dmitry Nikitenko,
Vadim Voevodin,
Sergey Zhumatiy
Publication year - 2018
Publication title -
supercomputing frontiers and innovations
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.375
H-Index - 16
eISSN - 2409-6008
pISSN - 2313-8734
DOI - 10.14529/jsfi180201
Subject(s) - supercomputer , computer science , job scheduler , scheduling (production processes) , processor scheduling , operating system , operations management , cloud computing , schedule , engineering
It is a common knowledge that the increasingly growing capabilities of HPC systems are always limited by a number of efficiency related issues. The reasons can be very different: hardware failures, incorrect job scheduling, peculiarities of algorithm, chosen programming technology specifics, etc. Most of these issues can be detected after precise analysis, but is a very resourceful way to study every application run. Therefore we performed less complicated analysis of the whole supercomputer job flow. In this paper we share our experience of analyzing user applications’ job states assigned by the SLURM resource manager that is used on the Lomonosov-2 system at Supercomputing center of Lomonosov Moscow State University. The statistics on job states was collected and it revealed that the ratio of correctly finished jobs (with the COMPLETED state) was rather low. The jobs owners were asked if the distribution of their jobs’ states is normal regarding their applications. This user feedback was processed, and some new ways of efficiency gain were revealed as the result.
Accelerating Research
Robert Robinson Avenue,
Oxford Science Park, Oxford
OX4 4GP, United Kingdom
Address
John Eccles HouseRobert Robinson Avenue,
Oxford Science Park, Oxford
OX4 4GP, United Kingdom