Premium
Redesigning the rCUDA communication layer for a better adaptation to the underlying hardware
Author(s) -
Reaño Carlos,
Silla Federico
Publication year - 2019
Publication title -
concurrency and computation: practice and experience
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.309
H-Index - 67
eISSN - 1532-0634
pISSN - 1532-0626
DOI - 10.1002/cpe.5481
Subject(s) - computer science , infiniband , virtualization , full virtualization , leverage (statistics) , storage virtualization , embedded system , bandwidth (computing) , remote direct memory access , supercomputer , operating system , cuda , computer architecture , computer hardware , cloud computing , computer network , machine learning
Summary The use of Graphics Processing Units (GPUs) has become a very popular way to accelerate the execution of many applications. However, GPUs are not exempt from side effects. For instance, GPUs are expensive devices which additionally consume a non‐negligible amount of energy even when they are not performing any computation. Furthermore, most applications present low GPU utilization. To address these concerns, the use of GPU virtualization has been proposed. In particular, remote GPU virtualization is a promising technology that allows applications to transparently leverage GPUs installed in any node of the cluster. In this paper, the remote GPU virtualization mechanism is comparatively analyzed across three different generations of GPUs. The first contribution of this study is an analysis about how the performance of the remote GPU virtualization technique is impacted by the underlying hardware. To that end, the Tesla K20, Tesla K40, and Tesla P100 GPUs along with FDR and EDR InfiniBand fabrics are used in the study. The analysis is performed in the context of the rCUDA middleware. It is clearly shown that the GPU virtualization middleware requires a comprehensive design of its communication layer, which should be perfectly adapted to every hardware generation in order to avoid a reduction in performance. This is precisely the second contribution of this work, ie, redesigning the rCUDA communication layer in order to improve the management of the underlying hardware. Results show that it is possible to improve bandwidth up to 29.43%, which translates into up to 4.81% average less execution time in the performance of the analyzed applications.