Optimizing execution for pipelined‐based distributed deep learning in a heterogeneously networked GPU cluster
Author(s) -
Zhang Jinghui,
Zhan Jun,
Li Jiange,
Jin Jiahui,
Qian Lei
Publication year - 2020
Publication title -
Concurrency and Computation: Practice and Experience
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.309
H-Index - 67
eISSN - 1532-0634
pISSN - 1532-0626
DOI - 10.1002/cpe.5923
Subject(s) - computer science , data parallelism , pipeline (software) , parallel computing , gpu cluster , speedup , workload , memory footprint , parallelism (grammar) , partition (number theory) , distributed computing , cuda , computer architecture , operating system , mathematics , combinatorics
Summary Training a deep neural network (DNN) demands exorbitant computing and memory resources, so researchers commonly turn to distributed parallel training on GPUs to train larger models faster. This approach has its drawbacks, though. On one hand, as GPU compute capability grows, inter‐GPU communication becomes an ever larger bottleneck during model training, and multi‐GPU systems introduce complex connectivity; workload schedulers must therefore account for hardware topology and communication requirements when allocating GPU resources, in order to optimize execution time and improve utilization in a heterogeneous environment. On the other hand, the high memory requirements of DNN training make it difficult to fit the training process on GPUs. To address these issues, we introduce two execution optimization methods based on pipeline‐hybrid parallelism (combining data and model parallelism) in a GPU cluster with heterogeneous networking. First, we propose a model partition algorithm that accelerates pipeline‐hybrid parallel training across GPUs connected by heterogeneous networks. Second, we introduce a cost‐balanced recomputing algorithm to reduce memory usage in pipeline mode. Experiments show that our solution (Pipe‐Torch) achieves an average speedup of 1.4× over data parallelism and reduces the memory footprint while maintaining load‐balanced pipelined training.
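The abstract only sketches the model partition idea. As a rough illustration (not the paper's actual algorithm), the Python snippet below shows how a dynamic-programming partition might balance per-stage compute time against the cost of sending boundary activations over heterogeneous inter-GPU links. All function names, layer costs, and bandwidths are hypothetical placeholders chosen for the example.

```python
# Hypothetical sketch: split a layer sequence into pipeline stages across GPUs
# connected by links of different bandwidths. Costs below are illustrative
# placeholders, not measurements from the paper.

def partition_layers(layer_times, activation_sizes, link_bandwidths):
    """Split consecutive layers into len(link_bandwidths) + 1 pipeline stages,
    minimizing the slowest stage (its compute time plus the time to ship the
    boundary activation over the link that follows it).

    layer_times       -- per-layer forward+backward time (ms)
    activation_sizes  -- size of each layer's output activation (MB)
    link_bandwidths   -- bandwidth of each inter-GPU link (MB/ms), one per cut
    """
    n, k = len(layer_times), len(link_bandwidths) + 1
    INF = float("inf")
    prefix = [0.0]
    for t in layer_times:
        prefix.append(prefix[-1] + t)

    # dp[s][i]: best achievable bottleneck when the first i layers form s stages
    dp = [[INF] * (n + 1) for _ in range(k + 1)]
    cut = [[-1] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for s in range(1, k + 1):
        for i in range(s, n + 1):
            for j in range(s - 1, i):
                stage = prefix[i] - prefix[j]          # compute time of layers j..i-1
                if s < k:                              # stage sends its boundary activation
                    stage += activation_sizes[i - 1] / link_bandwidths[s - 1]
                cand = max(dp[s - 1][j], stage)
                if cand < dp[s][i]:
                    dp[s][i], cut[s][i] = cand, j

    # Recover the stage boundaries (end index of each stage, exclusive).
    bounds, i = [], n
    for s in range(k, 0, -1):
        bounds.append(i)
        i = cut[s][i]
    return dp[k][n], list(reversed(bounds))

if __name__ == "__main__":
    # Toy example: 6 layers over 3 GPUs; the second link is much slower,
    # so the partition shifts work to avoid a large activation on that link.
    bottleneck, stage_ends = partition_layers(
        layer_times=[4.0, 6.0, 8.0, 8.0, 6.0, 4.0],
        activation_sizes=[32, 16, 8, 8, 16, 32],
        link_bandwidths=[10.0, 1.0],   # MB/ms: one fast link, one slow link
    )
    print("stage boundaries:", stage_ends, "bottleneck (ms):", bottleneck)
```

The same bottleneck objective could also fold in the memory saved by recomputing activations instead of storing them, which is the trade-off the cost-balanced recomputing algorithm in the paper targets; the sketch above deliberately omits that term to stay minimal.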