Optimizing execution for pipelined‐based distributed deep learning in a heterogeneously networked GPU cluster
Author(s) -
Zhang Jinghui,
Zhan Jun,
Li Jiange,
Jin Jiahui,
Qian Lei
Publication year - 2020
Publication title -
Concurrency and Computation: Practice and Experience
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.309
H-Index - 67
eISSN - 1532-0634
pISSN - 1532-0626
DOI - 10.1002/cpe.5923
Subject(s) - computer science , data parallelism , pipeline (software) , parallel computing , gpu cluster , speedup , workload , memory footprint , parallelism (grammar) , partition (number theory) , distributed computing , cuda , computer architecture , operating system , mathematics , combinatorics
Summary Training a deep neural network (DNN) demands exorbitant computing and memory resources, so researchers commonly turn to distributed parallel training on GPUs to train larger models faster. This approach has its drawbacks, though. On one hand, as GPU compute capability grows, inter‐GPU communication becomes an ever larger bottleneck during model training, and multi‐GPU systems introduce complex connectivity; workload schedulers must therefore account for hardware topology and communication requirements when allocating GPU resources, in order to optimize execution time and improve utilization in a heterogeneous environment. On the other hand, the high memory requirements of DNN training make it difficult to fit the training process on GPUs. To address these issues, we introduce two execution optimization methods based on pipeline‐hybrid parallelism (combining data and model parallelism) in a GPU cluster with heterogeneous networking. First, we propose a model partition algorithm that accelerates pipeline‐hybrid parallel training across GPUs connected by heterogeneous networks. Second, we introduce a cost‐balanced recomputing algorithm to reduce memory usage in pipeline mode. Experiments show that our solution (Pipe‐Torch) achieves an average speedup of 1.4× over data parallelism and reduces the memory footprint while maintaining load‐balanced pipelined training.
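The abstract only sketches the model partition idea. As a rough illustration (not the paper's actual algorithm), the Python snippet below shows how a dynamic-programming partition might balance per-stage compute time against the cost of sending boundary activations over heterogeneous inter-GPU links. All function names, layer costs, and bandwidths are hypothetical placeholders chosen for the example.

```python
# Hypothetical sketch: split a layer sequence into pipeline stages across GPUs
# connected by links of different bandwidths. Costs below are illustrative
# placeholders, not measurements from the paper.

def partition_layers(layer_times, activation_sizes, link_bandwidths):
    """Split consecutive layers into len(link_bandwidths) + 1 pipeline stages,
    minimizing the slowest stage (its compute time plus the time to ship the
    boundary activation over the link that follows it).

    layer_times       -- per-layer forward+backward time (ms)
    activation_sizes  -- size of each layer's output activation (MB)
    link_bandwidths   -- bandwidth of each inter-GPU link (MB/ms), one per cut
    """
    n, k = len(layer_times), len(link_bandwidths) + 1
    INF = float("inf")
    prefix = [0.0]
    for t in layer_times:
        prefix.append(prefix[-1] + t)

    # dp[s][i]: best achievable bottleneck when the first i layers form s stages
    dp = [[INF] * (n + 1) for _ in range(k + 1)]
    cut = [[-1] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for s in range(1, k + 1):
        for i in range(s, n + 1):
            for j in range(s - 1, i):
                stage = prefix[i] - prefix[j]          # compute time of layers j..i-1
                if s < k:                              # stage sends its boundary activation
                    stage += activation_sizes[i - 1] / link_bandwidths[s - 1]
                cand = max(dp[s - 1][j], stage)
                if cand < dp[s][i]:
                    dp[s][i], cut[s][i] = cand, j

    # Recover the stage boundaries (end index of each stage, exclusive).
    bounds, i = [], n
    for s in range(k, 0, -1):
        bounds.append(i)
        i = cut[s][i]
    return dp[k][n], list(reversed(bounds))

if __name__ == "__main__":
    # Toy example: 6 layers over 3 GPUs; the second link is much slower,
    # so the partition shifts work to avoid a large activation on that link.
    bottleneck, stage_ends = partition_layers(
        layer_times=[4.0, 6.0, 8.0, 8.0, 6.0, 4.0],
        activation_sizes=[32, 16, 8, 8, 16, 32],
        link_bandwidths=[10.0, 1.0],   # MB/ms: one fast link, one slow link
    )
    print("stage boundaries:", stage_ends, "bottleneck (ms):", bottleneck)
```

The same bottleneck objective could also fold in the memory saved by recomputing activations instead of storing them, which is the trade-off the cost-balanced recomputing algorithm in the paper targets; the sketch above deliberately omits that term to stay minimal.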