CPU/GPU COMPUTING FOR AN IMPLICIT MULTI-BLOCK COMPRESSIBLE NAVIER-STOKES SOLVER ON HETEROGENEOUS PLATFORM

Liang Deng; Hanli Bai; Fang Wang; Qing-Xin Xu

Open Access

Publication year - 2016

Publication title - International Journal of Modern Physics: Conference Series

Language(s) - English

Resource type - Journals

ISSN - 2010-1945

DOI - 10.1142/s2010194516601630

Subject(s) - computer science, parallel computing, CUDA, solver, scalability, Xeon, thread (computing), general purpose computing on graphics processing units, Xeon Phi, multi core processor, computational science, block (permutation group theory), computation, algorithm, graphics, computer graphics (images), operating system, geometry, mathematics, programming language

CPU/GPU computing allows scientists to accelerate their numerical codes substantially. In this paper, we port a double-precision alternating direction implicit (ADI) solver for the three-dimensional compressible Navier-Stokes equations, taken from our in-house Computational Fluid Dynamics (CFD) software, to a heterogeneous platform and optimize it there. First, we implement a full GPU version of the ADI solver to eliminate redundant data transfers between CPU and GPU, and then design two fine-grain schemes, namely “one-thread-one-point” and “one-thread-one-line”, to maximize performance. Second, we present a dual-level parallelization scheme based on a CPU/GPU collaborative model that exploits the computational resources of both the multi-core CPUs and the many-core GPUs within the heterogeneous platform. Finally, because the memory of a single node becomes inadequate as the simulation size grows, we present a tri-level hybrid programming pattern, MPI-OpenMP-CUDA, that combines fine-grain parallelism using OpenMP and CUDA threads with coarse-grain parallelism using MPI for inter-node communication. We also propose a strategy to overlap computation with communication using advanced features of CUDA and MPI. We obtain a speedup of 6.0 for the ADI solver on one Tesla M2050 GPU relative to two Xeon X5670 CPUs. Scalability tests show that our implementation offers significant performance improvement on the heterogeneous platform.
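
The two fine-grain mappings named in the abstract can be illustrated with a small CPU-side emulation. This is a sketch under stated assumptions, not the authors' implementation: the abstract does not describe the line solver, so a scalar tridiagonal system solved with the Thomas algorithm stands in for one implicit ADI sweep, and the function names (`thomas_solve`, `one_thread_one_point`, `one_thread_one_line`) are hypothetical. Each loop iteration plays the role that one GPU thread would have in a CUDA kernel.

```python
# Illustrative emulation only: names and the scalar tridiagonal model are
# assumptions, not taken from the paper.

def thomas_solve(lower, diag, upper, rhs):
    """Solve one tridiagonal system with the Thomas algorithm.

    lower[0] and upper[-1] lie outside the matrix and do not affect the
    result. The sweep is sequential along the line, which is why a whole
    line is given to a single thread under "one-thread-one-line".
    """
    n = len(rhs)
    c = [0.0] * n  # modified upper diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):  # forward elimination
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def one_thread_one_point(field, scale):
    """Pointwise update: every grid point is independent of the others,
    so each (j, i) entry could be handled by its own thread."""
    return [[scale * value for value in row] for row in field]


def one_thread_one_line(rhs_lines, lower, diag, upper):
    """Implicit sweep: each grid line is an independent tridiagonal solve,
    so each iteration of this loop stands in for one thread."""
    return [thomas_solve(lower, diag, upper, line) for line in rhs_lines]


# Small model problem: two grid lines sharing the same coefficients.
lower = [0.0, -1.0, -1.0, -1.0]
diag = [4.0, 4.0, 4.0, 4.0]
upper = [-1.0, -1.0, -1.0, 0.0]
rhs_lines = [[2.0, 4.0, 6.0, 13.0], [3.0, 2.0, 2.0, 3.0]]

scaled = one_thread_one_point(rhs_lines, 2.0)
solution = one_thread_one_line(rhs_lines, lower, diag, upper)
```

The design point the emulation makes is about data dependence: pointwise work parallelizes over every grid point, whereas an implicit sweep carries a recurrence along the line, so the natural unit of parallel work there is the line rather than the point.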
