Vectorization and Parallelization of the Adaptive Mesh Refinement N-Body Code
Author(s) - Hideki Yahagi
Publication year - 2005
Publication title - Publications of the Astronomical Society of Japan
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.99
H-Index - 110
eISSN - 2053-051X
pISSN - 0004-6264
DOI - 10.1093/pasj/57.5.779
Subject(s) - parallel computing , vectorization (mathematics) , computer science , supercomputer , adaptive mesh refinement , polygon mesh , code (set theory) , loop (graph theory) , data structure , distributed memory , automatic parallelization , shared memory , computational science , physics , algorithm , compiler , computer graphics (images) , programming language , mathematics , set (abstract data type) , combinatorics
In this paper, we describe our vectorized and parallelized adaptive mesh refinement (AMR) N-body code with shared time steps, and report its performance on a Fujitsu VPP5000 vector-parallel supercomputer. Our AMR N-body code places hierarchical meshes recursively where higher resolution is required, and the time step of all particles is the same. The parts that are the most difficult to vectorize are loops that access the mesh data and particle data. We vectorized such parts by changing the loop structure so that the innermost loop steps through the cells instead of the particles in each cell; in other words, by changing the loop order from depth-first to breadth-first. Mass assignment is also vectorizable using this loop-order exchange and splitting the loop into $2^{N_{dim}}$ loops, if the cloud-in-cell scheme is adopted. Here, $N_{dim}$ is the number of dimensions. These vectorization schemes, which eliminate the unvectorized loops, are also applicable to the parallelization of loops for shared-memory multiprocessors. We also parallelized our code for distributed-memory machines. The key part of parallelization is data decomposition. We sorted the hierarchical mesh data in the Morton order, or the recursive N-shaped order, level by level, then split and allocated the mesh data to the processors. Particles are allocated to the processor to which the finest refined cells containing those particles are assigned. Our timing analysis using $\Lambda$-dominated cold dark matter simulations shows that our parallel code speeds up almost ideally up to 32 processors, the largest number of processors in our test.

Comment: 21 pages, 16 figures, to be published in PASJ (Vol. 57, No. 5, Oct. 2005)
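The Morton-order data decomposition described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: cell indices are interleaved bit by bit into a Z-order (Morton) key, cells are sorted by that key, and contiguous chunks of the sorted list are assigned to processors. All function and variable names here are hypothetical.

```python
# Illustrative sketch of Morton-order domain decomposition (not the
# paper's code). Cells are identified by integer (ix, iy, iz) indices.

def morton_key(ix: int, iy: int, iz: int, bits: int = 10) -> int:
    """Interleave the bits of three cell indices into a Morton (Z-order) key."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (3 * b)      # x bit -> position 3b
        key |= ((iy >> b) & 1) << (3 * b + 1)  # y bit -> position 3b + 1
        key |= ((iz >> b) & 1) << (3 * b + 2)  # z bit -> position 3b + 2
    return key

def decompose(cells, n_proc):
    """Sort cells (index triples) by Morton key and split the sorted list
    into n_proc nearly equal contiguous chunks, one per processor."""
    ordered = sorted(cells, key=lambda c: morton_key(*c))
    chunk = -(-len(ordered) // n_proc)  # ceiling division
    return [ordered[i * chunk:(i + 1) * chunk] for i in range(n_proc)]
```

Because the Morton curve is recursive, cells that are adjacent along the key tend to be close in space, so each contiguous chunk forms a spatially compact domain; the paper applies this ordering level by level across the mesh hierarchy, and particles follow the finest refined cells that contain them.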