Vectorization and Parallelization of the Adaptive Mesh Refinement N-Body Code
Author(s) - Hideki Yahagi
Publication year - 2005
Publication title - Publications of the Astronomical Society of Japan
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.99
H-Index - 110
eISSN - 2053-051X
pISSN - 0004-6264
DOI - 10.1093/pasj/57.5.779
Subject(s) - parallel computing , vectorization (mathematics) , computer science , supercomputer , adaptive mesh refinement , polygon mesh , code (set theory) , loop (graph theory) , data structure , distributed memory , automatic parallelization , shared memory , computational science , physics , algorithm , compiler , computer graphics (images) , programming language , mathematics , set (abstract data type) , combinatorics
In this paper, we describe our vectorized and parallelized adaptive mesh refinement (AMR) N-body code with shared time steps, and report its performance on a Fujitsu VPP5000 vector-parallel supercomputer. Our AMR N-body code places hierarchical meshes recursively where higher resolution is required, and the time step of all particles is the same. The parts that are the most difficult to vectorize are loops that access the mesh data and particle data. We vectorized such parts by changing the loop structure so that the innermost loop steps through the cells instead of the particles in each cell; in other words, by changing the loop order from depth-first to breadth-first. Mass assignment is also vectorizable using this loop-order exchange and splitting the loop into $2^{N_{dim}}$ loops, if the cloud-in-cell scheme is adopted. Here, $N_{dim}$ is the number of dimensions. These vectorization schemes, which eliminate the unvectorized loops, are also applicable to the parallelization of loops for shared-memory multiprocessors. We also parallelized our code for distributed-memory machines. The key part of parallelization is data decomposition. We sorted the hierarchical mesh data in the Morton order, or the recursive N-shaped order, level by level, then split and allocated the mesh data to the processors. Particles are allocated to the processor to which the finest refined cells containing those particles are assigned. Our timing analysis using $\Lambda$-dominated cold dark matter simulations shows that our parallel code speeds up almost ideally up to 32 processors, the largest number of processors in our test.

Comment: 21 pages, 16 figures, to be published in PASJ (Vol. 57, No. 5, Oct. 2005)
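The Morton-order data decomposition described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: cell indices are interleaved bit by bit into a Z-order (Morton) key, cells are sorted by that key, and contiguous chunks of the sorted list are assigned to processors. All function and variable names here are hypothetical.

```python
# Illustrative sketch of Morton-order domain decomposition (not the
# paper's code). Cells are identified by integer (ix, iy, iz) indices.

def morton_key(ix: int, iy: int, iz: int, bits: int = 10) -> int:
    """Interleave the bits of three cell indices into a Morton (Z-order) key."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (3 * b)      # x bit -> position 3b
        key |= ((iy >> b) & 1) << (3 * b + 1)  # y bit -> position 3b + 1
        key |= ((iz >> b) & 1) << (3 * b + 2)  # z bit -> position 3b + 2
    return key

def decompose(cells, n_proc):
    """Sort cells (index triples) by Morton key and split the sorted list
    into n_proc nearly equal contiguous chunks, one per processor."""
    ordered = sorted(cells, key=lambda c: morton_key(*c))
    chunk = -(-len(ordered) // n_proc)  # ceiling division
    return [ordered[i * chunk:(i + 1) * chunk] for i in range(n_proc)]
```

Because the Morton curve is recursive, cells that are adjacent along the key tend to be close in space, so each contiguous chunk forms a spatially compact domain; the paper applies this ordering level by level across the mesh hierarchy, and particles follow the finest refined cells that contain them.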