Premium
C2CU: a CUDA C program generator for bulk execution of a sequential algorithm
Author(s) -
Takafuji Daisuke,
Nakano Koji,
Ito Yasuaki,
Bordim Jacir
Publication year - 2016
Publication title -
concurrency and computation: practice and experience
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.309
H-Index - 67
eISSN - 1532-0634
pISSN - 1532-0626
DOI - 10.1002/cpe.4022
Subject(s) - computer science , cuda , parallel computing , sorting algorithm , sorting , algorithm , matrix multiplication , generator (circuit theory) , computation , execution time , power (physics) , physics , quantum mechanics , quantum
Summary Several important tasks, including matrix computation, signal processing, sorting, dynamic programming, encryption, and decryption, can be performed by oblivious sequential algorithms. A sequential algorithm is oblivious if an address accessed at each time does not depend on the input data. A bulk execution of a sequential algorithm is to execute it for many independent inputs in turn or in parallel. A number of works have been devoted to design and implement parallel algorithms for a single input. However, none of these works evaluated the bulk execution performance of these algorithms. The first contribution of this paper is to present a time‐optimal implementation for bulk execution of an oblivious sequential algorithm. Our second contribution is to develop a tool, named C2CU, which automatically generates a CUDA C program for a bulk execution of an oblivious sequential algorithm. The C2CU has been used to generate CUDA C programs for the bulk execution of the bitonic sorting, Floyd‐Warshall, and Montgomery modulo multiplication algorithms. Compared to a sequential implementation on a single CPU, the generated CUDA C programs for the above algorithms run, respectively, 199, 54, and 78 times faster.