Research and implementation of a high performance parallel computing digital down converter on graphics processing unit
Author(s) - Shao Guolin, Chen Xingshu, Yang Lu
Publication year - 2016
Publication title - Concurrency and Computation: Practice and Experience
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.309
H-Index - 67
eISSN - 1532-0634
pISSN - 1532-0626
DOI - 10.1002/cpe.4042
Subject(s) - computer science , decimation , cuda , graphics processing unit , integrator , parallel computing , graphics , finite impulse response , kernel (algebra) , general purpose computing on graphics processing units , computer hardware , speedup , key (lock) , filter (signal processing) , computational science , algorithm , computer graphics (images) , operating system , bandwidth (computing) , telecommunications , mathematics , combinatorics , computer vision
Summary The digital down converter (DDC) is a time‐intensive and data‐intensive computing task and is considered a key technology in software‐defined radio. This paper proposes a high‐performance implementation of the DDC on a graphics processing unit (GPU) using CUDA. The DDC is composed of a numerically controlled oscillator (NCO) stage, a cascaded integrator‐comb (CIC) decimation filter stage, and a finite impulse response (FIR) filter stage. The GPU implementation and optimization of all the stages are studied in detail. Additionally, to handle long‐duration signals, the signal data sequence is split into segments; the overlap‐save and overlap‐add mechanisms are applied in the CIC stage and the FIR stage, respectively. Finally, experiments were conducted to evaluate the performance of the GPU‐based DDC against a sequential CPU implementation and an OpenMP implementation (16 threads). Experimental results demonstrate that the DDC achieves significant improvements on the GPU: the maximum speedups in the NCO stage, CIC stage, and FIR stage exceed 1242, 527, and 179 times, respectively, including data transfer, kernel execution, and other processing operations, and the overall speedup of the DDC exceeds 180 times. Meanwhile, the speedups of the GPU implementation are far above those of the OpenMP implementation (about 2.5‐6.4 times).
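The three stages named in the summary can be illustrated with a small sequential reference in Python. This is only a sketch under stated assumptions, not the paper's CUDA kernels: the function names (`nco_mix`, `cic_decimate`, `fir_filter`, `fir_overlap_add`, `ddc`), the default of three CIC stages, and the differential delay of 1 are all hypothetical choices made for illustration. The overlap‐add helper shows one common way to filter a long signal segment by segment; the overlap‐save handling of the CIC stage is not shown.

```python
import cmath
import math


def nco_mix(samples, freq, sample_rate):
    """NCO stage: multiply by exp(-j*2*pi*freq*n/sample_rate) to shift freq to baseband."""
    step = -2.0 * math.pi * freq / sample_rate
    return [x * cmath.exp(1j * step * n) for n, x in enumerate(samples)]


def cic_decimate(samples, rate, stages=3):
    """CIC stage: `stages` integrators, downsample by `rate`, then `stages` combs.

    Assumes a differential delay of 1 and normalises by the DC gain rate**stages.
    """
    data = list(samples)
    for _ in range(stages):          # integrators run at the input rate
        acc, out = 0, []
        for x in data:
            acc += x
            out.append(acc)
        data = out
    data = data[::rate]              # keep every `rate`-th sample
    for _ in range(stages):          # combs run at the decimated rate
        prev, out = 0, []
        for x in data:
            out.append(x - prev)
            prev = x
        data = out
    gain = rate ** stages
    return [x / gain for x in data]


def fir_filter(samples, taps):
    """FIR stage: full linear convolution of `samples` with `taps`."""
    out = [0] * (len(samples) + len(taps) - 1)
    for n, x in enumerate(samples):
        for k, h in enumerate(taps):
            out[n + k] += x * h
    return out


def fir_overlap_add(samples, taps, segment_len):
    """Filter one segment at a time and add each segment's tail into the next."""
    out = [0] * (len(samples) + len(taps) - 1)
    for start in range(0, len(samples), segment_len):
        block = fir_filter(samples[start:start + segment_len], taps)
        for i, y in enumerate(block):
            out[start + i] += y
    return out


def ddc(samples, freq, sample_rate, rate, taps, stages=3):
    """Chain the three stages: NCO mixing, CIC decimation, FIR filtering."""
    mixed = nco_mix(samples, freq, sample_rate)
    decimated = cic_decimate(mixed, rate, stages)
    return fir_filter(decimated, taps)
```

On a GPU, each of these loops would be restructured so that many output samples are computed in parallel; the sequential form above is meant only to make the data flow between the stages explicit. Segmenting with overlap‐add gives the same result as filtering the whole sequence at once, which is what allows independent segments to be processed concurrently.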