Efficient parallel implementation of three‐point viterbi decoding algorithm on CPU, GPU, and FPGA | Zendy

Li Rongchun | Zendy; Dou Yong | Zendy; Zou Dan | Zendy

AI Assistant Blog Pricing

Home ZAIA Blog

Premium

Efficient parallel implementation of three‐point viterbi decoding algorithm on CPU, GPU, and FPGA

Author(s) -

Li Rongchun,

Dou Yong,

Zou Dan

Publication year - 2013

Publication title -

concurrency and computation: practice and experience

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.309

H-Index - 67

eISSN - 1532-0634

pISSN - 1532-0626

DOI - 10.1002/cpe.3093

Subject(s) - computer science , speedup , parallel computing , viterbi decoder , field programmable gate array , viterbi algorithm , central processing unit , convolutional code , throughput , multi core processor , decoding methods , computer hardware , algorithm , wireless , telecommunications

SUMMARY In wireless communication, Viterbi decoding algorithm (VDA) is the one of most popular channel decoding algorithms, which is widely used in WLAN, WiMAX, or 3G communications. However, the throughput of Viterbi decoder is constrained by the convolutional characteristic. Recently, the three‐point VDA (TVDA) was proposed to solve this problem. In TVDA, the whole procedure can be divided into three phases, the forward, trace‐back, and decoding phases. In this paper, we analyze the parallelism of TVDA and propose parallel TVDA on the multi‐core CPU, graphics processing unit (GPU), and field programmable gate array (FPGA). We demonstrate approaches that fully exploit its performance potential on CPU, GPU, and FPGA computing platforms. For CPU platforms, we perform two optimization methods, single instruction multiple data and multithreading to gain over 145 × speedup over the naive CPU version on a quad‐core CPU platform. For GPU platforms, we propose the combination of cached memory optimization, coalesced global memory accesses, codeword packing scheme, and asynchronous data transition, achieving the throughput of 404.65 Mbps and 12 × speedup over initial GPU versions on an NVIDIA GeForce GTX580 card and 7 × speedup over Intel quad‐core CPU i5‐2300, under the same manufacturing year and both with fully optimized schemes. In addition, for FPGA platforms, we customize a radix‐4 pipelined architecture for the TVDA in a 45‐nm FPGA chip from Xilinx (XC6VLX760). Under 209.15‐MHz clock rate, it achieves a throughput of 418.30 Mbps. Finally, we also discuss the performance evaluation and efficiency comparison of different flexible architectures for real‐time Viterbi decoding in terms of the decoding throughput, power consumption, optimization schemes, programming costs, and price costs.Copyright © 2013 John Wiley & Sons, Ltd.

This content is not available in your region!

Continue researching here.

Having issues? You can contact us here

Accelerating Research