Open Access

Performance Study of LU Decomposition on the Programmable GPU

Author(s) - Fumihiko Ino, Manabu Matsui, Keigo Goda, Kenichi Hagihara

Publication year - 2005

Publication title - Lecture Notes in Computer Science

Language(s) - English

Resource type - Book series

SCImago Journal Rank - 0.249

H-Index - 400

eISSN - 1611-3349

pISSN - 0302-9743

ISBN - 3-540-30936-5

DOI - 10.1007/11602569_13

Subject(s) - computer science, parallel computing, lu decomposition, computation, graphics, floating point, matrix multiplication, cuda, general purpose computing on graphics processing units, multiplication (music), graphics processing unit, decomposition, memory bandwidth, cache, central processing unit, computational science, matrix decomposition, computer hardware, computer graphics (images), algorithm, ecology, eigenvalues and eigenvectors, physics, quantum mechanics, acoustics, quantum, biology

With the increasing programmability of graphics processing units (GPUs), these units are emerging as an attractive computing platform not only for traditional graphics computation but also for general-purpose computation. In this paper, to study the performance of programmable GPUs, we describe the design and implementation of LU decomposition as an example of numerical computation. To this end, we have developed and evaluated several methods that differ in their implementation approach to (a) loop processing, (b) branch processing, and (c) vector processing. The experimental results yield four main findings: (1) dependent loops must be implemented using a render texture in order to avoid copies in the video random access memory (VRAM); (2) in most cases, branch processing can be handled efficiently by the CPU rather than by the GPU; (3) as Fatahalian et al. state for matrix multiplication, GPUs require higher VRAM cache bandwidth in order to provide full performance for LU decomposition; and (4) decomposition results obtained on GPUs usually differ from those obtained on CPUs, mainly because floating-point division error increases the numerical error as the decomposition progresses.
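To make the abstract's terms concrete, the following is a minimal CPU-side sketch of LU decomposition without pivoting. It is not the authors' GPU implementation. The names `to_f32`, `lu_decompose`, and `max_residual` are hypothetical, and single precision is emulated by rounding every intermediate result to IEEE binary32 through the standard `struct` module. The outer loop over `k` is a dependent loop in the abstract's sense, since step `k` reads the values written by step `k - 1`. The sketch shows only the effect of reduced precision and does not model any GPU-specific division behaviour.

```python
import struct


def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def lu_decompose(a, rnd=lambda x: x):
    """In-place style LU decomposition without pivoting (assumes nonzero pivots).

    Returns one matrix m holding U on and above the diagonal and the
    multipliers of a unit lower-triangular L below it. `rnd` is applied
    after every arithmetic operation to emulate a narrower float format.
    """
    n = len(a)
    m = [[rnd(v) for v in row] for row in a]
    for k in range(n - 1):              # dependent loop: step k uses step k-1
        for i in range(k + 1, n):
            m[i][k] = rnd(m[i][k] / m[k][k])      # the division step
            for j in range(k + 1, n):
                m[i][j] = rnd(m[i][j] - rnd(m[i][k] * m[k][j]))
    return m


def max_residual(a, m):
    """Largest |A - L*U| entry, with L*U accumulated in double precision."""
    n = len(a)
    worst = 0.0
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if k == i else m[i][k]  # unit diagonal of L
                s += l_ik * m[k][j]
            worst = max(worst, abs(a[i][j] - s))
    return worst


# Diagonally dominant test matrix, so skipping pivoting is safe here.
n = 8
a = [[1.0 / (i + j + 1) + (n if i == j else 0.0) for j in range(n)]
     for i in range(n)]

res64 = max_residual(a, lu_decompose(a))
res32 = max_residual(a, lu_decompose(a, rnd=to_f32))
print("double-precision residual:", res64)
print("emulated single-precision residual:", res32)
```

Comparing the two residuals gives a rough feel for how much of the CPU/GPU discrepancy could come from precision alone. Any additional error from a particular GPU's division unit would have to be measured on that hardware.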
