cuThomasBatch and cuThomasVBatch, CUDA Routines to compute batch of tridiagonal systems on NVIDIA GPUs
Author(s) -
Valero-Lara, Pedro,
Martínez-Pérez, Ivan,
Sirvent, Raül,
Martorell, Xavier,
Peña, Antonio J.
Publication year - 2018
Publication title -
Concurrency and Computation: Practice and Experience
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.309
H-Index - 67
eISSN - 1532-0634
pISSN - 1532-0626
DOI - 10.1002/cpe.4909
Subject(s) - parallel computing , computer science , cuda , tridiagonal matrix , scalability , computation , thread (computing) , implementation , general purpose computing on graphics processing units , speedup , block (permutation group theory) , block size , exploit , shared memory , computational science , algorithm , graphics , mathematics , key (lock) , programming language , eigenvalues and eigenvectors , physics , computer security , quantum mechanics , computer graphics (images) , geometry , database
Summary Solving tridiagonal systems is one of the most computationally expensive parts of many applications, and multiple studies have explored the use of NVIDIA GPUs to accelerate such computation. However, these studies have mainly focused on parallel algorithms that efficiently exploit shared memory and can saturate the GPU's capacity with a low number of systems, but which scale poorly when dealing with a relatively high number of systems. The gtsvStridedBatch routine in the cuSPARSE NVIDIA package is one such example, and it is used as the reference in this article. We propose a new implementation ( cuThomasBatch ) based on the Thomas algorithm. Unlike the other algorithms, the Thomas algorithm is sequential, so a coarse‐grained approach is implemented in which one CUDA thread solves a complete tridiagonal system, instead of one CUDA block as in gtsvStridedBatch . To achieve good scalability with this approach, it is necessary to transform the way the inputs are stored in memory so as to exploit coalescence (contiguous threads accessing contiguous memory locations). Different variants of this data transformation are explored in detail. We also explore some variants for the variable-batch case, in which the systems of the batch have different sizes ( cuThomasVBatch ). The results of this study show that the implementations carried out in this work beat the reference code, being up to 5× (in double precision) and 6× (in single precision) faster on the latest NVIDIA GPU architecture, the Pascal P100.
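To make the coarse-grained idea concrete, the following is a minimal sketch (not the authors' code) of the Thomas algorithm for one system of a batch whose coefficients are stored in an interleaved layout: element i of system `sys` lives at index `i * numSystems + sys`. On a GPU, assigning one thread per system then makes contiguous threads touch contiguous addresses (coalesced access); here the per-thread work is shown as a plain C function, with the names `thomas_interleaved`, `dl`/`d`/`du`, and `numSystems` all assumptions for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Solve one tridiagonal system of size n from a batch of numSystems systems,
 * all stored interleaved: element i of system `sys` is at i*numSystems + sys.
 * dl = lower diagonal (dl[0] unused), d = main diagonal, du = upper diagonal
 * (du[n-1] unused), b = right-hand side, overwritten with the solution.
 * This is the classic Thomas forward elimination + back substitution. */
void thomas_interleaved(const double *dl, const double *d, const double *du,
                        double *b, int n, int numSystems, int sys)
{
    /* scratch for the modified upper diagonal and right-hand side */
    double *cp = malloc(n * sizeof *cp);
    double *bp = malloc(n * sizeof *bp);

    cp[0] = du[sys] / d[sys];
    bp[0] = b[sys] / d[sys];
    for (int i = 1; i < n; ++i) {
        int k = i * numSystems + sys;        /* interleaved index */
        double m = d[k] - dl[k] * cp[i - 1]; /* pivot after elimination */
        cp[i] = du[k] / m;
        bp[i] = (b[k] - dl[k] * bp[i - 1]) / m;
    }
    /* back substitution, writing the solution over b */
    b[(n - 1) * numSystems + sys] = bp[n - 1];
    for (int i = n - 2; i >= 0; --i)
        b[i * numSystems + sys] = bp[i] - cp[i] * b[(i + 1) * numSystems + sys];

    free(cp);
    free(bp);
}
```

In a CUDA kernel this function body would run once per thread with `sys = blockIdx.x * blockDim.x + threadIdx.x`, which is where the interleaved layout pays off: at each step i, all active threads read addresses `i * numSystems + sys` that are adjacent for adjacent `sys`.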