PoLAPACK: parallel factorization routines with algorithmic blocking

Author(s) - Choi Jaeyoung

Publication year - 2001

Publication title - Concurrency and Computation: Practice and Experience

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.309

H-Index - 67

eISSN - 1532-0634

pISSN - 1532-0626

DOI - 10.1002/cpe.589

Subject(s) - cholesky decomposition, computer science, parallel computing, block (permutation group theory), block size, sparse matrix, blocking (statistics), incomplete cholesky factorization, lu decomposition, factorization, computation, speedup, matrix decomposition, algorithm, mathematics, eigenvalues and eigenvectors, computer network, physics, geometry, computer security, quantum mechanics, key (lock), gaussian

LU, QR, and Cholesky factorizations are the most widely used methods for solving dense linear systems of equations, and have been extensively studied and implemented on vector and parallel computers. Most of these factorization routines are implemented with block-partitioned algorithms so that they can perform matrix–matrix operations, that is, obtain the highest performance by maximizing reuse of data in the upper levels of the memory hierarchy, such as cache. Because parallel computers differ in their ratios of computation to communication performance, the computational block size that maximizes an algorithm's performance differs from machine to machine. Therefore, the data matrix should be distributed with the machine-specific optimal block size before the computation. A block size that is too small or too large makes good performance on a machine nearly impossible to achieve; in such a case, obtaining better performance may require a complete redistribution of the data matrix. In this paper, we present parallel LU, QR, and Cholesky factorization routines with ‘algorithmic blocking’ on a two-dimensional block cyclic data distribution. With algorithmic blocking, it is possible to obtain near-optimal performance irrespective of the physical block size. The routines are implemented on the Intel Paragon and the SGI/Cray T3E and compared with the corresponding ScaLAPACK factorization routines. Copyright © 2001 John Wiley & Sons, Ltd.
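To make the two-dimensional block cyclic distribution concrete, the following is a minimal sketch of how a global matrix entry maps to a process and a local index. It assumes zero-based indices and the common convention that the first block lives on process (0, 0). The function names `block_cyclic_1d` and `block_cyclic_2d` are hypothetical and are not taken from PoLAPACK or ScaLAPACK.

```python
def block_cyclic_1d(g, nb, p):
    """Map a global index g along one dimension to (process, local index).

    Assumptions (illustrative only): zero-based indices, block size nb,
    p processes along this dimension, first block owned by process 0.
    """
    block = g // nb             # index of the global block containing g
    proc = block % p            # blocks are dealt out to processes cyclically
    local_block = block // p    # how many earlier blocks this process holds
    local = local_block * nb + g % nb
    return proc, local


def block_cyclic_2d(i, j, mb, nb, prow, pcol):
    """Map global entry (i, j) to ((process row, process col), (local i, local j))."""
    pr, li = block_cyclic_1d(i, mb, prow)
    pc, lj = block_cyclic_1d(j, nb, pcol)
    return (pr, pc), (li, lj)


if __name__ == "__main__":
    # Columns 0..7 with physical block size 2 on 3 process columns.
    print([block_cyclic_1d(j, 2, 3)[0] for j in range(8)])
    # → [0, 0, 1, 1, 2, 2, 0, 0]

    # Entry (5, 9) with 2 x 2 blocks on a 2 x 3 process grid.
    print(block_cyclic_2d(5, 9, 2, 2, 2, 3))
    # → ((0, 1), (3, 3))
```

In this mapping the physical block size (`mb`, `nb`) fixes where each entry is stored. The abstract's point is that the block size used for computation does not have to be tied to this storage choice.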
