Open Access
Fast sparse matrix-vector multiplication by exploiting variable block structure
Author(s) - Richard Vuduc, Hyun Jin Moon
Publication year - 2005
Publication title - OSTI OAI (U.S. Department of Energy Office of Scientific and Technical Information)
Language(s) - English
Resource type - Reports
DOI - 10.2172/891708
Subject(s) - sparse matrix , sparse matrix-vector multiplication , computer science , parallel computing , data structure , algorithm , block matrix , memory bandwidth , block size , cache , register tiling , performance tuning , supercomputer
We improve the performance of sparse matrix-vector multiply (SpMV) on modern cache-based superscalar machines when the matrix structure consists of multiple, irregularly aligned rectangular blocks. Matrices from finite element modeling applications often have this kind of structure. Our technique splits the matrix, A, into a sum, A = A_1 + A_2 + ... + A_s, where each term is stored in a new data structure, unaligned block compressed sparse row (UBCSR) format. The classical alternative approach of storing A in a block compressed sparse row (BCSR) format yields limited performance gains because it imposes a particular alignment of the matrix non-zero structure, leading to extra work from explicitly padded zeros. Combining splitting and UBCSR reduces this extra work while retaining the generally lower memory bandwidth requirements and register-level tiling opportunities of BCSR. Using application test matrices, we show empirically that speedups can be as high as 2.1x over not blocking at all, and as high as 1.8x over the standard BCSR implementation used in prior work. When performance does not improve, split UBCSR can still significantly reduce matrix storage. Through extensive experiments, we further show that the empirically optimal number of splittings s and the block size for each matrix term A_i will in practice depend on the matrix and hardware platform. Our data lay a foundation for future development of fully automated methods for tuning these parameters.
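The core idea of the abstract can be sketched in a few lines. The following is a minimal, illustrative Python model (not the authors' actual UBCSR implementation, whose names and layout are assumptions here): each split term A_i is a list of small dense blocks anchored at arbitrary row/column offsets, so blocks need not align to a fixed grid and no explicit zero padding is required, and SpMV accumulates y += A_i x term by term.

```python
def split_ubcsr_spmv(terms, x, n_rows):
    """Compute y = (A_1 + ... + A_s) x.

    terms  : list of s block lists; each block is (r0, c0, dense), where
             dense is an r-by-c list of lists anchored at row r0, column c0.
             Offsets are unconstrained, mimicking UBCSR's unaligned blocks.
    x      : source vector (list of floats).
    n_rows : number of rows in A.
    """
    y = [0.0] * n_rows
    for blocks in terms:                    # one pass per split term A_i
        for r0, c0, dense in blocks:
            for i, row in enumerate(dense):
                acc = 0.0
                for j, v in enumerate(row):
                    acc += v * x[c0 + j]    # contiguous reuse of x in a block
                y[r0 + i] += acc
    return y

# A = [[1, 2, 0, 0],
#      [3, 4, 0, 0],
#      [0, 0, 5, 6],
#      [0, 0, 7, 8]] stored as two terms, each holding one 2x2 dense block.
terms = [
    [(0, 0, [[1.0, 2.0], [3.0, 4.0]])],
    [(2, 2, [[5.0, 6.0], [7.0, 8.0]])],
]
x = [1.0, 1.0, 1.0, 1.0]
print(split_ubcsr_spmv(terms, x, 4))  # [3.0, 7.0, 11.0, 15.0]
```

In a conventional BCSR layout with a fixed block grid, a block whose true non-zeros start at an odd offset must be stored as one or more grid-aligned blocks padded with explicit zeros, which are multiplied like any other entry; letting each term carry its own block size and unaligned anchors, as above, is what removes that wasted work.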