Emmerald: a fast matrix–matrix multiply using Intel's SSE instructions

Author(s) - Aberdeen Douglas, Baxter Jonathan

Publication year - 2001

Publication title - Concurrency and Computation: Practice and Experience

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.309

H-Index - 67

eISSN - 1532-0634

pISSN - 1532-0626

DOI - 10.1002/cpe.549

Subject(s) - pentium, parallel computing, computer science, simd, mmx, matrix multiplication, matrix (chemical analysis), memory hierarchy, cache, physics, materials science, quantum mechanics, composite material, quantum

Generalized matrix–matrix multiplication forms the kernel of many mathematical algorithms, so a faster matrix–matrix multiply immediately benefits these algorithms. In this paper we implement efficient matrix multiplication for large matrices using the Intel Pentium single instruction multiple data (SIMD) floating point architecture. The main difficulty with the Pentium and other commodity processors is the need to use the cache hierarchy efficiently, particularly given the growing gap between main-memory and CPU clock speeds. We give a detailed description of the register allocation and the Level 1 and Level 2 cache blocking strategies that yield the best performance for the Pentium III family. Our results demonstrate performance that is on average 2.09 times faster than the leading public domain matrix–matrix multiply routines and comparable with Intel's SIMD small matrix–matrix multiply routines. Copyright © 2001 John Wiley & Sons, Ltd.
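
To make the idea of cache blocking concrete, the following is a minimal sketch in portable C. It is not the Emmerald kernel: it uses no SSE instructions and a single level of blocking, and the names `matmul_blocked`, `matmul_naive`, and the `BLOCK` size of 32 are assumptions chosen for illustration rather than values taken from the paper. It only shows how tiling the loops lets a small sub-block of each matrix be reused while it is likely still in cache.

```c
#include <stddef.h>

/* Tile edge length in elements. A hypothetical tuning value for
 * illustration only; the paper's actual block sizes are not given here. */
#define BLOCK 32

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Reference version: C += A * B for row-major n x n float matrices. */
void matmul_naive(size_t n, const float *A, const float *B, float *C)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t k = 0; k < n; ++k)
            for (size_t j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

/* Blocked version: the same arithmetic, but the i, k and j loops are
 * tiled so each BLOCK x BLOCK tile of A, B and C is reused many times
 * before the next tile is touched. */
void matmul_blocked(size_t n, const float *A, const float *B, float *C)
{
    for (size_t ii = 0; ii < n; ii += BLOCK) {
        size_t i_end = min_size(ii + BLOCK, n);
        for (size_t kk = 0; kk < n; kk += BLOCK) {
            size_t k_end = min_size(kk + BLOCK, n);
            for (size_t jj = 0; jj < n; jj += BLOCK) {
                size_t j_end = min_size(jj + BLOCK, n);
                /* Multiply one tile of A by one tile of B. */
                for (size_t i = ii; i < i_end; ++i) {
                    for (size_t k = kk; k < k_end; ++k) {
                        float a = A[i * n + k];
                        /* Unit-stride inner loop: the natural place
                         * for SIMD instructions in a tuned kernel. */
                        for (size_t j = jj; j < j_end; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
                }
            }
        }
    }
}
```

Both functions accumulate into `C`, so the caller is expected to zero it first. A tuned implementation of the kind the abstract describes would add a second level of blocking, explicit register allocation, and SIMD arithmetic in the innermost loop.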
