
From TeAAL to FuseMax: Separation of Concerns for Attention Accelerator Design
Author(s) -
Nandeeka Nayak,
Xinrui Wu,
Toluwanimi O. Odemuyiwa,
Michael Pellauer,
Joel S. Emer,
Christopher W. Fletcher
Publication year - 2025
Publication title -
IEEE Micro
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.649
H-Index - 94
eISSN - 1937-4143
pISSN - 0272-1732
DOI - 10.1109/mm.2025.3589955
Subject(s) - computing and processing
Attention for transformers has recently received significant ‘attention’ as a target for custom acceleration. Our prior work, TeAAL [4], proposes a new accelerator design methodology that allows architects to reason about and optimize their designs iteratively. With a focus on attention, this work makes contributions to both the theory and practice of TeAAL’s methodology. On the theory side, we propose a set of analyses that can be applied using only the algorithm specification—the cascade of Einsums. On the practice side, we use the new analyses to analyze and taxonomize the space of attention algorithms and to iteratively build up an efficient, high-utilization accelerator. Our resulting design, FuseMax, achieves an average 6.7× speedup on attention and 5.3× speedup on end-to-end transformer inference over the prior state-of-the-art, FLAT [3], while using 79% and 83% of the energy, respectively.
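For readers unfamiliar with the "cascade of Einsums" framing, standard attention can itself be written as such a cascade: one Einsum for the query-key scores, a few for a numerically stable softmax, and one for the weighted combination with the values. The following NumPy sketch is purely illustrative (the shapes, rank names, and variable names are this example's assumptions, not TeAAL's specification language):

```python
import numpy as np

# Hypothetical shapes: E = embedding rank, M = query positions, N = key positions.
E, M, N = 8, 4, 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((M, E))
K = rng.standard_normal((N, E))
V = rng.standard_normal((N, E))

# Einsum 1: similarity scores  QK[m, n] = sum_e Q[m, e] * K[n, e]
QK = np.einsum('me,ne->mn', Q, K) / np.sqrt(E)

# Einsums 2-4: numerically stable softmax, reduced over the n rank
RM = QK.max(axis=1, keepdims=True)   # running row maximum
SE = np.exp(QK - RM)                 # shifted exponentials
SD = np.einsum('mn->m', SE)          # softmax denominator

# Einsum 5: weighted combination  AV[m, e] = sum_n A[m, n] * V[n, e]
AV = np.einsum('mn,ne->me', SE / SD[:, None], V)
```

Analyses like those in the paper operate on this kind of specification alone, before any hardware mapping is chosen.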