
$S^{3}$A-NPU: A High-Performance Hardware Accelerator for Spiking Self-Supervised Learning With Dynamic Adaptive Memory Optimization
Author(s) -
Heuijee Yun,
Daejin Park
Publication year - 2025
Publication title -
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.506
H-Index - 105
eISSN - 1557-9999
pISSN - 1063-8210
DOI - 10.1109/TVLSI.2025.3566949
Subject(s) - Components, Circuits, Devices and Systems; Computing and Processing
Spiking self-supervised learning (SSL) has become prevalent for its low power consumption and low latency, as well as its ability to learn from large quantities of unlabeled data. However, its computational intensity and resource requirements pose significant challenges for accelerator implementation. In this article, we propose the scalable, spiking self-supervised learning, streamline optimization accelerator ($S^{3}$A) neural processing unit (NPU), a highly optimized accelerator for spiking SSL models. The architecture minimizes memory access by leveraging input data provided by the user and optimizes computation by maximizing data reuse. By dynamically optimizing memory based on model characteristics and implementing specialized operations for data preprocessing, which is critical in SSL, computational efficiency is significantly improved. Parallel processing lanes serve the two encoders of the SSL architecture and are combined with a pipelined structure that accounts for the temporal data accumulation of spiking neural networks (SNNs), further enhancing computational efficiency. We evaluate the design on a field-programmable gate array (FPGA), where a 16-bit quantized spiking residual network (ResNet) model trained on the Canadian Institute for Advanced Research (CIFAR) and MNIST datasets achieves a top accuracy of 94.08%. The $S^{3}$A-NPU optimization significantly improves computational resource utilization, resulting in a 25% reduction in latency. Moreover, as the first spiking self-supervised accelerator, it demonstrates highly efficient computation compared with existing accelerators, utilizing only 29k lookup tables (LUTs) and eight block random access memories (BRAMs). This makes it highly suitable for resource-constrained applications, particularly for spiking SSL models on edge devices. We implemented it on a silicon chip using a 130-nm process design kit (PDK), and the design occupies less than $1~\text{cm}^{2}$.