
Efficient Cache Performance Equivalent 2D-Texel to Memory Mapping Identification for Embedded GPUs
Author(s) - Ahmed El-Mahdy, Marwa K. Elteir, Kholoud Shata
Publication year - 2025
Publication title - IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3573668
Subject(s) - aerospace , bioengineering , communication, networking and broadcast technologies , components, circuits, devices and systems , computing and processing , engineered materials, dielectrics and plasmas , engineering profession , fields, waves and electromagnetics , general topics for engineers , geoscience , nuclear engineering , photonics and electrooptics , power, energy and industry applications , robotics and control systems , signal processing and analysis , transportation
With the growing use of smart devices, embedded GPUs have become a plausible platform for intensive computations that approach workstation performance, opening the way for complex, server-grade applications on such devices. Although general-purpose development frameworks exist, they generally abstract away implementation details, making further performance optimization difficult. One such critical aspect is the design parameters of the texture memory hierarchy, which are primarily trade secrets. Unfortunately, standard cache hierarchy identification methods are not applicable, because a logical 2D-texel-to-memory mapping is applied before physical cache access to exploit the 2D locality of accesses. This paper presents, for the first time, a parameterized model capable of describing the underlying multi-dimensional tiling layouts governing this mapping. Although it can be shown that reverse-engineering the model to obtain the exact parameters is an undecidable problem, we strive to obtain a corresponding set of parameters that results in the same cache behavior. In particular, such parameters define the texture order, yielding a linear order that traverses the caches in memory-contiguous blocks. This study proposes a reverse-engineering observation method that exploits the contiguity of tiled cached regions and cache set conflicts to efficiently reveal such parameters. The complexity of the method is O(n^2), where n is the number of bits in one of the texture buffer dimensions, ensuring practical applicability. Furthermore, optimizing a benchmark workload by aligning its input data and memory access pattern with the inferred multidimensional tiling layout yields up to a 2.22× speedup.
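To make the idea of a 2D-texel-to-memory mapping concrete, the following minimal sketch compares a plain row-major layout with a single-level tiled layout and counts how many cache lines a small 2D block of texels touches. It is an illustration only, not the parameterized model from the paper: the function names, the single-level row-major tiling, and the parameters `tile_w`, `tile_h`, and `texels_per_line` are all assumptions chosen for the example.

```python
def tiled_offset(x, y, width, tile_w, tile_h):
    """Linear texel index under a hypothetical single-level tiling.

    Tiles are stored one after another in row-major tile order, and texels
    inside a tile are stored row-major. Assumes width is a multiple of tile_w.
    """
    tiles_per_row = width // tile_w
    tile_x, in_x = divmod(x, tile_w)
    tile_y, in_y = divmod(y, tile_h)
    tile_index = tile_y * tiles_per_row + tile_x
    return tile_index * (tile_w * tile_h) + in_y * tile_w + in_x


def cache_lines_touched(coords, width, tile_w, tile_h, texels_per_line):
    """Set of cache-line indices hit by the given texel coordinates."""
    return {
        tiled_offset(x, y, width, tile_w, tile_h) // texels_per_line
        for x, y in coords
    }


# A 4x4 block of texels at the origin of a 16-texel-wide texture.
block = [(x, y) for y in range(4) for x in range(4)]

# tile_w == width and tile_h == 1 degenerates to an ordinary row-major layout.
linear_lines = cache_lines_touched(block, 16, 16, 1, 16)

# With 4x4 tiles, the same block occupies one contiguous run of 16 texels.
tiled_lines = cache_lines_touched(block, 16, 4, 4, 16)

print(len(linear_lines), len(tiled_lines))
```

Under these assumed parameters, the 2D block spans four cache lines in the row-major layout but only one in the tiled layout, which is the kind of locality effect that makes the tiling parameters matter for cache behavior.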