Open Access
Design and Implementation of Cache Memory with Dual Unit Tile/Line Accessibility
Author(s) - Baokang Wang
Publication year - 2019
Publication title - Mathematical Problems in Engineering
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.262
H-Index - 62
eISSN - 1026-7077
pISSN - 1024-123X
DOI - 10.1155/2019/9601961
Subject(s) - computer science, cache, parallel computing, cpu cache, translation lookaside buffer, cache coloring, page cache, cache pollution, cache algorithms, computer hardware, embedded system, physical address, semiconductor memory
In recent years, the increasing disparity between the data access speed of cache memory and the processing speed of processors has become a major bottleneck for high-performance 2-dimensional (2D) data processing, such as that in scientific computing and image processing. To address this problem, this paper proposes a new dual unit tile/line access cache memory based on a hierarchical hybrid Z-ordering data layout and a multibank cache organization supporting skewed storage schemes. The proposed layout improves 2D data locality and efficiently reduces L1 cache misses and Translation Lookaside Buffer (TLB) misses, and it is obtained from the conventional raster layout by a simple hardware-based address translation unit. In addition, we propose an aligned tile set replacement algorithm (ATSRA) to reduce the hardware overhead of the tag memory in the proposed cache. Simulation results using Matrix Multiplication (MM) show that the proposed cache with parallel unit tile/line accessibility reduces both L1 cache misses and TLB misses considerably compared with the conventional raster layout and the Z-Morton order layout. The number of load instructions required for parallel unit tile/line access was reduced to about one-fourth of that required with conventional load instructions, and the execution time for parallel load instructions was reduced to about one-third of that for conventional load instructions. Using 40 nm Complementary Metal-Oxide-Semiconductor (CMOS) technology, we combined the proposed cache with a SIMD-based data path and designed a 5 × 5 mm² Large-Scale Integration (LSI) chip. With the ATSRA method, the total hardware overhead of the proposed ATSRA-cache was held to only 105% of that of a conventional cache.
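
The hierarchical hybrid Z-ordering layout mentioned in the abstract builds on the basic Z-Morton mapping, in which the linear address of a 2D element is obtained by interleaving the bits of its row and column indices so that small square tiles occupy contiguous memory. The following is a minimal C sketch of only that underlying bit-interleaving step; the 16-bit index width, element-granular addressing, and function names are assumptions made for illustration, and the paper's actual hierarchical hybrid layout and hardware address translation unit are not reproduced here.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Spread the low 16 bits of x into the even bit positions of a 32-bit word. */
static uint32_t spread_bits(uint32_t x)
{
    x &= 0x0000FFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* Z-Morton index: column bits occupy the even positions and row bits the odd
   positions, so the elements of each 2^k x 2^k tile are contiguous in memory. */
static uint32_t z_order(uint32_t row, uint32_t col)
{
    return spread_bits(col) | (spread_bits(row) << 1);
}

int main(void)
{
    /* The top-left 2 x 2 tile maps to the four consecutive indices 0..3. */
    for (uint32_t r = 0; r < 2; ++r)
        for (uint32_t c = 0; c < 2; ++c)
            printf("(%" PRIu32 ",%" PRIu32 ") -> %" PRIu32 "\n", r, c, z_order(r, c));
    return 0;
}

Because the interleaving keeps each 2^k × 2^k tile contiguous, a tile-granular access touches far fewer cache lines and pages than the same tile stored in raster order, which is the locality property the proposed layout exploits.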
