Open Access
CABANA: Cluster-Aware Query Batching for Accelerating Billion-Scale ANNS With Intel AMX
Author(s) -
Minho Kim,
Houxiang Ji,
Jaeyoung Kang,
Hwanjun Lee,
Daehoon Kim,
Nam Sung Kim
Publication year - 2025
Publication title -
IEEE Computer Architecture Letters
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.272
H-Index - 36
eISSN - 1556-6064
pISSN - 1556-6056
DOI - 10.1109/lca.2025.3596970
Subject(s) - computing and processing
Retrieval-augmented generation (RAG) systems increasingly rely on Approximate Nearest Neighbor Search (ANNS) to efficiently retrieve relevant context from billion-scale vector databases. While IVF-based ANNS frameworks scale well overall, the fine search stage remains a bottleneck due to its compute-intensive GEMV operations, particularly under large query volumes. To address this, we propose CABANA, a cluster-aware query batching for ANNS acceleration mechanism using Intel Advanced Matrix Extensions (AMX) that reformulates these GEMV computations into high-throughput GEMM operations. By aggregating queries targeting the same clusters, CABANA enables batched computation during fine search, significantly improving compute intensity and memory access regularity. Evaluations on billion-scale datasets show that CABANA outperforms traditional SIMD-based implementations, achieving up to $32.6\times$ higher query throughput with minimal overhead, while maintaining high recall rates.
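The core idea in the abstract, grouping queries that probe the same IVF cluster so that many matrix-vector products become one matrix-matrix product, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation and does not use AMX. The function names (`group_queries_by_cluster`, `batched_fine_search`, `per_query_fine_search`), the data layout, and the use of inner product as the similarity measure are all assumptions made for illustration.

```python
from collections import defaultdict


def dot(a, b):
    """Inner product of two equal-length vectors (assumed similarity measure)."""
    return sum(x * y for x, y in zip(a, b))


def group_queries_by_cluster(probe_lists):
    """Map each cluster id to the indices of the queries that probe it.

    probe_lists[i] is the list of cluster ids chosen for query i by the
    coarse search stage (hypothetical input format).
    """
    batches = defaultdict(list)
    for q_idx, cluster_ids in enumerate(probe_lists):
        for c in cluster_ids:
            batches[c].append(q_idx)
    return dict(batches)


def top_k(scored, k):
    """Return the ids of the k highest-scoring (score, id) pairs."""
    ranked = sorted(scored, key=lambda t: (-t[0], t[1]))
    return [vid for _, vid in ranked[:k]]


def per_query_fine_search(queries, clusters, probe_lists, k):
    """Baseline: one GEMV-shaped pass per (query, cluster) pair."""
    results = []
    for q, cluster_ids in zip(queries, probe_lists):
        scored = []
        for c in cluster_ids:
            scored += [(dot(vec, q), vid) for vid, vec in clusters[c]]
        results.append(top_k(scored, k))
    return results


def batched_fine_search(queries, clusters, probe_lists, k):
    """Cluster-aware batching: one GEMM-shaped pass per cluster.

    clusters maps a cluster id to a list of (vector_id, vector) pairs.
    """
    scored = [[] for _ in queries]
    for c, q_indices in group_queries_by_cluster(probe_lists).items():
        members = clusters[c]
        # (members x dim) times (dim x batch): each cluster's vectors are
        # read once and reused for every query in the batch.
        scores = [[dot(vec, queries[q]) for q in q_indices]
                  for _, vec in members]
        for col, q in enumerate(q_indices):
            for row, (vid, _) in enumerate(members):
                scored[q].append((scores[row][col], vid))
    return [top_k(s, k) for s in scored]
```

Both functions return the same neighbors. The batched version only changes the loop order so that each cluster's vectors are visited once per batch rather than once per query, which is the access pattern a GEMM kernel on matrix hardware could exploit.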
