CABANA: Cluster-Aware Query Batching for Accelerating Billion-Scale ANNS With Intel AMX
Author(s) -
Minho Kim,
Houxiang Ji,
Jaeyoung Kang,
Hwanjun Lee,
Daehoon Kim,
Nam Sung Kim
Publication year - 2025
Publication title -
IEEE Computer Architecture Letters
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.272
H-Index - 36
eISSN - 1556-6064
pISSN - 1556-6056
DOI - 10.1109/lca.2025.3596970
Subject(s) - computing and processing
Retrieval-augmented generation (RAG) systems increasingly rely on Approximate Nearest Neighbor Search (ANNS) to efficiently retrieve relevant context from billion-scale vector databases. While IVF-based ANNS frameworks scale well overall, the fine search stage remains a bottleneck due to its compute-intensive GEMV operations, particularly under large query volumes. To address this, we propose CABANA, a cluster-aware query batching for ANNS acceleration mechanism using Intel Advanced Matrix Extensions (AMX) that reformulates these GEMV computations into high-throughput GEMM operations. By aggregating queries targeting the same clusters, CABANA enables batched computation during fine search, significantly improving compute intensity and memory access regularity. Evaluations on billion-scale datasets show that CABANA outperforms traditional SIMD-based implementations, achieving up to 32.6× higher query throughput with minimal overhead, while maintaining high recall rates.
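The batching idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the data layout, and the inner-product metric are all assumptions made for illustration, and the pure-Python `matmul_nt` only stands in for the AMX-backed GEMM the paper describes. The sketch groups queries by the clusters they probe, then scores each cluster's vectors against the whole query group in one matrix-matrix product instead of one matrix-vector product per query.

```python
from collections import defaultdict


def batch_queries_by_cluster(probe_lists):
    """Map each cluster id to the indices of the queries that probe it."""
    batches = defaultdict(list)
    for q_idx, cluster_ids in enumerate(probe_lists):
        for c in cluster_ids:
            batches[c].append(q_idx)
    return dict(batches)


def matmul_nt(a, b):
    """Compute a @ b^T for lists of rows (hypothetical stand-in for a GEMM kernel)."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b]
            for row_a in a]


def fine_search(queries, clusters, probe_lists, k):
    """Return the top-k vector ids per query by inner product (assumed metric).

    clusters maps cluster id -> (vector_ids, vectors).
    probe_lists[i] lists the clusters chosen for query i by the coarse search.
    """
    candidates = defaultdict(list)
    for c, q_ids in batch_queries_by_cluster(probe_lists).items():
        vector_ids, vectors = clusters[c]
        q_block = [queries[q] for q in q_ids]
        # One GEMM per cluster replaces len(q_ids) separate GEMVs.
        scores = matmul_nt(q_block, vectors)
        for q, row in zip(q_ids, scores):
            candidates[q].extend(zip(row, vector_ids))
    return [
        [vid for _, vid in sorted(candidates[q], key=lambda t: -t[0])[:k]]
        for q in range(len(queries))
    ]
```

In this sketch, each cluster's vectors are read once per batch rather than once per query, which is the source of the higher compute intensity the abstract refers to.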