Premium
A novel approach for high‐dimensional vector similarity join query
Author(s) -
Ma Youzhong,
Jia Shijie,
Zhang Yongxin
Publication year - 2016
Publication title -
concurrency and computation: practice and experience
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.309
H-Index - 67
eISSN - 1532-0634
pISSN - 1532-0626
DOI - 10.1002/cpe.3952
Subject(s) - join (topology) , joins , computer science , nested loop join , scalability , speedup , similarity (geometry) , piecewise , theoretical computer science , aggregate (composite) , curse of dimensionality , data mining , algorithm , database , mathematics , artificial intelligence , parallel computing , combinatorics , image (mathematics) , mathematical analysis , materials science , composite material , programming language
SUMMARY In this paper, we mainly focus on similarity joins on massive high‐dimensional vectors, it is a costly operation because of curse of dimensionality. The main idea of our proposed solution is to design a novel filtering method that can filter out as many vector pairs as possible at relative low cost. By using the good dimension reduction ability of piecewise aggregate approximation and symbolic aggregate approximation techniques, we proposed three novel approaches to deal with high dimensional‐vector similarity join query: single‐PAA–based vector similarity join query, multi‐PAA–based vector similarity join query and SAX‐based vector similarity join query. We conducted comprehensive experiments to test the performance of the above approaches, we also test the speedup ratio compared with the naive method: block nested loop join. The experimental results show that our approaches have much better performance than that of block nested loop join and also have good scalability. To the best of our knowledge, this is the first work to try to deal with high‐dimensional vector similarity joins using piecewise aggregate approximation and symbolic aggregate approximation techniques, and the approaches proposed in this paper provide a new way to process the massive high‐dimensional vector data set.