An efficient similarity join approach on large‐scale high‐dimensional data using random projection
Author(s) - Ma Youzhong, Zhang Ruiling, Jia Shijie, Zhang Yongxin, Meng Xiaofeng
Publication year - 2019
Publication title - Concurrency and Computation: Practice and Experience
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.309
H-Index - 67
eISSN - 1532-0634
pISSN - 1532-0626
DOI - 10.1002/cpe.5303
Subject(s) - join (topology) , similarity (geometry) , projection (relational algebra) , random projection , scalability , dimension (graph theory) , computer science , scale (ratio) , dimensionality reduction , data mining , block (permutation group theory) , algorithm , curse of dimensionality , filter (signal processing) , mathematics , artificial intelligence , database , combinatorics , physics , quantum mechanics , image (mathematics) , computer vision
Summary Similarity join on large‐scale high‐dimensional data faces major challenges because of the data scale and the curse of dimensionality. Random projection with a p‐stable distribution can reduce high‐dimensional data from d dimensions to k dimensions ( k ≪ d ), and the distances between data points in the k‐dimensional space can then be used to filter out as many data pairs as possible at relatively low cost. Based on this idea, we propose two novel approaches to similarity join on large‐scale high‐dimensional data: the projection‐based similarity join (PromSimJ) algorithm and the projection space partitioning–based similarity join (ProSPSimJ) algorithm. Comprehensive experiments were performed to evaluate both methods and to compare them with the naive block nested loop join. The experimental results show that our approaches perform much better than this baseline and scale well.
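The filter-then-verify idea in the summary can be illustrated with a minimal single-machine sketch. It is not the authors' PromSimJ or ProSPSimJ algorithm: the function name `projection_filter_join`, the `slack` factor, and the choice of a Gaussian (2-stable, Euclidean distance) projection are assumptions made for illustration only. Candidate pairs are pruned using the cheap k-dimensional distance, and the survivors are verified in the original d-dimensional space.

```python
import numpy as np

def projection_filter_join(data, eps, k=8, slack=2.0, seed=0):
    """Illustrative eps-similarity self-join with a random-projection filter.

    Hypothetical helper: `slack` is an assumed tuning knob, not a parameter
    taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n, d = data.shape
    # Gaussian entries are 2-stable: each projected coordinate of (x - y)
    # is distributed as ||x - y||_2 times a standard normal variable.
    proj = rng.standard_normal((d, k)) / np.sqrt(k)
    low = data @ proj  # n x k projected data

    result = []
    for i in range(n):
        # Cheap distances in the k-dimensional space to all later points.
        diff_low = low[i + 1:] - low[i]
        dist_low = np.sqrt((diff_low ** 2).sum(axis=1))
        # Filter step: keep only pairs whose projected distance is small.
        candidates = np.nonzero(dist_low <= slack * eps)[0] + i + 1
        # Verify step: exact distance in the original d-dimensional space.
        for j in candidates:
            if np.linalg.norm(data[i] - data[j]) <= eps:
                result.append((i, int(j)))
    return result
```

Because every reported pair is verified exactly, the sketch never returns a false match. The filter is probabilistic, however, so a true pair can occasionally be pruned; a larger `slack` or `k` makes that less likely at the price of more candidates to verify.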