An efficient similarity join approach on large‐scale high‐dimensional data using random projection


Author(s) - Ma Youzhong, Zhang Ruiling, Jia Shijie, Zhang Yongxin, Meng Xiaofeng

Publication year - 2019

Publication title - Concurrency and Computation: Practice and Experience

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.309

H-Index - 67

eISSN - 1532-0634

pISSN - 1532-0626

DOI - 10.1002/cpe.5303

Subject(s) - join (topology), similarity (geometry), projection (relational algebra), random projection, scalability, dimension (graph theory), computer science, scale (ratio), dimensionality reduction, data mining, block (permutation group theory), algorithm, curse of dimensionality, filter (signal processing), mathematics, artificial intelligence, database, combinatorics, physics, quantum mechanics, image (mathematics), computer vision

Summary - Similarity join on large‐scale high‐dimensional data faces major challenges because of the data scale and the curse of dimensionality. Random projection with a p‐stable distribution can reduce high‐dimensional data from d dimensions to k dimensions (k ≪ d), and the distances between data points in the k‐dimensional space can then be used to filter out as many data pairs as possible at relatively low cost. Based on this idea, we propose two novel approaches to similarity join on large‐scale high‐dimensional data: the projection‐based similarity join (PromSimJ) algorithm and the projection space partitioning–based similarity join (ProSPSimJ) algorithm. Comprehensive experiments were performed to evaluate both methods and to compare them with the naive block nested loop join. The experimental results show that our approaches achieve much better performance and good scalability.
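The filter-then-verify idea in the summary can be illustrated with a minimal sketch. This is not the paper's PromSimJ or ProSPSimJ algorithm, whose details are not given here. It assumes Euclidean distance, a Gaussian (2-stable) projection matrix, and a hypothetical `slack` factor that decides how aggressively pairs are pruned in the k-dimensional space. All function and parameter names are illustrative. Because the projection preserves distances only approximately, the filter can discard some true matches; every pair it does return is checked against the exact distance. A plain in-memory nested loop stands in for the block nested loop baseline.

```python
import math
import random


def gaussian_projection(d, k, rng):
    """Build a k x d matrix of i.i.d. N(0, 1/k) entries (Gaussian is 2-stable)."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def project(matrix, x):
    """Map a d-dimensional vector x to k dimensions."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def nested_loop_join(points, eps):
    """Baseline: compare every pair in the original d-dimensional space."""
    n = len(points)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if euclidean(points[i], points[j]) <= eps]


def projection_filtered_join(points, eps, k, slack=1.5, seed=0):
    """Self-join with a cheap k-dimensional filter before exact verification.

    `slack` is an assumed tuning knob: larger values prune fewer pairs and
    lose fewer true matches, at the price of more exact distance checks.
    """
    rng = random.Random(seed)
    matrix = gaussian_projection(len(points[0]), k, rng)
    low = [project(matrix, p) for p in points]
    result = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            # Cheap test in k dimensions: skip pairs that look far apart.
            if euclidean(low[i], low[j]) > slack * eps:
                continue
            # Exact test in d dimensions removes any false positives.
            if euclidean(points[i], points[j]) <= eps:
                result.append((i, j))
    return result


# Small usage example on synthetic data (64 dimensions reduced to 8).
demo_rng = random.Random(42)
data = [[demo_rng.random() for _ in range(64)] for _ in range(100)]
approx_pairs = projection_filtered_join(data, eps=2.8, k=8)
exact_pairs = nested_loop_join(data, eps=2.8)
```

In this sketch `approx_pairs` is always a subset of `exact_pairs`; how many true pairs survive the filter depends on `k` and `slack`, which would need tuning on real data.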
