An Efficient Parallel Top-k Similarity Join for Massive Multidimensional Data Using Spark



Open Access


Author(s) - Dehua Chen, Changgan Shen, Jieying Feng, Jiajin Le

Publication year - 2015

Publication title - International Journal of Database Theory and Application

Language(s) - English

Resource type - Journals

eISSN - 2207-9688

pISSN - 2005-4270

DOI - 10.14257/ijdta.2015.8.3.06

Subject(s) - computer science, join (topology), spark (programming language), similarity (geometry), data mining, multidimensional data, theoretical computer science, artificial intelligence, programming language, mathematics, combinatorics, image (mathematics)

Top-k similarity join is used in a wide range of applications that need to find the k most similar pairs of data records in a given database. However, running time becomes a challenging problem as a growing number of applications must process massive data, and finding the top-k pairs in such volumes with traditional methods is impractical. In this paper, we propose an RDD-based algorithm that performs the top-k similarity join for massive multidimensional data on a large cluster of commodity machines using Spark. The algorithm consists of four steps: it loads a set of multidimensional records stored in HDFS and finally writes an ordered list of the top-k closest pairs back to HDFS. First, we develop an efficient distance function based on LSH (Locality Sensitive Hashing) to speed up pairwise similarity comparison. Second, to minimize the amount of data held while the RDDs are processed, we conceptually split all pairs of LSH signatures into partitions. Third, we exploit a serial computation strategy to calculate all top-k closest pairs in parallel. Finally, the local top-k pairs, sorted by their Hamming distances, are merged into the global top-k pairs. A performance evaluation comparing Spark and Hadoop confirms the effectiveness and scalability of our RDD-based algorithm.
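
The four steps in the abstract can be illustrated with a small single-machine sketch in plain Python. It is not the paper's implementation: the abstract does not say which LSH family is used, so random-hyperplane (sign) hashing is assumed here, the round-robin split of pairs is only a stand-in for the paper's partitioning scheme, and ordinary lists stand in for Spark RDD partitions. All function names and parameters (num_bits, num_partitions, seed) are hypothetical.

```python
import heapq
import random
from itertools import combinations


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplanes (assumed LSH family: sign projections)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(record, hyperplanes):
    """Step 1: map a record to a bit signature, one bit per hyperplane."""
    sig = 0
    for plane in hyperplanes:
        dot = sum(p * x for p, x in zip(plane, record))
        sig = (sig << 1) | (1 if dot >= 0 else 0)
    return sig


def hamming(a, b):
    """Hamming distance between two integer-encoded signatures."""
    return bin(a ^ b).count("1")


def pair_partitions(n, num_partitions):
    """Step 2: split all index pairs (i < j) into partitions (round-robin here)."""
    parts = [[] for _ in range(num_partitions)]
    for idx, pair in enumerate(combinations(range(n), 2)):
        parts[idx % num_partitions].append(pair)
    return parts


def local_top_k(pairs, sigs, k):
    """Step 3: scan one partition serially, keeping its k closest pairs."""
    scored = ((hamming(sigs[i], sigs[j]), i, j) for i, j in pairs)
    return heapq.nsmallest(k, scored)


def top_k_similarity_join(records, k, num_bits=16, num_partitions=4, seed=0):
    """Step 4: merge the local top-k lists into one ordered global top-k list."""
    planes = make_hyperplanes(num_bits, len(records[0]), seed)
    sigs = [signature(r, planes) for r in records]
    local_lists = [
        local_top_k(part, sigs, k)
        for part in pair_partitions(len(records), num_partitions)
    ]
    return heapq.nsmallest(k, (t for lst in local_lists for t in lst))


records = [(1.0, 2.0), (1.0, 2.0), (-3.0, 0.5), (4.0, -1.0)]
result = top_k_similarity_join(records, k=2)
# Records 0 and 1 are identical, so the pair (0, 1) has Hamming distance 0
# and appears first; each entry is (distance, i, j).
```

Merging local lists is safe because any pair in the global top-k is also among the k closest pairs of its own partition. In a Spark setting, the per-partition scan would presumably run inside each RDD partition and only the short local lists would be brought together for the final merge.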
