Open Access
An Efficient Parallel Top-k Similarity Join for Massive Multidimensional Data Using Spark
Author(s) - Dehua Chen, Changgan Shen, Jieying Feng, Jiajin Le
Publication year - 2015
Publication title - International Journal of Database Theory and Application
Language(s) - English
Resource type - Journals
eISSN - 2207-9688
pISSN - 2005-4270
DOI - 10.14257/ijdta.2015.8.3.06
Subject(s) - computer science, join (topology), spark (programming language), similarity (geometry), data mining, multidimensional data, theoretical computer science, artificial intelligence, programming language, mathematics, combinatorics, image (mathematics)
Top-k similarity join is used in a wide range of applications that need to find the k most similar pairs of data records in a given database. However, time performance becomes a challenge as a growing number of applications must process massive data, and finding the top-k pairs in such vast amounts of data with traditional methods is cumbersome. In this paper, we propose an RDD-based algorithm that performs the top-k similarity join for massive multidimensional data over a large cluster of commodity machines using Spark. The algorithm loads a set of multidimensional records stored in HDFS, proceeds in four steps, and finally writes an ordered list of the top-k closest pairs back to HDFS. First, we develop an efficient distance function based on LSH (Locality Sensitive Hashing) to speed up pairwise similarity comparison. Second, to minimize the amount of data held during RDD run time, we conceptually split all pairs of LSH signatures into partitions. Third, we exploit a serial computation strategy to calculate all top-k closest pairs in parallel. Finally, the local top-k pairs, sorted by their Hamming distances, are merged into the global top-k pairs. A performance evaluation comparing Spark and Hadoop confirms the effectiveness and scalability of our RDD-based algorithm.
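The four steps in the abstract can be illustrated with a minimal single-machine sketch in plain Python. It is not the authors' implementation and does not use Spark. The abstract does not say which LSH family is used, so random-hyperplane (sign) signatures are assumed here. All function names, the round-robin partitioning of pairs, and the parameters `bits`, `num_partitions` and `seed` are hypothetical choices made only for illustration.

```python
import heapq
import random
from itertools import combinations


def make_hyperplanes(dim, bits, seed=0):
    """Draw `bits` random hyperplanes (assumed LSH family: sign of a random projection)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def signature(record, hyperplanes):
    """Step 1: map a multidimensional record to a bit signature stored as an int."""
    sig = 0
    for plane in hyperplanes:
        dot = sum(x * w for x, w in zip(record, plane))
        sig = (sig << 1) | (1 if dot >= 0 else 0)
    return sig


def hamming(a, b):
    """Hamming distance between two signatures: number of differing bits."""
    return bin(a ^ b).count("1")


def pair_partitions(n, num_partitions):
    """Step 2: split all index pairs (i, j), i < j, into partitions (round-robin, an assumption)."""
    parts = [[] for _ in range(num_partitions)]
    for idx, pair in enumerate(combinations(range(n), 2)):
        parts[idx % num_partitions].append(pair)
    return parts


def local_top_k(pairs, sigs, k):
    """Step 3: scan one partition serially and keep its k closest pairs."""
    scored = ((hamming(sigs[i], sigs[j]), i, j) for i, j in pairs)
    return heapq.nsmallest(k, scored)


def top_k_similarity_join(records, k, bits=16, num_partitions=4, seed=0):
    """Step 4: merge the local top-k lists into the global top-k, ordered by distance."""
    planes = make_hyperplanes(len(records[0]), bits, seed)
    sigs = [signature(r, planes) for r in records]
    local_results = [
        local_top_k(part, sigs, k)
        for part in pair_partitions(len(records), num_partitions)
    ]
    return heapq.nsmallest(k, (t for part in local_results for t in part))


if __name__ == "__main__":
    data = [[1.0, 2.0], [1.0, 2.0], [-3.0, 0.5], [4.0, -1.0], [0.9, 2.1]]
    for dist, i, j in top_k_similarity_join(data, k=3):
        print(dist, i, j)
```

Each result is a tuple `(hamming_distance, i, j)` of record indices. Merging local lists is safe because any pair in the global top-k must also be in the top-k of its own partition. In a Spark setting, the per-partition scan would presumably map onto a per-partition RDD operation, with the merge done on the collected local lists, but that mapping is an assumption rather than something the abstract states.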
