Query Optimization Algorithm of Replication Join Based on Sampling Partition
Author(s) - Xin Lü, Junchao Yang, Jiao Yuan, Xun Wang, Kun Fu, Ke Yang
Publication year - 2020
Publication title - Journal of Physics: Conference Series
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.21
H-Index - 85
eISSN - 1742-6596
pISSN - 1742-6588
DOI - 10.1088/1742-6596/1693/1/012074
Subject(s) - computer science, hash join, sort merge join, skew, join (topology), partition (number theory), query optimization, hash function, data mining, algorithm, theoretical computer science, mathematics, combinatorics, telecommunications, computer security
To address the low efficiency of join queries under data skew in the traditional MapReduce partition join algorithm, a replication join optimization algorithm based on sampling partition is proposed. Using sampled statistics of the join attribute, the algorithm divides the datasets being joined into a skewed data subset and a non-skewed data subset, and processes the join query on each separately to optimize query performance. For joins over the non-skewed subsets, an improved consistent hash function partitions the data so that the join-processing load is balanced across nodes. For joins over the skewed subsets, the smaller skewed subset is distributed to every node, while the larger skewed subset is partitioned on non-skewed fields; the skewed subsets are then joined in the Reduce stage. Experiments show that the algorithm improves join query performance under different data skew rates and achieves efficient join processing on large datasets.
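
The routing described in the abstract can be illustrated with a minimal single-process sketch in Python. It is not the authors' implementation: the function names (`find_skewed_keys`, `skew_aware_join`), the `HashRing` class, the frequency threshold, and the use of the row payload as the "non-skewed field" are all assumptions made for illustration. The ring is plain consistent hashing with virtual nodes, since the abstract does not detail the improved hash function.

```python
import hashlib
import random
from bisect import bisect_right
from collections import Counter, defaultdict


def stable_hash(value):
    """Deterministic hash; Python's built-in hash() is salted per process."""
    return int(hashlib.md5(str(value).encode()).hexdigest(), 16)


def find_skewed_keys(rows, sample_rate, threshold, seed=0):
    """Sample the join keys and flag those whose sampled share exceeds
    `threshold` (an assumed criterion, not taken from the paper)."""
    rng = random.Random(seed)
    sample = [key for key, _ in rows if rng.random() < sample_rate]
    counts = Counter(sample)
    total = max(len(sample), 1)
    return {key for key, count in counts.items() if count / total > threshold}


class HashRing:
    """Plain consistent hashing with virtual nodes (stand-in for the
    paper's improved variant, which is not reproduced here)."""

    def __init__(self, num_nodes, vnodes=64):
        points = sorted(
            (stable_hash(f"{node}#{v}"), node)
            for node in range(num_nodes)
            for v in range(vnodes)
        )
        self.hashes = [h for h, _ in points]
        self.nodes = [n for _, n in points]

    def node_for(self, key):
        # First ring point clockwise of the key's hash, wrapping around.
        i = bisect_right(self.hashes, stable_hash(key)) % len(self.hashes)
        return self.nodes[i]


def skew_aware_join(big, small, num_nodes, skewed):
    """Equi-join two lists of (key, payload) rows.
    Assumes `big` is the larger relation and `small` the smaller one."""
    ring = HashRing(num_nodes)
    big_parts = defaultdict(list)
    small_parts = defaultdict(list)

    # "Map" side: route rows of the larger relation.
    for key, payload in big:
        if key in skewed:
            # Skewed key: spread rows using a non-skewed field (the payload).
            node = stable_hash(payload) % num_nodes
        else:
            node = ring.node_for(key)
        big_parts[node].append((key, payload))

    # "Map" side: route rows of the smaller relation.
    for key, payload in small:
        if key in skewed:
            # Skewed key: replicate the row to every node.
            for node in range(num_nodes):
                small_parts[node].append((key, payload))
        else:
            small_parts[ring.node_for(key)].append((key, payload))

    # "Reduce" side: each node joins only its local partitions.
    result = []
    for node in range(num_nodes):
        lookup = defaultdict(list)
        for key, payload in small_parts[node]:
            lookup[key].append(payload)
        for key, payload in big_parts[node]:
            for other in lookup.get(key, []):
                result.append((key, payload, other))
    return result
```

Because every row of the larger relation lands on exactly one node, and each matching row of the smaller relation is either co-located by the ring or replicated everywhere, each matching pair is produced exactly once. The output is therefore the same multiset as a naive nested-loop join, while rows carrying a skewed key are spread over all nodes instead of piling up on one.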
