Towards Scalability and Data Skew Handling in GroupBy-Joins using MapReduce Model

Open Access

Author(s) - Mohamad Al Hajj Hassan, Mostafa Bamha

Publication year - 2015

Publication title - Procedia Computer Science

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.334

H-Index - 76

ISSN - 1877-0509

DOI - 10.1016/j.procs.2015.05.200

Subject(s) - computer science, skew, joins, scalability, join (topology), spark (programming language), distributed computing, programming paradigm, computation, big data, parallel computing, data mining, database, algorithm, programming language, telecommunications, mathematics, combinatorics

For over a decade, MapReduce has been the leading programming model for the parallel processing of very large volumes of data. Its adoption has been driven by the development of many frameworks, such as Spark, Pig and Hive, that facilitate data analysis on large-scale systems. However, these frameworks remain vulnerable to communication costs, data skew and task imbalance problems. These can have a devastating effect on the performance and scalability of such systems, particularly when processing GroupBy-Join queries on large datasets. In this paper, we present a new GroupBy-Join algorithm that considerably reduces communication costs while avoiding data skew effects. A cost analysis of this algorithm shows that our approach is insensitive to data skew and ensures perfect balancing properties during all stages of GroupBy-Join computation, even for highly skewed data. This performance has been confirmed by a series of experiments.
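
To make the query class concrete, the sketch below shows what a GroupBy-Join computes and why data skew and communication costs matter for it. It is not the algorithm presented in the paper, whose details are not given in the abstract. It only illustrates the general, well-known idea of aggregating each relation per key before joining, so that a frequent key does not produce a large number of intermediate joined pairs. The relations `R` and `S`, the `SUM` aggregate and both function names are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical query: SELECT R.x, SUM(S.z) FROM R, S WHERE R.x = S.x GROUP BY R.x
# R holds (x, y) tuples and S holds (x, z) tuples.

def naive_groupby_join(R, S):
    """Join first, aggregate afterwards: every matching pair is materialised."""
    result = defaultdict(int)
    pairs = 0  # number of intermediate joined pairs
    for key_r, _ in R:
        for key_s, z in S:
            if key_r == key_s:
                result[key_r] += z
                pairs += 1
    return dict(result), pairs

def preaggregated_groupby_join(R, S):
    """Aggregate each relation per key first, then combine one record per key."""
    count_r = Counter(key for key, _ in R)   # occurrences of each key in R
    sum_s = defaultdict(int)                 # partial SUM(z) per key in S
    for key, z in S:
        sum_s[key] += z
    # Each S row with key k matches count_r[k] rows of R in the join.
    result = {k: count_r[k] * sum_s[k] for k in count_r if k in sum_s}
    records = len(count_r) + len(sum_s)      # per-key records exchanged
    return result, records

# Skewed input: key "a" appears 1000 times in R.
R = [("a", i) for i in range(1000)] + [("b", 0)]
S = [("a", 1), ("a", 2), ("b", 5), ("c", 7)]

naive_result, naive_pairs = naive_groupby_join(R, S)
pre_result, pre_records = preaggregated_groupby_join(R, S)
print(naive_result == pre_result, naive_pairs, pre_records)
```

In this toy input the naive version builds 2001 joined pairs, almost all of them for the single frequent key, while the pre-aggregated version handles only 5 per-key records and returns the same result. In a MapReduce setting, the first quantity corresponds roughly to the work concentrated on one reducer and the second to the data that would need to be shuffled.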
