Skew‐aware online aggregation over joins through guided sampling
Author(s) - Wang Yuxiang, Jin Jiahui, Xu Xiaoliang, Zhang Longbin
Publication year - 2018
Publication title - Concurrency and Computation: Practice and Experience
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.309
H-Index - 67
eISSN - 1532-0634
pISSN - 1532-0626
DOI - 10.1002/cpe.4695
Subject(s) - joins, skew, computer science, sampling (signal processing), data mining, benchmark (surveying), key (lock), sample (material), skewness, sampling distribution, statistics, mathematics, telecommunications, chemistry, computer security, geodesy, filter (signal processing), chromatography, computer vision, programming language, geography
Summary Online aggregation is a query processing technique that continuously returns approximate answers with error guarantees (in the form of confidence intervals) while the query executes. This approach offers users a suitable tradeoff between query efficiency and accuracy. The key issue in online aggregation is how to collect random samples both efficiently and effectively. However, the often‐used “blind” sampling method does not adequately consider dataset statistics and other useful information, leading to inefficient sampling and poor sample quality. This becomes a glaring performance issue when joins run over skewed data distributions. To alleviate this problem, we use dataset statistics to propose a new “guided” sampling approach, which consists of a logic‐partition‐based weighted Gaussian sampling method tailored for skewed join keys, as well as a two‐level sample allocation method that applies to skewed measured values. Extensive experiments using the TPC‐H benchmark with skewed data distributions demonstrate our solution's superior performance.
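
To make the idea of continuously refined answers with confidence intervals concrete, the following is a minimal Python sketch of online aggregation for a SUM query over a single in-memory column. It illustrates only the "blind" uniform-sampling baseline that the summary criticizes, not the guided sampling method of the paper. The function name `online_sum_estimates`, its parameters, and the normal-approximation interval (z = 1.96 for roughly 95% confidence) are assumptions made for illustration.

```python
import math
import random


def online_sum_estimates(values, batch_size=1000, num_batches=10, z=1.96, seed=0):
    """Yield (samples_seen, sum_estimate, half_width) after each batch.

    Hypothetical helper: samples uniformly with replacement ("blind" sampling)
    and reports a CLT-based confidence interval estimate +/- half_width.
    """
    rng = random.Random(seed)
    population = len(values)
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    for _ in range(num_batches):
        for _ in range(batch_size):
            x = values[rng.randrange(population)]
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
        variance = m2 / (n - 1) if n > 1 else 0.0
        # Scale the sample mean and its standard error up to the full table.
        half_width = z * population * math.sqrt(variance / n)
        yield n, population * mean, half_width


if __name__ == "__main__":
    column = [i % 100 for i in range(10000)]
    for seen, estimate, half_width in online_sum_estimates(column):
        print(f"{seen:>6} samples: SUM ~ {estimate:.0f} +/- {half_width:.0f}")
```

Under this scheme the interval narrows roughly in proportion to the inverse square root of the number of samples. When a few join keys or measured values dominate the aggregate, uniform draws rarely hit them, so the variance estimate stays large or unstable; this is the situation that motivates using dataset statistics to guide the sampling.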
