
Estimating Median in The Multi-sourced Heterogeneous Data Set: A distributed implementation
Author(s) -
Tian Ming Gao,
Yanyu Zhao,
Xinyi Deng,
Wenze Li,
Hongjun Li,
Xin Zhao
Publication year - 2020
Publication title -
Journal of Physics: Conference Series
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.21
H-Index - 85
eISSN - 1742-6596
pISSN - 1742-6588
DOI - 10.1088/1742-6596/1437/1/012020
Subject(s) - computer science , set (abstract data type) , big data , data set , data mining , data integration , task (project management) , interpolation (computer graphics) , engineering , artificial intelligence , programming language , animation , computer graphics (images) , systems engineering
The continuous running of enterprise applications produces a huge volume of business data that resides in different storage and system environments and is owned by different companies or organizations, forming a typical distributed, multi-sourced heterogeneous data set. Such data sets offer great potential value for official statistics. Although the median is a commonly used indicator in official statistics, estimating it in the distributed computing environment of a multi-sourced heterogeneous data set is not trivial, because the median, unlike a sum or a count, cannot in general be derived from the medians of the individual sources. In this paper, we propose a distributed method to estimate the median of a multi-sourced heterogeneous data set. Taking into account the differing sizes of the source data sets and the uneven distribution of their values, we first improve the traditional interpolation-based median estimation method for this setting. We then propose a distributed implementation of the method based on web service technology. Finally, we evaluate the accuracy and performance of the proposed method through an experimental study.
CCS Concepts - Information systems ➝ Data management systems ➝ Information integration ➝ Mediators and data integration
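To make the idea of interpolation-based median estimation over several sources concrete, the following is a minimal Python sketch of the classical grouped-data formula, median ≈ L + ((N/2 − F) / f) · w, where L is the lower edge of the bin containing the median, N the total count, F the cumulative count below that bin, f the count in the bin, and w the bin width. It is not the improved method proposed in the paper, whose details are not given in the abstract. The function names (`local_histogram`, `merge_histograms`, `interpolated_median`) are hypothetical, and the sketch assumes that all sources agree on the same bin edges in advance, so that each source only needs to send bin counts rather than raw values.

```python
from bisect import bisect_right
from typing import List, Sequence


def local_histogram(values: Sequence[float], edges: Sequence[float]) -> List[int]:
    """Count one source's values per bin [edges[i], edges[i+1]).

    The last bin is closed on the right so that edges[-1] itself is counted.
    Assumption: every source uses the same, pre-agreed edges.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            raise ValueError(f"value {v} lies outside the agreed bin range")
        i = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[i] += 1
    return counts


def merge_histograms(histograms: Sequence[Sequence[int]]) -> List[int]:
    """Add up per-source bin counts; this is the only data that is exchanged."""
    return [sum(column) for column in zip(*histograms)]


def interpolated_median(counts: Sequence[int], edges: Sequence[float]) -> float:
    """Classical grouped-data median: L + ((N/2 - F) / f) * w."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot estimate the median of an empty data set")
    half = total / 2
    cumulative = 0  # F: number of values in the bins before the current one
    for i, count in enumerate(counts):
        if count > 0 and cumulative + count >= half:
            lower = edges[i]
            width = edges[i + 1] - edges[i]
            return lower + (half - cumulative) / count * width
        cumulative += count
    raise AssertionError("unreachable: cumulative count always reaches N/2")


if __name__ == "__main__":
    edges = [0, 2, 4, 6, 8, 10]             # shared bin edges (assumed)
    source_a = [1, 2, 3, 4]                  # small source
    source_b = [5, 6, 7, 8, 9, 10]           # larger source
    merged = merge_histograms(
        [local_histogram(source_a, edges), local_histogram(source_b, edges)]
    )
    print(interpolated_median(merged, edges))
```

The estimate depends on the bin layout: the formula assumes values are spread evenly inside the median bin, so coarse bins or skewed sources can move the result away from the exact median. This sensitivity to source size and value distribution is the kind of issue the abstract says the improved method addresses.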