Large complex data: divide and recombine (D&R) with RHIPE
Author(s) - Saptarshi Guha, Ryan Hafen, Jeremiah Rounds, Jin Xia, Jianfu Li, Bowei Xi, William S. Cleveland
Publication year - 2012
Publication title - Stat
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.61
H-Index - 18
ISSN - 2049-1573
DOI - 10.1002/sta4.7
Subject(s) - embarrassingly parallel , computer science , computation , visualization , big data , parallel computing , data visualization , statistical analysis , theoretical computer science , data mining , computational science , algorithm , mathematics , statistics
D&R is a new statistical approach to the analysis of large complex data. The data are divided into subsets. Computationally, each subset is a small dataset. Analytic methods are applied to each of the subsets, and the outputs of the method are recombined to form a result for the entire data. The computations can run in parallel with no communication among them, making them embarrassingly parallel, the simplest form of parallel processing. Using D&R, a data analyst can apply almost any statistical or visualization method to large complex data. Direct application of most analytic methods to the entire data is either infeasible or impractical. D&R enables deep analysis: comprehensive analysis, including visualization of the detailed data, that minimizes the risk of losing important information. One of our D&R research thrusts uses statistics to develop “best” division and recombination procedures for analytic methods. Another is a D&R computational environment built on two widely used components, R and Hadoop, together with RHIPE, our merger of the two. Hadoop is a distributed database and parallel compute engine that executes the embarrassingly parallel D&R computations across a cluster. RHIPE allows the analysis to be carried out wholly from within R, making programming with the data very efficient. Copyright © 2012 John Wiley & Sons, Ltd.
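To make the divide/apply/recombine pattern concrete, the following is a minimal sketch in plain base R rather than the RHIPE API, with synthetic data standing in for a large complex dataset that would in practice live in Hadoop. The division, the linear model, and coefficient averaging as the recombination are illustrative assumptions, not the paper's specific procedures.

## Assumption: synthetic data in place of a large dataset stored in Hadoop.
set.seed(1)
n <- 100000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 2 + 3 * d$x1 - 1 * d$x2 + rnorm(n)

## Division: split the rows into 100 subsets; computationally, each
## subset is a small dataset.
subset_id <- rep(seq_len(100), length.out = n)
subsets <- split(d, subset_id)

## Apply: fit the analytic method (here, a linear model) to each subset
## independently, with no communication among the fits.
fits <- lapply(subsets, function(s) lm(y ~ x1 + x2, data = s))

## Recombine: average the per-subset coefficient estimates to form a
## single result for the entire data.
dr_estimate <- rowMeans(sapply(fits, coef))
dr_estimate

Because each iteration of lapply touches only its own subset, the fits could be dispatched across a cluster with no communication between them; this is exactly the embarrassingly parallel structure that Hadoop exploits and that RHIPE exposes from within R.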