Large complex data: divide and recombine (D&R) with RHIPE
Author(s) - Saptarshi Guha, Ryan Hafen, Jeremiah Rounds, Jin Xia, Jianfu Li, Bowei Xi, William S. Cleveland
Publication year - 2012
Publication title - Stat
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.61
H-Index - 18
ISSN - 2049-1573
DOI - 10.1002/sta4.7
Subject(s) - embarrassingly parallel , computer science , computation , visualization , big data , parallel computing , data visualization , statistical analysis , theoretical computer science , data mining , computational science , algorithm , mathematics , statistics
D&R is a new statistical approach to the analysis of large complex data. The data are divided into subsets. Computationally, each subset is a small dataset. Analytic methods are applied to each of the subsets, and the outputs of the method are recombined to form a result for the entire data. The computations can run in parallel with no communication among them, making them embarrassingly parallel, the simplest form of parallel processing. Using D&R, a data analyst can apply almost any statistical or visualization method to large complex data. Direct application of most analytic methods to the entire data is either infeasible or impractical. D&R enables deep analysis: comprehensive analysis, including visualization of the detailed data, that minimizes the risk of losing important information. One of our D&R research thrusts uses statistics to develop “best” division and recombination procedures for analytic methods. Another is a D&R computational environment built on two widely used components, R and Hadoop, together with RHIPE, our merger of the two. Hadoop is a distributed database and parallel compute engine that executes the embarrassingly parallel D&R computations across a cluster. RHIPE allows the analysis to be carried out wholly from within R, making programming with the data very efficient. Copyright © 2012 John Wiley & Sons, Ltd.
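To make the divide/apply/recombine pattern concrete, the following is a minimal sketch in plain base R rather than the RHIPE API, with synthetic data standing in for a large complex dataset that would in practice live in Hadoop. The division, the linear model, and coefficient averaging as the recombination are illustrative assumptions, not the paper's specific procedures.

## Assumption: synthetic data in place of a large dataset stored in Hadoop.
set.seed(1)
n <- 100000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 2 + 3 * d$x1 - 1 * d$x2 + rnorm(n)

## Division: split the rows into 100 subsets; computationally, each
## subset is a small dataset.
subset_id <- rep(seq_len(100), length.out = n)
subsets <- split(d, subset_id)

## Apply: fit the analytic method (here, a linear model) to each subset
## independently, with no communication among the fits.
fits <- lapply(subsets, function(s) lm(y ~ x1 + x2, data = s))

## Recombine: average the per-subset coefficient estimates to form a
## single result for the entire data.
dr_estimate <- rowMeans(sapply(fits, coef))
dr_estimate

Because each iteration of lapply touches only its own subset, the fits could be dispatched across a cluster with no communication between them; this is exactly the embarrassingly parallel structure that Hadoop exploits and that RHIPE exposes from within R.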