Analyzing large datasets with bootstrap penalization

Fang Kuangnan; Ma Shuangge

Author(s) - Fang Kuangnan, Ma Shuangge

Publication year - 2017

Publication title - Biometrical Journal

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 1.108

H-Index - 63

eISSN - 1521-4036

pISSN - 0323-3847

DOI - 10.1002/bimj.201600052

Subject(s) - covariate, regularization (linguistics), sample size determination, computer science, feature selection, big data, data set, homogeneous, model selection, selection (genetic algorithm), mathematics, r package, variable (mathematics), data mining, algorithm, statistics, artificial intelligence, mathematical analysis, combinatorics

Data with a large p (number of covariates) and/or a large n (sample size) are now commonly encountered. For many problems, regularization, especially penalization, is adopted for estimation and variable selection. The straightforward application of penalization to large datasets demands a “big computer” with high computational power. To improve computational feasibility, we develop bootstrap penalization, which dissects one big penalized estimation into a set of small ones that can be executed in a highly parallel manner, each demanding only a “small computer”. The proposed approach takes different strategies for data with different characteristics. For data with a large p but a small to moderate n, covariates are first clustered into relatively homogeneous blocks. The approach then proceeds in two sequential steps. In each step and for each bootstrap sample, we select blocks of covariates and run penalization. The results from multiple bootstrap samples are pooled to generate the final estimate. For data with a large n but a small to moderate p, we bootstrap a small number of subjects, apply penalized estimation, and then take a weighted average over multiple bootstrap samples. For data with a large p and a large n, a natural combination of the previous two methods is applied. Numerical studies, including simulations and data analysis, show that the proposed approach has computational and numerical advantages over the straightforward application of penalization. An R package has been developed to implement the proposed methods.
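The large-n strategy described above (bootstrap a small number of subjects, fit a penalized model, then take a weighted average over bootstrap samples) can be illustrated with a minimal Python sketch. This is not the authors' R package or their exact procedure. The abstract does not specify the penalty, the subsample size, or the weighting scheme, so the sketch assumes a Lasso penalty fitted by coordinate descent, a fixed subsample size `m`, and weights inversely proportional to the prediction error on a second random subsample. All function and parameter names here are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used by the Lasso update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    """Lasso via cyclic coordinate descent.

    Minimizes (1 / 2n) * ||y - X b||^2 + lam * ||b||_1.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float)  # residual y - X @ beta (beta starts at zero)
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual.
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

def bootstrap_penalized_large_n(X, y, lam, m, B, seed=0):
    """Average Lasso fits over B small bootstrap subsamples of size m.

    Assumption (not from the source): each fit is weighted by the inverse
    of its mean squared error on a second random subsample of size m.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    betas, weights = [], []
    for _ in range(B):
        fit_idx = rng.choice(n, size=m, replace=True)  # small bootstrap sample
        val_idx = rng.choice(n, size=m, replace=True)  # sample used for weighting
        beta = lasso_cd(X[fit_idx], y[fit_idx], lam)
        mse = np.mean((y[val_idx] - X[val_idx] @ beta) ** 2)
        betas.append(beta)
        weights.append(1.0 / (mse + 1e-12))
    w = np.array(weights) / np.sum(weights)
    return w @ np.vstack(betas)  # weighted average of coefficient vectors

# Synthetic illustration: large n, small p, two nonzero coefficients.
rng = np.random.default_rng(1)
n, p = 5000, 10
X = rng.standard_normal((n, p))
true_beta = np.zeros(p)
true_beta[0], true_beta[1] = 2.0, -1.0
y = X @ true_beta + rng.standard_normal(n)

est = bootstrap_penalized_large_n(X, y, lam=0.1, m=200, B=20)
print(np.round(est, 2))
```

Each of the `B` fits touches only `m` rows and does not depend on the others, so the loop body could be sent to separate workers, which matches the "small computer" idea in the abstract. The large-p strategy (clustering covariates into blocks and selecting blocks per bootstrap sample) is not covered by this sketch.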
