Online updating method with new variables for big data streams

Author(s) - Wang Chun, Chen MingHui, Wu Jing, Yan Jun, Zhang Yuping, Schifano Elizabeth

Publication year - 2018

Publication title - Canadian Journal of Statistics

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.804

H-Index - 51

eISSN - 1708-945X

pISSN - 0319-5724

DOI - 10.1002/cjs.11330

Subject(s) - computer science, data stream mining, big data, data stream, context (archaeology), data mining, set (abstract data type), data set, linear regression, algorithm, machine learning, artificial intelligence, telecommunications, paleontology, biology, programming language

For big data arriving in streams, online updating is an important statistical method that breaks the storage barrier and the computational barrier under certain circumstances. In the regression context, online updating algorithms assume that the set of predictor variables does not change, and consequently they cannot incorporate new variables that may become available midway through the data stream. A naive approach would be to discard all previous information and start updating with the new variables from scratch. We propose a method that uses the information from earlier data in the online updating algorithm, with bias corrections, to improve efficiency. The method is developed for linear models first and then extended to estimating equations for generalized linear models. Closed-form expressions for the efficiency gain over the naive approach are derived in a particular linear model setting. We compare the performance of our proposed bias-correcting approach and the naive approach in simulation studies with data generated from a normal linear model and a logistic regression model. The method is applied to a study on airline delay, where reasons for delays only became available in 2003.

The Canadian Journal of Statistics 46: 123–146; 2018 © 2017 Statistical Society of Canada
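To make the setting concrete, the following is a minimal Python sketch of generic online updating for a linear model, together with the naive restart that the abstract uses as its baseline. It is not the authors' bias-corrected estimator, which the abstract does not specify in enough detail to reproduce. The class `OnlineOLS`, the helper `simulate_stream`, the block sizes, and the coefficient values are all hypothetical choices made for illustration.

```python
import numpy as np


class OnlineOLS:
    """Cumulative least squares: stores only X'X and X'y, never the raw blocks."""

    def __init__(self, n_predictors):
        self.xtx = np.zeros((n_predictors, n_predictors))
        self.xty = np.zeros(n_predictors)

    def update(self, X, y):
        # Fold one block of the stream into the running sufficient statistics.
        self.xtx += X.T @ X
        self.xty += X.T @ y

    def coef(self):
        # Solve the accumulated normal equations.
        return np.linalg.solve(self.xtx, self.xty)


def simulate_stream(n_blocks=10, block_size=200, switch_block=4, seed=0):
    """Hypothetical stream: predictor x2 is only observed from switch_block on."""
    rng = np.random.default_rng(seed)
    beta = np.array([1.0, -2.0, 0.5])  # intercept, x1, x2 (illustrative values)
    blocks = []
    for k in range(n_blocks):
        x1 = rng.normal(size=block_size)
        x2 = 0.6 * x1 + rng.normal(size=block_size)  # correlated with x1
        X = np.column_stack([np.ones(block_size), x1, x2])
        y = X @ beta + rng.normal(size=block_size)
        blocks.append((X, y, k >= switch_block))
    return blocks, beta


blocks, beta_true = simulate_stream()

reduced = OnlineOLS(2)  # model fitted before x2 exists
naive = OnlineOLS(3)    # naive approach: start from scratch once x2 arrives

for X, y, has_new_variable in blocks:
    if has_new_variable:
        naive.update(X, y)
    else:
        reduced.update(X[:, :2], y)  # x2 is unavailable in early blocks

# Batch fit on the late blocks only, for comparison with the naive estimate.
X_late = np.vstack([X for X, _, late in blocks if late])
y_late = np.concatenate([y for _, y, late in blocks if late])
batch = np.linalg.lstsq(X_late, y_late, rcond=None)[0]

print("reduced-model estimate (early blocks):", reduced.coef())
print("naive full-model estimate (late blocks):", naive.coef())
print("batch fit on late blocks:", batch)
```

The naive estimate agrees with a batch fit on the late blocks, but it ignores every observation collected before the new variable appeared. In this simulated setup the early reduced-model fit omits a predictor that is correlated with `x1`, so its coefficients cannot simply be pooled with the full-model ones; that mismatch is the kind of bias the proposed method is described as correcting.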
