An overview of techniques for linking high‐dimensional molecular data to time‐to‐event endpoints by risk prediction models
Author(s) - Binder Harald, Porzelius Christine, Schumacher Martin
Publication year - 2011
Publication title - Biometrical Journal
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.108
H-Index - 63
eISSN - 1521-4036
pISSN - 0323-3847
DOI - 10.1002/bimj.201000152
Subject(s) - covariate , computer science , feature selection , data mining , identification (biology) , predictive modelling , event (particle physics) , clustering high dimensional data , accelerated failure time model , high dimensional , machine learning , artificial intelligence , biology , botany , physics , quantum mechanics , cluster analysis
Analysis of molecular data promises identification of biomarkers for improving prognostic models, thus potentially enabling better patient management. For identifying such biomarkers, risk prediction models can be employed that link high‐dimensional molecular covariate data to a clinical endpoint. In low‐dimensional settings, a multitude of statistical techniques already exists for building such models, e.g. allowing for variable selection or for quantifying the added value of a new biomarker. We provide an overview of techniques for regularized estimation that transfer these capabilities to high‐dimensional settings, with a focus on models for time‐to‐event endpoints. Techniques for incorporating specific covariate structure are discussed, as well as techniques for dealing with more complex endpoints. Employing gene expression data from patients with diffuse large B‐cell lymphoma, some typical modeling issues from low‐dimensional settings are illustrated in a high‐dimensional application. First, the performance of classical stepwise regression is compared to stagewise regression, as implemented by a componentwise likelihood‐based boosting approach. A second issue arises when artificially transforming the response into a binary variable. The effects of the resulting loss of efficiency and potential bias in a high‐dimensional setting are illustrated, and a link to competing risks models is provided. Finally, we discuss conditions for adequately quantifying the added value of high‐dimensional gene expression measurements, both at the stage of model fitting and when performing evaluation.
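To make the idea of stagewise regression via componentwise likelihood‐based boosting more concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes a Cox proportional hazards model with Breslow handling of ties, and in each step updates only the single covariate with the largest penalized score statistic, using one penalized Newton step. The function names `cox_score_info` and `componentwise_cox_boost`, the parameters `steps` and `penalty`, and the simulated data are all hypothetical and chosen for illustration only.

```python
import numpy as np


def cox_score_info(X, time, event, eta):
    """Per-covariate score and observed information of the Cox partial
    log-likelihood at linear predictor eta (Breslow handling of ties)."""
    w = np.exp(eta - eta.max())  # rescaling cancels in the risk-set ratios
    score = np.zeros(X.shape[1])
    info = np.zeros(X.shape[1])
    for i in np.flatnonzero(event):
        at_risk = time >= time[i]
        wr, Xr = w[at_risk], X[at_risk]
        s0 = wr.sum()
        mean = wr @ Xr / s0            # weighted risk-set mean per covariate
        second = wr @ (Xr ** 2) / s0   # weighted second moment per covariate
        score += X[i] - mean
        info += second - mean ** 2
    return score, info


def componentwise_cox_boost(X, time, event, steps=40, penalty=50.0):
    """Stagewise fit: each step changes exactly one coefficient.

    `steps` and `penalty` are illustrative tuning parameters (assumptions);
    in practice the number of steps would be chosen, e.g., by cross-validation.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        # Previous steps enter through the current linear predictor X @ beta.
        score, info = cox_score_info(X, time, event, X @ beta)
        stat = score ** 2 / (info + penalty)       # penalized score statistic
        j = int(np.argmax(stat))                   # best single covariate
        beta[j] += score[j] / (info[j] + penalty)  # penalized Newton step
    return beta


# Simulated example (assumption): 120 patients, 200 covariates,
# only the first covariate affects the hazard.
rng = np.random.default_rng(0)
n, p = 120, 200
X = rng.standard_normal((n, p))
event_time = rng.exponential(1.0 / np.exp(1.5 * X[:, 0]))
censor_time = rng.exponential(3.0, size=n)
time = np.minimum(event_time, censor_time)
event = event_time <= censor_time

beta_hat = componentwise_cox_boost(X, time, event)
print("selected covariates:", np.flatnonzero(beta_hat))
```

Because at most one coefficient changes per step, the number of nonzero coefficients can never exceed the number of boosting steps, which is what makes this kind of stagewise fitting usable when there are more covariates than patients. Classical stepwise regression, in contrast, refits all selected coefficients without penalization at each step.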