High‐dimensional data analysis: Selection of variables, data compression and graphics – Application to gene expression
Author(s) - Läuter Jürgen, Horn Friedemann, Rosołowski Maciej, Glimm Ekkehard
Publication year - 2009
Publication title - Biometrical Journal
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.108
H-Index - 63
eISSN - 1521-4036
pISSN - 0323-3847
DOI - 10.1002/bimj.200800207
Subject(s) - overfitting, resampling, computer science, expression (computer science), data mining, parametric statistics, feature selection, dimension (graph theory), selection (genetic algorithm), graphics, curse of dimensionality, algorithm, mathematics, artificial intelligence, statistics, artificial neural network, computer graphics (images), pure mathematics, programming language
The paper presents effective and mathematically exact procedures for selection of variables which are applicable in cases with a very high dimension as, for example, in gene expression analysis. Choosing sets of variables is an important method to increase the power of the statistical conclusions and to facilitate the biological interpretation. For the construction of sets, each single variable is considered as the centre of potential sets of variables. Testing for significance is carried out by means of the Westfall‐Young principle based on resampling or by the parametric method of spherical tests. The particular requirements for statistical stability are taken into account; each kind of overfitting is avoided. Thus, high power is attained and the familywise type I error can be kept in spite of the large dimension. To obtain graphical representations by heat maps and curves, a specific data compression technique is applied. Gene expression data from B‐cell lymphoma patients serve for the demonstration of the procedures.
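As a rough illustration of the resampling idea named in the abstract, the following is a minimal sketch of single-step maxT adjustment in the spirit of the Westfall‐Young principle, applied to single variables in a two-group comparison. It is not the authors' procedure, which builds sets of variables around each centre variable. The function name `maxt_adjusted_pvalues`, the Welch-type statistic, the synthetic data and all parameter values are assumptions made for this example only.

```python
import numpy as np


def maxt_adjusted_pvalues(x, labels, n_perm=500, seed=0):
    """Single-step maxT adjusted p-values (hypothetical helper).

    x      : array of shape (n_samples, n_variables)
    labels : array of 0/1 group labels, one per sample
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)

    def abs_t(lab):
        # Absolute Welch-type t statistic for every variable at once.
        a, b = x[lab == 0], x[lab == 1]
        se = np.sqrt(a.var(axis=0, ddof=1) / len(a)
                     + b.var(axis=0, ddof=1) / len(b))
        return np.abs(a.mean(axis=0) - b.mean(axis=0)) / se

    observed = abs_t(labels)
    # Start at 1 so the observed labelling counts as one permutation.
    exceed = np.ones_like(observed)
    for _ in range(n_perm):
        # Compare each observed statistic with the maximum over all
        # variables under a random relabelling; using the maximum is
        # what controls the familywise type I error.
        max_t = abs_t(rng.permutation(labels)).max()
        exceed += max_t >= observed
    return exceed / (n_perm + 1)


# Synthetic data (assumption): 20 samples, 200 variables, and only
# variable 0 carries a true group difference.
rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 10)
x = rng.normal(size=(20, 200))
x[labels == 1, 0] += 6.0

adj_p = maxt_adjusted_pvalues(x, labels)
print(adj_p[0], adj_p[1:].min())
```

Because every variable is compared with the permutation distribution of the maximum statistic, the adjustment accounts for the dependence between variables without modelling it explicitly. The price is that the number of permutations bounds the smallest attainable adjusted p-value.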