Avoiding model selection bias in small-sample genomic datasets
Author(s) - Daniel Berrar, Ian Bradbury, Werner Dubitzky
Publication year - 2006
Publication title - Bioinformatics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 3.599
H-Index - 390
eISSN - 1367-4811
pISSN - 1367-4803
DOI - 10.1093/bioinformatics/btl066
Subject(s) - computer science, resampling, sample size determination, context (archaeology), selection (genetic algorithm), sample (material), data mining, sampling (signal processing), sampling bias, model selection, artificial intelligence, machine learning, selection bias, statistics, mathematics, biology, paleontology, chemistry, filter (signal processing), chromatography, computer vision
Genomic datasets generated by high-throughput technologies are typically characterized by a moderate number of samples and a large number of measurements per sample. As a consequence, classification models are commonly compared based on resampling techniques. This investigation discusses the conceptual difficulties involved in comparative classification studies. Conclusions derived from such studies are often optimistically biased, because the apparent differences in performance are usually not controlled in a statistically stringent framework taking into account the adopted sampling strategy. We investigate this problem by means of a comparison of various classifiers in the context of multiclass microarray data.