Is cross-validation better than resubstitution for ranking genes?
Author(s) - Ulisses Braga-Neto, Ronaldo F. Hashimoto, Edward R. Dougherty, Danh V. Nguyen, Raymond J. Carroll
Publication year - 2004
Publication title - Bioinformatics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 3.599
H-Index - 390
eISSN - 1367-4811
pISSN - 1367-4803
DOI - 10.1093/bioinformatics/btg399
Subject(s) - computer science, classifier (uml), artificial intelligence, word error rate, probabilistic classification, cross validation, machine learning, linear discriminant analysis, pattern recognition (psychology), data mining, probabilistic logic, ranking (information retrieval), linear classifier, support vector machine, naive bayes classifier
Ranking gene feature sets is a key issue both for phenotype classification (for instance, tumor classification in a DNA microarray experiment) and for prediction in the context of genetic regulatory networks. Two broad methods are available to estimate the error (misclassification rate) of a classifier. Resubstitution fits a single classifier to the data and applies it in turn to each observation. Cross-validation (in leave-one-out form) removes each observation in turn, constructs the classifier from the remaining observations, and then checks whether this leave-one-out classifier correctly classifies the deleted observation. Resubstitution typically underestimates classifier error, severely so in many cases. Cross-validation has the advantage of producing an effectively unbiased error estimate, but the estimate is highly variable. In many applications it is not the misclassification rate per se that is of interest, but rather the construction of gene sets that have the potential to classify or predict. Hence, one needs to rank feature sets based on their performance.
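The two error estimators described in the abstract, and their use for ranking feature sets, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the nearest-centroid classifier stands in for whatever classifier is of interest, the function names (`resubstitution_error`, `loo_error`, `rank_feature_sets`) are hypothetical, and the data are synthetic. It is not the procedure used in the paper.

```python
import random
from itertools import combinations


def fit_centroids(X, y):
    """Nearest-centroid classifier (illustrative choice): per-class mean vectors."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Return the label of the closest centroid (ties go to the smallest label)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(sorted(centroids), key=lambda label: dist2(centroids[label]))


def resubstitution_error(X, y):
    """Fit once on all data, then test on the same data."""
    model = fit_centroids(X, y)
    wrong = sum(predict(model, x) != t for x, t in zip(X, y))
    return wrong / len(y)


def loo_error(X, y):
    """Leave-one-out: refit without observation i, then test on observation i."""
    wrong = 0
    for i in range(len(y)):
        model = fit_centroids(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        wrong += predict(model, X[i]) != y[i]
    return wrong / len(y)


def select(X, features):
    """Keep only the columns listed in `features`."""
    return [[row[j] for j in features] for row in X]


def rank_feature_sets(X, y, size, estimator):
    """Score every feature set of the given size; lowest estimated error first."""
    scored = [(estimator(select(X, fs), y), fs)
              for fs in combinations(range(len(X[0])), size)]
    return sorted(scored)


if __name__ == "__main__":
    # Synthetic data (assumption): 20 samples, 4 "genes", only gene 0 informative.
    random.seed(0)
    y = [0] * 10 + [1] * 10
    X = [[random.gauss(1.5 * t, 1.0)] + [random.gauss(0.0, 1.0) for _ in range(3)]
         for t in y]
    for name, est in [("resubstitution", resubstitution_error),
                      ("leave-one-out", loo_error)]:
        print(name, rank_feature_sets(X, y, 2, est)[:3])
```

Comparing the two rankings on small samples shows where the estimators agree or disagree about which feature sets come first, which is the question the title poses. The exhaustive search over feature sets is only practical for small numbers of genes and small set sizes.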