External cross‐validation for unbiased evaluation of protein family detectors: Application to allergens
Author(s) - Daniel Soeria-Atmadja, Mikael Wallman, Åsa K. Björklund, Anders Isaksson, Ulf Hammerling, Mats G. Gustafsson
Publication year - 2005
Publication title - Proteins: Structure, Function, and Bioinformatics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 1.699
H-Index - 191
eISSN - 1097-0134
pISSN - 0887-3585
DOI - 10.1002/prot.20656
Subject(s) - detector , context (archaeology) , computer science , cross validation , algorithm , word error rate , selection (genetic algorithm) , statistics , mathematical optimization , mathematics , machine learning , artificial intelligence , biology , telecommunications , paleontology
Key issues in protein science and computational biology are the design and evaluation of algorithms aimed at detecting proteins that belong to a specific family, as defined by structural, evolutionary, or functional criteria. In this context, validation techniques are often used to compare different parameter settings of the detector and subsequently to select the setting that yields the smallest error rate estimate. A frequently overlooked problem with this approach is that this smallest error rate estimate may have a large optimistic bias. Based on computer simulations, we show that a detector's error rate estimate can be overly optimistic, and we propose a method to obtain unbiased performance estimates of a detector design procedure. The method is founded on an external 10‐fold cross‐validation (CV) loop that embeds an internal validation procedure used for parameter selection in detector design. The detector designed in each of the 10 iterations is evaluated on held‐out examples exclusively available in the external CV iterations. Notably, the average of these 10 performance estimates is not associated with a final detector, but rather with the average performance of the design procedure used. We apply the external CV loop to the particular problem of detecting potentially allergenic proteins, using a previously reported design procedure. Unbiased performance estimates of the allergen detector design procedure are presented together with information about which algorithms and parameter settings are most frequently selected. Proteins 2005. © 2005 Wiley‐Liss, Inc.
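The external CV loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' allergen detector or their design procedure: the 1-D k-nearest-neighbour classifier, the parameter grid, the synthetic data with purely random labels, and all function names (`design`, `external_cv`, and so on) are assumptions chosen only to keep the example self-contained. The sketch shows how the internal CV used for parameter selection is embedded inside an external 10-fold loop, so that each designed detector is scored only on examples it never saw during design.

```python
import random
from statistics import mean

def k_folds(n, k, rng):
    """Shuffle indices 0..n-1 and split them into k near-equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def split(data, held):
    """Return (training examples, held-out examples) for one fold."""
    held_set = set(held)
    train = [d for i, d in enumerate(data) if i not in held_set]
    test = [data[i] for i in held]
    return train, test

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D feature)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def error_rate(train, test, k):
    """Fraction of held-out examples misclassified by a k-NN detector."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in test)
    return wrong / len(test)

def cv_error(data, k_param, n_folds, rng):
    """Internal CV error estimate for one parameter setting."""
    errs = []
    for held in k_folds(len(data), n_folds, rng):
        train, test = split(data, held)
        errs.append(error_rate(train, test, k_param))
    return mean(errs)

def design(data, grid, rng, n_folds=5):
    """Hypothetical design procedure: choose the setting with the
    smallest internal CV error. The returned error is the (possibly
    optimistic) estimate that parameter selection was based on."""
    scores = {k: cv_error(data, k, n_folds, rng) for k in grid}
    best = min(scores, key=scores.get)
    return best, scores[best]

def external_cv(data, grid, rng, n_outer=10):
    """External CV loop: rerun the whole design procedure in each
    iteration and score the result on the external held-out fold."""
    outer_errs, selected = [], []
    for held in k_folds(len(data), n_outer, rng):
        train, test = split(data, held)
        best, _ = design(train, grid, rng)  # selection sees only `train`
        outer_errs.append(error_rate(train, test, best))
        selected.append(best)
    return mean(outer_errs), selected

rng = random.Random(0)
# Assumption: labels are independent of the feature, so no detector
# can truly do better than chance on new data.
data = [(rng.random(), rng.randint(0, 1)) for _ in range(60)]
grid = [1, 3, 5, 7, 9, 11, 15]

best, internal_err = design(data, grid, rng)
external_err, selected = external_cv(data, grid, rng)
print("selected setting on all data:", best)
print("smallest internal CV error:", round(internal_err, 3))
print("external CV error of the design procedure:", round(external_err, 3))
print("settings selected per external iteration:", selected)
```

The external estimate belongs to the design procedure as a whole rather than to any single detector, which is why the list of settings selected in each external iteration is reported alongside it.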