Automatic specimen identification of Harpacticoids (Crustacea: Copepoda) using Random Forest and MALDI-TOF mass spectra, including a post hoc test for false positive discovery
Author(s) -
Rossel Sven,
Martínez Arbizu Pedro
Publication year - 2018
Publication title -
Methods in Ecology and Evolution
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 3.425
H-Index - 105
ISSN - 2041-210X
DOI - 10.1111/2041-210x.13000
Subject(s) - bootstrapping (finance) , random forest , smoothing , identification (biology) , computer science , false positive paradox , artificial intelligence , machine learning , statistics , mathematics , biology , ecology , econometrics
Ecological studies require accurate identification of specimens. This is very time consuming when processing plankton, meiobenthos or soil biota samples, because they contain large numbers of minute specimens. A possible solution is MALDI-TOF MS, an emerging technique for the identification of metazoan species. As an alternative to factory-delivered software or clustering approaches, Random Forest (RF) models can be trained to identify species from MALDI-TOF data. However, in a real-world scenario, an RF model cannot recognise species that were not included in the training dataset; it assigns them to one of the known classes instead, producing false positives (misidentifications).

We produced MALDI-TOF MS spectra for meiofauna species and trained RF models, using MALDI-TOF bins as predictors and species as the multi-level target class. We used the empirical beta distribution of the probability of class assignment in the model to design a post hoc test for false positive discovery. Two strategies increase the final accuracy of species identification: (1) “class smoothing”, which adds in silico observations to each class, created by bootstrapping the value of each predictor within that class; and (2) adding a “null class” to the training dataset, created by bootstrapping predictor values and shuffling predictor labels, which yields a class without multivariate signal.

We show that RF is an excellent method for species identification using MALDI-TOF MS data. The models are flexible enough to correctly classify observations created in silico by smoothing the classes. Our post hoc test successfully unmasks false positive classifications. Smoothing the classes and adding a null class to the training data draws false positives towards the null class. In our example, 100% false positive discovery could be achieved while maintaining very high overall prediction accuracy.
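The two augmentation strategies named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The function names, the toy intensity values, and the number of in silico observations are hypothetical, and the null-class construction is only one possible reading of "bootstrapping predictor values and shuffling predictor labels".

```python
import random


def smooth_class(rows, n_new, rng):
    """Create n_new in silico observations for a single class.

    Each predictor value is drawn with replacement from the values that
    predictor takes within the class, independently of the other predictors.
    """
    n_predictors = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_predictors)]
    return [[rng.choice(col) for col in columns] for _ in range(n_new)]


def make_null_class(all_rows, n_new, rng):
    """Create observations without multivariate signal (assumed reading).

    Values are bootstrapped per predictor from the pooled training data,
    then shuffled across predictor positions so that no value stays tied
    to the bin it came from.
    """
    n_predictors = len(all_rows[0])
    columns = [[row[j] for row in all_rows] for j in range(n_predictors)]
    null_rows = []
    for _ in range(n_new):
        row = [rng.choice(col) for col in columns]
        rng.shuffle(row)  # break the link between values and predictor labels
        null_rows.append(row)
    return null_rows


rng = random.Random(0)

# Toy "spectra": three bins per specimen, three specimens per species.
training = {
    "species_a": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [1.0, 0.0, 0.0]],
    "species_b": [[0.0, 0.7, 0.9], [0.1, 0.6, 1.0], [0.0, 0.8, 0.8]],
}

# Class smoothing: append five in silico observations to each class.
augmented = {
    name: rows + smooth_class(rows, 5, rng) for name, rows in training.items()
}

# Null class: built from the pooled original observations.
pooled = [row for rows in training.values() for row in rows]
augmented["null"] = make_null_class(pooled, 8, rng)

for name, rows in augmented.items():
    print(name, len(rows))
```

The augmented dictionary would then be flattened into a predictor matrix and a label vector before training a Random Forest classifier of choice.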
Combining MALDI-TOF MS and RF models is a step towards a fully automatic species identification workflow, which is particularly necessary for species-rich communities of small organisms, both in ecological studies and in routine monitoring. The post hoc test for false positive discovery can be applied to any RF multilevel classification model, not only in a biological context.
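As a rough illustration of how a post hoc test based on a beta distribution of assignment probabilities might work, the following Python sketch fits a beta distribution to the assignment probabilities of reference specimens of one class and flags a new assignment whose probability falls in the lower tail. The abstract does not give the exact procedure, so everything here is an assumption: the method-of-moments fit, the Monte Carlo tail estimate, the 0.05 threshold, the function names, and the example probabilities are all hypothetical.

```python
import random
from statistics import mean, variance


def fit_beta_moments(probs):
    """Estimate beta parameters (a, b) by the method of moments."""
    m = mean(probs)
    v = variance(probs)
    if not 0 < v < m * (1 - m):
        raise ValueError("probabilities are not compatible with a beta fit")
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common


def lower_tail(p, a, b, rng, n_draws=10000):
    """Monte Carlo estimate of P(X <= p) for X ~ Beta(a, b)."""
    hits = sum(rng.betavariate(a, b) <= p for _ in range(n_draws))
    return hits / n_draws


def is_suspected_false_positive(p, a, b, rng, alpha=0.05):
    """Flag an assignment whose probability is unusually low for its class.

    alpha is an assumed significance level, not a value from the source.
    """
    return lower_tail(p, a, b, rng) < alpha


rng = random.Random(1)

# Hypothetical RF assignment probabilities of correctly classified
# reference specimens of one class.
reference_probs = [0.92, 0.88, 0.95, 0.85, 0.90, 0.97, 0.91, 0.89]
a, b = fit_beta_moments(reference_probs)

# Two new specimens assigned to the same class with different probabilities.
typical = is_suspected_false_positive(0.93, a, b, rng)
doubtful = is_suspected_false_positive(0.45, a, b, rng)
print(typical, doubtful)
```

In this toy case the assignment with probability 0.45 lies far below the reference distribution and is flagged, while the one with probability 0.93 is not. Because the check only needs per-class assignment probabilities, the same idea carries over to any multilevel classifier that reports them.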