Active learning support vector machines for optimal sample selection in classification
Author(s) -
Zomer Simeone,
Del Nogal Sánchez Miguel,
Brereton Richard G.,
Pérez Pavón José L.
Publication year - 2004
Publication title - Journal of Chemometrics
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.47
H-Index - 92
eISSN - 1099-128X
pISSN - 0886-9383
DOI - 10.1002/cem.872
Subject(s) - labelling , support vector machine , classifier (uml) , computer science , artificial intelligence , pattern recognition (psychology) , machine learning , structured support vector machine , data mining , chemistry , biochemistry
Labelling samples is a procedure that may cause significant delays, particularly when dealing with large datasets and/or when labelling requires prolonged analysis. In such cases a strategy that allows the construction of a reliable classifier from a minimal training set, by labelling only a small fraction of the samples, can be advantageous. Support vector machines (SVMs) are ideal for such an approach because the classifier relies on only a small subset of samples, namely the support vectors, while being independent of the remaining ones, which typically form the majority of the dataset. This paper describes a procedure in which an SVM classifier is constructed with support vectors systematically retrieved from the pool of unlabelled samples. The procedure is termed 'active' because the algorithm interacts with the samples prior to their labelling rather than waiting passively for the input. The learning behaviour on simulated datasets is analysed, and a practical application to the detection of hydrocarbons in soils using mass spectrometry is described. Results on simulations show that the active learning SVM performs optimally on datasets where the classes display an intermediate level of separation. On the real case study the classifier correctly assesses the membership of all samples in the original dataset while requiring labels for only around 14% of the data. Its subsequent application to a second dataset of analogous nature also provides perfect classification without further labelling, giving the same outcome as most classical techniques based on the entirely labelled original dataset. Copyright © 2004 John Wiley &amp; Sons, Ltd.
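A minimal sketch of the margin-based active-learning idea summarized in the abstract, using scikit-learn. This is an illustrative reconstruction, not the authors' exact algorithm: the two simulated Gaussian classes, the label budget, and the query rule (label the unlabelled sample closest to the current decision boundary, since it is the most likely support-vector candidate) are all assumptions made for the demo.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two simulated classes with an intermediate level of separation (assumption).
X = np.vstack([rng.normal(-1.5, 1.0, size=(100, 2)),
               rng.normal(+1.5, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

labelled = [0, 100]                      # one seed sample per class
unlabelled = [i for i in range(200) if i not in labelled]

clf = SVC(kernel="linear", C=1.0)
for _ in range(25):                      # label budget (hypothetical)
    clf.fit(X[labelled], y[labelled])
    # Query the unlabelled sample nearest the margin and "ask the oracle"
    # (here: the known labels y) for its class.
    dist = np.abs(clf.decision_function(X[unlabelled]))
    pick = unlabelled.pop(int(np.argmin(dist)))
    labelled.append(pick)

clf.fit(X[labelled], y[labelled])
accuracy = clf.score(X, y)               # fraction of the whole pool classified correctly
print(len(labelled), round(accuracy, 3))
```

With this budget the classifier sees 27 of 200 samples (13.5% of the pool), in the same spirit as the roughly 14% labelling fraction reported in the abstract, though the numbers here come from the simulated data above, not from the paper's soil dataset.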
