SAGAD: Synthetic Data Generator for Tabular Datasets
Author(s) - Henrique Matheus F. da Silva, Rafael S. Pereira Silva, Fábio Porto
Publication year - 2021
Language(s) - English
Resource type - Conference proceedings
DOI - 10.5753/sbbd.2021.17861
Subject(s) - computer science, machine learning, artificial intelligence, generator (circuit theory), training set, entropy (arrow of time), labeled data, face (sociological concept), data mining, social science, power (physics), physics, quantum mechanics, sociology
The accuracy of machine learning models implementing classification tasks is strongly dependent on the quality of the training dataset. This is a challenge for domains where data is not abundant, such as personalized medicine, or is unbalanced, as in the case of images of plant species, where some species have very few samples while others offer a large number of samples. In both scenarios, the resulting models tend to offer poor performance. In this paper we present two techniques to address this challenge. First, we present a data augmentation method called SAGAD, based on conditional entropy. SAGAD can balance minority classes while also increasing the overall size of the training set. In our experiments, applying SAGAD to small-data problems with different machine learning algorithms yielded a significant improvement in performance. We additionally present an extension of SAGAD for iterative learning algorithms, called DABEL, which generates new samples for each epoch using an optimization approach that continuously improves the model's performance. The adoption of SAGAD and DABEL consistently extends the training dataset towards improved target classification performance.
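The abstract names conditional entropy as the basis of SAGAD but does not give its formulas or procedure. As a minimal sketch of that quantity only (not the authors' method), the Python function below estimates H(Y | X) in bits from empirical counts, where X is a discrete feature column and Y is the class label. The function name `conditional_entropy` and the toy columns are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def conditional_entropy(features, labels):
    """Estimate H(Y | X) in bits from two equal-length sequences of discrete values.

    Illustrative helper only; the name and interface are assumptions,
    not part of SAGAD as published.
    """
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    n = len(features)
    if n == 0:
        return 0.0

    joint = Counter(zip(features, labels))  # counts of (x, y) pairs
    marginal = Counter(features)            # counts of x alone

    entropy = 0.0
    for (x, _y), count_xy in joint.items():
        p_xy = count_xy / n                    # empirical p(x, y)
        p_y_given_x = count_xy / marginal[x]   # empirical p(y | x)
        entropy -= p_xy * math.log2(p_y_given_x)
    return entropy


# Toy tabular column: the feature fully determines the label for "b",
# but leaves one bit of uncertainty for "a".
feature_column = ["a", "a", "b", "b"]
class_column = [0, 1, 1, 1]
h = conditional_entropy(feature_column, class_column)
```

A value near zero means the feature almost determines the class, while a larger value means the class remains uncertain once the feature is known.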