
SAGAD: Synthetic Data Generator for Tabular Datasets
Author(s) -
Henrique Matheus F. da Silva,
Rafael S. Pereira Silva,
Fábio Porto
Publication year - 2021
Language(s) - English
Resource type - Conference proceedings
DOI - 10.5753/sbbd.2021.17861
Subject(s) - computer science, machine learning, artificial intelligence, data augmentation, training set, conditional entropy, data mining
Abstract - The accuracy of machine learning models implementing classification tasks is strongly dependent on the quality of the training dataset. This is a challenge for domains where data is not abundant, such as personalized medicine, or where data is imbalanced, as in the case of images of plant species, where some species have very few samples while others offer a large number of samples. In both scenarios, the resulting models tend to offer poor performance. In this paper we present two techniques to address this challenge. First, we present a data augmentation method called SAGAD, based on conditional entropy. SAGAD can balance minority classes while increasing the overall size of the training set. In our experiments, applying SAGAD to small-data problems with different machine learning algorithms yielded significant performance improvements. We additionally present an extension of SAGAD for iterative learning algorithms, called DABEL, which generates new samples at each epoch using an optimization approach that continuously improves the model's performance. The adoption of SAGAD and DABEL consistently extends the training dataset, improving target classification performance.
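
Note: this record does not include the SAGAD algorithm itself. As a rough, non-authoritative sketch of the general idea the abstract describes (conditional-entropy-guided augmentation of a minority class in a tabular dataset), the Python code below perturbs real minority-class rows, scaling the noise per feature by an estimate of the conditional entropy H(Y|X), so that features more informative about the class are disturbed less. The function names (conditional_entropy, augment_minority), the binning scheme, and the noise model are illustrative assumptions, not the method published in the paper.

    import numpy as np
    from collections import Counter

    def conditional_entropy(feature, labels, bins=10):
        # Estimate H(Y | X) for one discretized feature X and class labels Y.
        # NOTE: simple equal-width binning; an illustrative choice, not SAGAD's.
        x_binned = np.digitize(feature, np.histogram_bin_edges(feature, bins=bins))
        total = len(labels)
        h = 0.0
        for x_val in np.unique(x_binned):
            mask = x_binned == x_val
            n = mask.sum()
            p_x = n / total
            class_counts = Counter(labels[mask])
            h_y_given_x = -sum((c / n) * np.log2(c / n) for c in class_counts.values())
            h += p_x * h_y_given_x
        return h

    def augment_minority(X, y, minority_class, n_new, seed=None):
        # Generate n_new synthetic rows for the minority class by perturbing
        # real rows. Features with low H(Y|X) (more class-informative) get
        # proportionally less noise, preserving the class signal.
        rng = np.random.default_rng(seed)
        X_min = X[y == minority_class]
        ents = np.array([conditional_entropy(X[:, j], y) for j in range(X.shape[1])])
        scales = ents / (ents.max() + 1e-12)          # normalize to [0, 1]
        base = X_min[rng.integers(0, len(X_min), size=n_new)]
        noise = rng.normal(0.0, 1.0, size=base.shape) * X_min.std(axis=0) * scales
        X_new = base + noise
        y_new = np.full(n_new, minority_class)
        return np.vstack([X, X_new]), np.concatenate([y, y_new])

Usage would follow the abstract's setting: call augment_minority on the training split only, until the minority class reaches the desired share, then train the chosen classifier on the extended dataset.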