
SAGAD: Synthetic Data Generator for Tabular Datasets
Author(s) -
Henrique Matheus F. da Silva,
Rafael S. Pereira Silva,
Fábio Porto
Publication year - 2021
Language(s) - English
Resource type - Conference proceedings
DOI - 10.5753/sbbd.2021.17861
Subject(s) - computer science, machine learning, artificial intelligence, data augmentation, training set, conditional entropy, data mining
Abstract - The accuracy of machine learning models implementing classification tasks is strongly dependent on the quality of the training dataset. This is a challenge for domains where data is not abundant, such as personalized medicine, or where data is imbalanced, as in the case of images of plant species, where some species have very few samples while others offer a large number of samples. In both scenarios, the resulting models tend to offer poor performance. In this paper we present two techniques to address this challenge. First, we present a data augmentation method called SAGAD, based on conditional entropy. SAGAD can balance minority classes while increasing the overall size of the training set. In our experiments, applying SAGAD to small-data problems with different machine learning algorithms yielded significant performance improvements. We additionally present an extension of SAGAD for iterative learning algorithms, called DABEL, which generates new samples at each epoch using an optimization approach that continuously improves the model's performance. The adoption of SAGAD and DABEL consistently extends the training dataset, improving target classification performance.
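
Note: this record does not include the SAGAD algorithm itself. As a rough, non-authoritative sketch of the general idea the abstract describes (conditional-entropy-guided augmentation of a minority class in a tabular dataset), the Python code below perturbs real minority-class rows, scaling the noise per feature by an estimate of the conditional entropy H(Y|X), so that features more informative about the class are disturbed less. The function names (conditional_entropy, augment_minority), the binning scheme, and the noise model are illustrative assumptions, not the method published in the paper.

    import numpy as np
    from collections import Counter

    def conditional_entropy(feature, labels, bins=10):
        # Estimate H(Y | X) for one discretized feature X and class labels Y.
        # NOTE: simple equal-width binning; an illustrative choice, not SAGAD's.
        x_binned = np.digitize(feature, np.histogram_bin_edges(feature, bins=bins))
        total = len(labels)
        h = 0.0
        for x_val in np.unique(x_binned):
            mask = x_binned == x_val
            n = mask.sum()
            p_x = n / total
            class_counts = Counter(labels[mask])
            h_y_given_x = -sum((c / n) * np.log2(c / n) for c in class_counts.values())
            h += p_x * h_y_given_x
        return h

    def augment_minority(X, y, minority_class, n_new, seed=None):
        # Generate n_new synthetic rows for the minority class by perturbing
        # real rows. Features with low H(Y|X) (more class-informative) get
        # proportionally less noise, preserving the class signal.
        rng = np.random.default_rng(seed)
        X_min = X[y == minority_class]
        ents = np.array([conditional_entropy(X[:, j], y) for j in range(X.shape[1])])
        scales = ents / (ents.max() + 1e-12)          # normalize to [0, 1]
        base = X_min[rng.integers(0, len(X_min), size=n_new)]
        noise = rng.normal(0.0, 1.0, size=base.shape) * X_min.std(axis=0) * scales
        X_new = base + noise
        y_new = np.full(n_new, minority_class)
        return np.vstack([X, X_new]), np.concatenate([y, y_new])

Usage would follow the abstract's setting: call augment_minority on the training split only, until the minority class reaches the desired share, then train the chosen classifier on the extended dataset.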