Open Access
Data Augmentation for Text Classification Using Autoencoders
Author(s) -
Mustafa Cataltas,
Ilyas Cicekli,
Nurdan Akhan Baykan
Publication year - 2025
Publication title -
IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3610157
Subject(s) - aerospace , bioengineering , communication, networking and broadcast technologies , components, circuits, devices and systems , computing and processing , engineered materials, dielectrics and plasmas , engineering profession , fields, waves and electromagnetics , general topics for engineers , geoscience , nuclear engineering , photonics and electrooptics , power, energy and industry applications , robotics and control systems , signal processing and analysis , transportation
Deep learning models have greatly improved various natural language processing tasks. However, their effectiveness depends on large datasets, which can be difficult to acquire. To mitigate this challenge, data augmentation techniques artificially expand the training data by generating synthetic samples. By enriching the dataset, data augmentation enhances model generalization, reduces overfitting, and improves model performance. This paper investigates the effectiveness of autoencoder-based text data augmentation for improving text classification models. The research compares four types of autoencoders: the traditional Autoencoder (AE), the Adversarial Autoencoder (AAE), the Denoising Adversarial Autoencoder (DAAE), and the Variational Autoencoder (VAE). Basic text preprocessing steps (lowercasing, removal of non-alphanumeric characters, and removal of stop words) are applied to all documents. Additionally, label-based filtering is applied: autoencoder outputs whose labels contradict the predictions of BERT are discarded. The experiments are conducted on the SST-2 sentiment classification dataset, which consists of 7,791 training instances and 1,821 test instances. To better analyze the impact of the data augmentation methods, experiments are also performed on smaller subsets of 100, 200, 400, and 1,000 instances, with augmentation applied at ratios of 1:1, 1:2, and 1:4 for these subsets. The results demonstrate that AE-based data augmentation, particularly at a 1:1 ratio, achieves better accuracy than the baseline models. This underscores the potential of autoencoders for improving text classification outcomes in NLP tasks.
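The preprocessing and filtering steps described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stop-word list is a small placeholder (the paper does not specify which list was used), and `classifier` stands in for the BERT model used in the label-based filtering step; any callable with the same interface would work here.

```python
import re

# Placeholder stop-word list; the actual list used in the paper is not specified.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were",
              "of", "to", "and", "in", "this", "it"}

def preprocess(text: str) -> str:
    """Basic preprocessing as described: lowercasing, removal of
    non-alphanumeric characters, and removal of stop words."""
    text = text.lower()
    # Replace anything that is not a lowercase letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)

def label_filter(samples, classifier, expected_label):
    """Label-based filtering: keep only synthetic samples whose predicted
    label matches the intended label. In the paper the predictions come
    from BERT; `classifier` here is a hypothetical stand-in callable."""
    return [s for s in samples if classifier(s) == expected_label]
```

For example, `preprocess("This movie is GREAT!!!")` yields `"movie great"`, and `label_filter` would drop an augmented "positive" sample that a sentiment classifier predicts as negative before it is added to the training set.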
