Open Access
A Novel Data Augmentation Framework for Arabic Multi-label Text Classification using AraBART, AraGPT2, and Borderline-SMOTE
Author(s) - Samia F. Abd-hood, Nazlia Omar, Sabrina Tiun
Publication year - 2025
Publication title - IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3609462
Subject(s) - aerospace , bioengineering , communication, networking and broadcast technologies , components, circuits, devices and systems , computing and processing , engineered materials, dielectrics and plasmas , engineering profession , fields, waves and electromagnetics , general topics for engineers , geoscience , nuclear engineering , photonics and electrooptics , power, energy and industry applications , robotics and control systems , signal processing and analysis , transportation
Data Augmentation (DA) techniques offer Natural Language Processing (NLP) a way to address class imbalance and data scarcity. The current solutions for class imbalance, namely random oversampling and random under-sampling, have well-known drawbacks: oversampling leads to overfitting because it replicates samples, whilst under-sampling loses information because it removes them. Meanwhile, traditional DA techniques, including paraphrasing, rule-based, and noising approaches, require strong lexicons. These techniques are also either time-consuming or introduce noise, resulting in incorrect syntactic and semantic contexts. Hence, this paper proposes a novel DA framework for Arabic news articles to address the prevailing challenges in Arabic Multi-Label Text Classification (AMLTC). The proposed framework consists of three phases: abstractive summarization using the Arabic Bidirectional and Auto-Regressive Transformer (AraBART) model to introduce new features, data generation with the Arabic Generative Pre-trained Transformer (AraGPT2) to create diverse and contextual texts, and data balancing using the Borderline Synthetic Minority Over-Sampling Technique (Borderline-SMOTE) to achieve an optimal balance. Each phase was evaluated to ensure the quality of the augmented data. Furthermore, a Bidirectional Long Short-Term Memory (BiLSTM) model was trained to assess the performance of the augmented dataset (augDS) on a multi-label Arabic RTN news dataset. The results demonstrated that the proposed framework effectively addressed the class imbalance problem while preserving data integrity, and that it significantly improved Multi-Label Text Classification (MLTC) performance compared to the non-augmented dataset (non-augDS). Specifically, the F1-score increased from 0.54 on the original dataset to 0.90 after augmentation. Overall, this study demonstrates that the proposed framework addresses these issues in Arabic datasets by generating diverse, novel augmented data, thereby enhancing MLTC performance.
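The balancing phase can be illustrated with a minimal sketch of the Borderline-SMOTE idea (the borderline-1 variant). The abstract does not describe the authors' implementation, so everything here is an assumption made for illustration: the function names `danger_points` and `borderline_smote`, the parameters `m` and `k`, the toy 2-D points, and the simplification to a single binary label. In the paper's setting the inputs would presumably be feature vectors derived from Arabic text with multiple labels.

```python
import math
import random

def danger_points(X, y, minority, m=5):
    """Indices of minority samples whose m nearest neighbours are at least
    half, but not all, majority samples (the "borderline" region)."""
    danger = []
    for i, label in enumerate(y):
        if label != minority:
            continue
        rest = [j for j in range(len(X)) if j != i]
        neigh = sorted(rest, key=lambda j: math.dist(X[i], X[j]))[:m]
        n_major = sum(1 for j in neigh if y[j] != minority)
        # All-majority neighbourhoods are treated as noise and skipped.
        if len(neigh) / 2 <= n_major < len(neigh):
            danger.append(i)
    return danger

def borderline_smote(X, y, minority, m=5, k=2, n_new=None, seed=0):
    """Return synthetic minority points interpolated between a borderline
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    if n_new is None:
        # Hypothetical default: generate enough points to equalise the classes.
        n_new = max(0, (len(y) - len(min_idx)) - len(min_idx))
    danger = danger_points(X, y, minority, m)
    if not danger or len(min_idx) < 2:
        return []
    synthetic = []
    for _ in range(n_new):
        i = rng.choice(danger)
        peers = [j for j in min_idx if j != i]
        near = sorted(peers, key=lambda j: math.dist(X[i], X[j]))[:k]
        j = rng.choice(near)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
    return synthetic

# Toy data: five minority points (label 1) and eight majority points (label 0).
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 0.5],
     [2.5, 0.0], [2.5, 1.0], [3.0, 0.5], [3.5, 0.0], [3.5, 1.0],
     [4.0, 0.5], [4.5, 0.0], [4.5, 1.0]]
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

borderline = danger_points(X, y, minority=1, m=5)
new_points = borderline_smote(X, y, minority=1, m=5, k=2)
print(borderline, len(new_points))
```

In this toy set only the minority point at (2.0, 0.5) sits next to the majority cluster, so all synthetic points are placed on the segments joining it to its two nearest minority neighbours. Unlike random oversampling, no existing sample is duplicated, which is the overfitting concern the abstract raises.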
