
Low-resource noisy transliteration normalization using large-scale language model
Author(s) -
Zolzaya Byambadorj,
Ulziibayar Sonom-Ochir,
Munkhsukh Enkhbayar,
Hyun-chul Kim,
Altangerel Ayush
Publication year - 2025
Publication title -
IEEE Access
Language(s) - English
Resource type - Magazines
SCImago Journal Rank - 0.587
H-Index - 127
eISSN - 2169-3536
DOI - 10.1109/access.2025.3574933
Subject(s) - aerospace, bioengineering, communication, networking and broadcast technologies, components, circuits, devices and systems, computing and processing, engineered materials, dielectrics and plasmas, engineering profession, fields, waves and electromagnetics, general topics for engineers, geoscience, nuclear engineering, photonics and electrooptics, power, energy and industry applications, robotics and control systems, signal processing and analysis, transportation
Transliteration normalization is a crucial task for low-resource languages, particularly Mongolian, where noisy social media text presents significant challenges. Frequent use of non-standard transliteration can contribute to the gradual erosion of linguistic knowledge, especially among young users, making it harder for them to maintain proficiency in their native language. Developing robust methods for normalizing such text is therefore essential. In this paper, we propose a novel approach that leverages large-scale neural language models, specifically GPT-2, to normalize noisy transliterated Mongolian text. Our study explores a data-driven approach that draws on word pairs, sentence pairs, and synthetic data to enhance model performance. To further improve accuracy, we introduce a post-processing module that integrates edit-distance-based corrections with a context-aware ranking mechanism using the Mongolian BERT model. Experimental results demonstrate that our approach (M10: 16.42%) improves overall accuracy by approximately 4.91% and achieves a 10.44% increase in out-of-vocabulary (OOV) word normalization compared with baseline models. These results indicate that the approach is effective for normalizing noisy transliterated text under low-resource conditions.
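The post-processing idea in the abstract (edit-distance-based correction followed by context-aware ranking) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `max_dist` threshold, the tiny vocabulary, and the `toy_score` function are all assumptions made for illustration. In particular, `toy_score` is a hypothetical stand-in for a masked language model such as Mongolian BERT, which the abstract names but does not specify in detail.

```python
from typing import Callable, List, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def candidates(word: str, vocab: Sequence[str], max_dist: int = 2) -> List[str]:
    """Vocabulary entries within max_dist edits of word, nearest first."""
    scored = sorted((edit_distance(word, v), v) for v in vocab)
    return [v for d, v in scored if d <= max_dist]


def correct(word: str, left: List[str], right: List[str],
            vocab: Sequence[str],
            score: Callable[[List[str], str, List[str]], float],
            max_dist: int = 2) -> str:
    """Pick the candidate that the context scorer ranks highest.

    `score` is a hypothetical hook: in a real system it might return a
    masked-LM probability for the candidate in its sentence context.
    """
    cands = candidates(word, vocab, max_dist)
    if not cands:
        return word  # nothing close enough; leave the token unchanged
    return max(cands, key=lambda c: score(left, c, right))


def toy_score(left: List[str], cand: str, right: List[str]) -> float:
    """Assumed stand-in scorer: rewards one known word pair only."""
    known_pairs = {("сайн", "байна")}
    return 1.0 if right and (cand, right[0]) in known_pairs else 0.0


if __name__ == "__main__":
    vocab = ["сайн", "сайд", "байна", "уу"]
    print(correct("сайа", [], ["байна", "уу"], vocab, toy_score))
```

Separating candidate generation from ranking keeps the scorer swappable: the edit-distance step narrows the search to a handful of in-vocabulary words, and only those candidates need to be scored in context.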