Automatic Diacritics Restoration for Dialectal Arabic Text
Author(s) - Ayman A. Zayyan, Mohamed Elmahdy, Husniza Husni, Jihad Jaam
Publication year - 2016
Publication title - International Journal of Computing and Information Sciences
Language(s) - English
Resource type - Journals
eISSN - 1708-0479
pISSN - 1708-0460
DOI - 10.21700/ijcis.2016.119
Subject(s) - Arabic, linguistics, natural language processing, artificial intelligence, computer science, philosophy
This paper addresses the problem of missing diacritic marks in most written dialectal Arabic resources. Our aim is to implement a scalable and extensible platform that automatically restores diacritic marks in undiacritized dialectal Arabic text. Several rule-based and statistical techniques are proposed, including maximum likelihood estimation and statistical n-gram models. The platform also includes helper tools for text pre-processing and encoding conversion. The diacritization accuracy of each technique is evaluated in terms of Diacritic Error Rate (DER) and Word Error Rate (WER). The approach trains several n-gram models on different lexical units, using a data pool that combines Modern Standard Arabic (MSA) data with dialectal Arabic data.
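To make the abstract's terms concrete, the following is a minimal sketch of a word-level maximum likelihood diacritizer together with DER and WER scoring. It is not the authors' platform: the function names, the toy training words, and the exact DER definition (errors counted per base letter) are assumptions made for illustration, and the n-gram models mentioned in the abstract are not implemented here.

```python
from collections import Counter, defaultdict

# Arabic diacritic marks: U+064B (fathatan) through U+0652 (sukun).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}

def strip_diacritics(word):
    """Remove all diacritic marks, leaving only base letters."""
    return "".join(ch for ch in word if ch not in DIACRITICS)

def split_marks(word):
    """Return a list of (base_letter, sorted_marks) pairs for one word."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS:
            if pairs:  # attach the mark to the preceding base letter
                base, marks = pairs[-1]
                pairs[-1] = (base, "".join(sorted(marks + ch)))
        else:
            pairs.append((ch, ""))
    return pairs

def train_mle(diacritized_words):
    """Map each bare word to its most frequent diacritized form."""
    counts = defaultdict(Counter)
    for word in diacritized_words:
        counts[strip_diacritics(word)][word] += 1
    return {bare: forms.most_common(1)[0][0] for bare, forms in counts.items()}

def diacritize(bare_words, model):
    """Look up each bare word; unseen words are returned unchanged."""
    return [model.get(word, word) for word in bare_words]

def error_rates(reference, hypothesis):
    """Return (DER, WER): share of letters and of words with wrong marks."""
    letters = letter_errors = word_errors = 0
    for ref, hyp in zip(reference, hypothesis):
        ref_pairs, hyp_pairs = split_marks(ref), split_marks(hyp)
        if [b for b, _ in ref_pairs] != [b for b, _ in hyp_pairs]:
            raise ValueError("reference and hypothesis differ in base letters")
        errors = sum(r != h for (_, r), (_, h) in zip(ref_pairs, hyp_pairs))
        letters += len(ref_pairs)
        letter_errors += errors
        word_errors += errors > 0
    der = letter_errors / letters if letters else 0.0
    wer = word_errors / len(reference) if reference else 0.0
    return der, wer

# Toy data (hypothetical), written with escapes to keep the marks explicit.
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"   # kataba, "he wrote"
KUTUBUN = "\u0643\u064F\u062A\u064F\u0628\u064C"  # kutubun, "books"
ILMUN = "\u0639\u0650\u0644\u0652\u0645\u064C"    # ilmun, "knowledge"

model = train_mle([KATABA, KATABA, KUTUBUN, ILMUN])
reference = [KUTUBUN, ILMUN]
hypothesis = diacritize([strip_diacritics(w) for w in reference], model)
der, wer = error_rates(reference, hypothesis)
print(f"DER={der:.2f} WER={wer:.2f}")  # prints DER=0.50 WER=0.50
```

In this toy run the bare form shared by "kataba" and "kutubun" is always resolved to the more frequent "kataba", so one of the two test words is wrong (WER 0.5) and three of its letters carry the wrong mark out of six letters overall (DER 0.5). This kind of ambiguity is what context-aware n-gram models are meant to reduce.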