Corpus-based technique for improving Arabic OCR system
Author(s) -
Ahmed H. Aliwy,
Basheer Al-Sadawi
Publication year - 2021
Publication title -
Indonesian Journal of Electrical Engineering and Computer Science
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.241
H-Index - 17
eISSN - 2502-4760
pISSN - 2502-4752
DOI - 10.11591/ijeecs.v21.i1.pp233-241
Subject(s) - computer science , optical character recognition , natural language processing , character (mathematics) , arabic , word (group theory) , artificial intelligence , sentence , context (archaeology) , speech recognition , process (computing) , span (engineering) , image (mathematics) , linguistics , engineering , mathematics , programming language , paleontology , philosophy , civil engineering , geometry , biology
Optical character recognition (OCR) is the process of converting images of text documents into editable, searchable text. OCR poses several challenges, particularly for Arabic, where it yields a high percentage of errors. This paper proposes a method for improving the output of Arabic optical character recognition (AOCR) systems, based on a statistical language model built from the large corpora that are available. The method detects and corrects both non-word and real-word errors according to the context of the word in the sentence. The results show that the method improves the AOCR output, reaching a new accuracy of up to 98%.
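The abstract does not give the details of the language model, so the following is only a minimal sketch of the general idea of context-based post-correction, not the authors' implementation. It assumes a bigram model with add-one smoothing, candidate words taken from the corpus vocabulary within Levenshtein distance 1, and a hypothetical `margin` threshold for deciding when a valid word should be treated as a real-word error. The class name `BigramCorrector`, its parameters, and the toy English corpus are all illustrative assumptions; a real system would be trained on a large Arabic corpus.

```python
import math
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


class BigramCorrector:
    """Toy context-sensitive corrector (hypothetical, for illustration only)."""

    def __init__(self, sentences, max_dist=1, margin=1.0):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in sentences:
            toks = ["<s>"] + sentence.split() + ["</s>"]
            self.unigrams.update(toks)
            self.bigrams.update(zip(toks, toks[1:]))
        self.vocab = set(self.unigrams) - {"<s>", "</s>"}
        self.max_dist = max_dist   # assumed candidate search radius
        self.margin = margin       # assumed log-probability gain required

    def logp(self, prev, word):
        # Add-one smoothed bigram log-probability log P(word | prev).
        v = len(self.unigrams)
        return math.log((self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + v))

    def score(self, prev, word, nxt):
        # How well `word` fits between its left and right neighbours.
        return self.logp(prev, word) + self.logp(word, nxt)

    def candidates(self, word):
        return [w for w in self.vocab if edit_distance(word, w) <= self.max_dist]

    def correct(self, sentence):
        toks = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(1, len(toks) - 1):
            prev, word, nxt = toks[i - 1], toks[i], toks[i + 1]
            cands = self.candidates(word)
            if not cands:
                continue  # nothing close enough in the vocabulary
            best = max(cands, key=lambda w: (self.score(prev, w, nxt), w))
            if word not in self.vocab:
                toks[i] = best  # non-word error: always replace
            elif self.score(prev, best, nxt) - self.score(prev, word, nxt) > self.margin:
                toks[i] = best  # real-word error: replace only on a clear gain
        return " ".join(toks[1:-1])


TOY_CORPUS = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "the cat ate the fish",
    "the car is on the road",
    "the dog sat on the mat",
]

model = BigramCorrector(TOY_CORPUS)
print(model.correct("the cat sat on the mqt"))  # non-word error "mqt"
print(model.correct("the car sat on the mat"))  # real-word error "car"
print(model.correct("the car is on the road"))  # already correct
```

On this toy corpus, the first two calls both return "the cat sat on the mat", and the third sentence is left unchanged because "car" fits its context at least as well as any nearby word. The `margin` value controls the trade-off for real-word errors: a lower value corrects more aggressively but risks changing words that were recognized correctly.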