
Corpus-based technique for improving Arabic OCR system
Author(s) - Ahmed H. Aliwy, Basheer Al-Sadawi
Publication year - 2021
Publication title - Indonesian Journal of Electrical Engineering and Computer Science
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.241
H-Index - 17
eISSN - 2502-4760
pISSN - 2502-4752
DOI - 10.11591/ijeecs.v21.i1.pp233-241
Subject(s) - computer science, optical character recognition, natural language processing, character (mathematics), arabic, word (group theory), artificial intelligence, sentence, context (archaeology), speech recognition, process (computing), span (engineering), image (mathematics), linguistics, engineering, mathematics, programming language, paleontology, philosophy, civil engineering, geometry, biology
Optical character recognition (OCR) is the process of converting images of text documents into editable and searchable text. OCR poses several challenges, particularly for the Arabic language, where it produces a high percentage of errors. In this paper, a method for improving the output of Arabic optical character recognition (AOCR) systems is proposed, based on a statistical language model built from the large corpora that are available. The method detects and corrects both non-word and real-word errors according to the context of the word in the sentence. The results show an improvement that brings the accuracy of the AOCR output up to 98%.
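The abstract does not say which language model or candidate-generation scheme the authors use, so the following is only a minimal sketch of the general idea, under several assumptions: a bigram model with add-one smoothing, candidates limited to vocabulary words within one character edit, and a hypothetical `margin` threshold for deciding when a valid word is still wrong in context. The tiny Latin-script corpus is a placeholder for a large Arabic corpus, and all function names are illustrative.

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """Add-one smoothed estimate of P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(unigrams))

def edits1(word, alphabet):
    """Strings one deletion, substitution, or insertion away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return deletes | replaces | inserts

def correct(tokens, unigrams, bigrams, margin=2.0):
    """Correct OCR tokens left to right using neighbouring-word context.

    margin is an assumed threshold: a valid word is replaced only if a
    candidate scores more than margin times better in its context.
    """
    vocab = set(unigrams) - {"<s>", "</s>"}
    alphabet = {ch for w in vocab for ch in w}
    out = list(tokens)
    for i, word in enumerate(out):
        prev = out[i - 1] if i > 0 else "<s>"
        nxt = out[i + 1] if i + 1 < len(out) else "</s>"

        def score(w):
            # Probability of w given its left neighbour, times the
            # probability of the right neighbour given w.
            return (bigram_prob(prev, w, unigrams, bigrams)
                    * bigram_prob(w, nxt, unigrams, bigrams))

        candidates = edits1(word, alphabet) & vocab
        if not candidates:
            continue  # nothing plausible nearby; leave the token alone
        best = max(sorted(candidates), key=score)
        if word not in vocab:
            out[i] = best  # non-word error: token is not in the corpus
        elif score(best) > margin * score(word):
            out[i] = best  # real-word error: valid word, unlikely here
    return out

# Placeholder corpus; a real system would train on a large Arabic corpus.
corpus = [
    "the cat sat on the mat".split(),
    "the cat ate the fish".split(),
    "the dog sat on the rug".split(),
    "a hat is on the table".split(),
]
unigrams, bigrams = train_bigram(corpus)
non_word_fixed = correct("the cqt sat on the mat".split(), unigrams, bigrams)
real_word_fixed = correct("the cat mat on the mat".split(), unigrams, bigrams)
print(" ".join(non_word_fixed))
print(" ".join(real_word_fixed))
```

In the first call the token `cqt` is not in the vocabulary, so it is treated as a non-word error. In the second call `mat` is a valid word, and only the surrounding context marks it as a likely real-word error. A production system would need Arabic-specific tokenization and normalization, which this sketch leaves out.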