A Hybrid Technique for Cleaning Missing and Misspelling Arabic Data in Data Warehouse



Open Access


Author(s) - Mohammed Abdullah Al-Hagery, Latifah A. Alreshoodi, Maram Abdullah Almutairi, Suha Ibrahim Al-Sharekh, Emtenan Saad Alkhowaiter

Publication year - 2019

Publication title - International Journal of Information Technology and Computer Science

Language(s) - English

Resource type - Journals

eISSN - 2074-9015

pISSN - 2074-9007

DOI - 10.5815/ijitcs.2019.07.03

Subject(s) - computer science, Arabic, missing data, data quality, natural language processing, consistency (knowledge bases), unification, task (project management), decision tree, data mining, Modern Standard Arabic, data warehouse, artificial intelligence, machine learning, linguistics, programming language, service (business), philosophy, economy, management, economics

Real-world datasets accumulated over many years tend to be incomplete, inconsistent, and noisy, which in turn makes data warehouses inconsistent. Data owners hold hundreds of millions to billions of records written in different languages, so the need for comprehensive, efficient techniques that maintain data consistency and raise data quality keeps growing. Data cleaning is known to be a complex and difficult task, especially for data written in Arabic, a complex language in which many types of unclean data can occur: for example, missing values, dummy values, redundant values, inconsistent values, misspellings, and noisy data. The ultimate goal of this paper is to improve data quality by cleaning Arabic datasets of these errors, producing data that supports better analysis and highly accurate results. This, in turn, leads to the discovery of correct knowledge patterns and to accurate decision-making. The approach is built by merging different algorithms, which ensures that reliable methods are used for data cleansing. It cleans Arabic datasets through multi-level cleaning using the Arabic Misspelling Detection and Correction Model (AMDCM) and Decision Tree Induction (DTI). It can solve the problems of Arabic misspelling, cryptic values, dummy values, and unification of naming styles. A sample of the data before and after cleaning is presented.
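
To make the kinds of errors listed in the abstract concrete, the following is a minimal Python sketch of one cleaning pass over a single column. It is not the paper's AMDCM: it swaps in generic string similarity from the standard library (`difflib`) for misspelling correction, and the vocabulary `VALID_CITIES`, the placeholder list behind `DUMMY_VALUES`, the `cutoff` of 0.6, and the function names are all hypothetical choices made for illustration. The Decision Tree Induction level is not sketched here.

```python
import difflib
import re

# Short-vowel marks (U+064B..U+0652) and tatweel (U+0640).
_MARKS = re.compile("[\u064B-\u0652\u0640]")


def normalize(text):
    """Unify common Arabic spelling variants so that values can be compared."""
    text = _MARKS.sub("", text.strip())
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # alef variants -> bare alef
    text = text.replace("\u0649", "\u064A")  # alef maqsura -> ya
    text = text.replace("\u0629", "\u0647")  # ta marbuta -> ha
    return text.lower()  # only affects Latin placeholders such as "N/A"


# Hypothetical reference vocabulary for one column (city names).
VALID_CITIES = ["الرياض", "جدة", "مكة", "بريدة", "الدمام"]

# Hypothetical placeholders treated as dummy values, stored in normalized form.
DUMMY_VALUES = {normalize(v) for v in ["", "-", "n/a", "xxx", "غير معروف", "لا يوجد"]}


def clean_value(raw, vocabulary, cutoff=0.6):
    """Return the canonical spelling, None for missing/dummy input,
    or the trimmed input unchanged when no close match exists."""
    if raw is None:
        return None
    norm = normalize(raw)
    if norm in DUMMY_VALUES:
        return None  # flag as missing so a later step can fill it
    # Map normalized spellings back to their canonical form.
    index = {normalize(v): v for v in vocabulary}
    if norm in index:
        return index[norm]
    close = difflib.get_close_matches(norm, list(index), n=1, cutoff=cutoff)
    return index[close[0]] if close else raw.strip()
```

Under these assumptions, `clean_value("الرياظ", VALID_CITIES)` maps the misspelled entry to the canonical `"الرياض"`, a placeholder such as `"غير معروف"` comes back as `None`, and a value with no close match in the vocabulary is returned unchanged rather than being forced onto the nearest entry.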
