Open Access
A Semi-Automated Record De-Duplication Technique for a Data Warehouse Environment
Author(s) -
Vaishali Wangikar,
Sachin N. Deshmukh,
Sunil Bhirud
Publication year - 2020
Publication title -
International Journal of Innovative Technology and Exploring Engineering
Language(s) - English
Resource type - Journals
ISSN - 2278-3075
DOI - 10.35940/ijitee.b6265.019320
Subject(s) - computer science, blocking (statistics), security token, correctness, key (lock), automation, data mining, domain (mathematical analysis), process (computing), computer network, programming language, computer security, engineering, mechanical engineering, mathematical analysis, mathematics
The quality of record de-duplication is a key factor in the decision-making process. Correct identification of duplicates in a dataset provides a strong foundation for inference. Blocking is a popular technique in de-duplication. In the traditional de-duplication process, the blocking key is chosen by a domain expert. In real-time systems, automating blocking key generation is a primary requirement. The objective of this paper is blocking key generation without any human intervention. The proposed Automated Token Formation (ATF) algorithm generates blocking keys in a fully automated way. For all datasets tested (Cora, Restaurant, and FEBRL), the attributes shortlisted by ATF are nearly the same as those chosen by the manual method. The tokens produced by ATF gave results 20% poorer than manual tokens on the Cora dataset, while on the other two datasets the results match those of manual tokens. To improve result quality, ATF is modified into the Semi-Automated Token Formation (SATF) algorithm, a semi-automated approach that requires training data. SATF performed better than both the manual tokens and the tokens produced by ATF.
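
For readers unfamiliar with blocking, the sketch below illustrates the general idea with a hand-chosen blocking key, which is the manual baseline the abstract refers to. It is not the ATF or SATF algorithm, whose details are not given here. The field names (`surname`, `city`), the prefix length, and the sample records are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record, fields=("surname", "city"), prefix_len=3):
    """Build a key from lowercase prefixes of the chosen attributes.

    The attribute list and prefix length are illustrative assumptions,
    standing in for the choice a domain expert would make.
    """
    parts = (str(record.get(f, "")).strip().lower()[:prefix_len] for f in fields)
    return "|".join(parts)


def candidate_pairs(records, key_fn=blocking_key):
    """Group records by blocking key and return index pairs within each block."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[key_fn(rec)].append(idx)

    pairs = set()
    for ids in blocks.values():
        # Only records that share a key are compared with each other.
        pairs.update(combinations(ids, 2))
    return pairs


# Hypothetical sample data, not taken from Cora, Restaurant, or FEBRL.
records = [
    {"surname": "Miller", "city": "Boston"},
    {"surname": "Miler", "city": "Boston"},
    {"surname": "Smith", "city": "London"},
    {"surname": "Smyth", "city": "London"},
]

pairs = candidate_pairs(records)
print(pairs)
```

With these records, blocking reduces the six possible comparisons to a single candidate pair (records 0 and 1). The likely duplicates "Smith" and "Smyth" land in different blocks and are never compared, which shows why the choice of blocking key strongly affects de-duplication quality.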
