z-logo
open-access-imgOpen Access
INVESTIGATION OF TECHNIQUES FOR EFFICIENT & ACCURATE INDEXING FOR SCALABLE RECORD LINKAGE & DEDUPLICATION
Author(s) -
Sunitha Yeddula,
K. Lakshmaiah
Publication year - 2015
Publication title -
international journal of computer and communication technology
Language(s) - English
Resource type - Journals
eISSN - 2231-0371
pISSN - 0975-7449
DOI - 10.47893/ijcct.2015.1275
Subject(s) - data deduplication , computer science , search engine indexing , scalability , matching (statistics) , record linkage , database , data mining , process (computing) , linkage (software) , information retrieval , population , demography , sociology , gene , biochemistry , statistics , chemistry , mathematics , operating system
Record linkage is the process of matching records from several databases that refer to the same entities. When applied on a single database, this process is known as deduplication. Increasingly, matched data are becoming important in many applications areas, because they can contain information that is not available otherwise, or that is too costly to acquire. Removing duplicate records in a single database is a crucial step in the data cleaning process, because duplicates can severely influence the outcomes of any subsequent data processing or data mining. With the increasing size of today’s databases, the complexity of the matching process becomes one of the major challenges for record linkage and deduplication. In recent years, various indexing techniques have been developed for record linkage and deduplication. They are aimed at reducing the number of record pairs to be compared in the matching process by removing obvious nonmatching pairs, while at the same time maintaining high matching quality. This paper presents a survey of variations of six indexing techniques. Their complexity is analyzed, and their performance and scalability is evaluated within an experimental framework using both synthetic and real data sets. These experiments highlight that one of the most important factors for efficient and accurate indexing for record linkage and deduplication is the proper definition of blocking keys.

The content you want is available to Zendy users.

Already have an account? Click here to sign in.
Having issues? You can contact us here