
A Technique for Big Data Deduplication based on Content Attributes and Dictionary Indexing
Author(s) -
Duaa S. Naji,
Loay E. George
Publication year - 2020
Publication title -
IOP Conference Series: Materials Science and Engineering
Language(s) - English
Resource type - Journals
eISSN - 1757-899X
pISSN - 1757-8981
DOI - 10.1088/1757-899X/928/3/032062
Subject(s) - data deduplication, backup, computer science, search engine indexing, hash function, data mining, set (abstract data type), field (mathematics), data set, big data, information retrieval, cluster analysis, database, artificial intelligence, mathematics, computer security, pure mathematics, programming language
In recent years, the rapid expansion of data such as text, images, audio, video, data-center content, and backups has created serious problems in both storage and retrieval, and companies spend considerable money storing this data. An efficient technique for handling enormous datasets has therefore become necessary. In this paper, we propose a new de-duplication scheme for the contents of big data sets. Divisors are selected automatically using the field separators, and different dictionary-indexing methods are used to de-duplicate field contents that have bounded variability. In addition, a set of computationally low-cost hash functions is used to speed up de-duplication of fields consisting of long strings. The number, nature, and length of the fields are tested. Furthermore, certain indexing and clustering methodologies are applied to determine the optimal way to decrease the data size before performing de-duplication.
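
As a rough sketch of the technique the abstract describes, the following Python fragment combines the two ideas named above: dictionary indexing for fields with bounded variability, and a computationally cheap hash for long-string fields. The field separator, the assignment of field indices to each treatment, the function name deduplicate, and the use of CRC32 as the low-cost hash are all illustrative assumptions; the paper's automated divisor selection and its specific hash set are not detailed in this record.

import zlib

def deduplicate(records, separator=",", dict_fields=(0,), hash_fields=(2,)):
    # One dictionary per bounded-variability field: value -> compact integer index.
    dictionaries = {i: {} for i in dict_fields}
    seen = set()
    unique = []
    for record in records:
        fields = record.split(separator)
        key = []
        for i, value in enumerate(fields):
            if i in dictionaries:
                # Dictionary indexing: replace the value with a small index,
                # assigning a new index the first time a value is seen.
                d = dictionaries[i]
                key.append(d.setdefault(value, len(d)))
            elif i in hash_fields:
                # Low-cost hash (CRC32 here, an assumption) so long strings
                # are compared by a 32-bit digest instead of character by character.
                key.append(zlib.crc32(value.encode("utf-8")))
            else:
                key.append(value)
        key = tuple(key)
        if key not in seen:  # keep only the first record carrying this key
            seen.add(key)
            unique.append(record)
    return unique, dictionaries

For example, deduplicate(["IQ,Baghdad,some long description", "IQ,Baghdad,some long description"]) returns a single record. Hashing the long fields trades a negligible collision risk for much cheaper key comparison, which matches the speed motivation stated in the abstract.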