Open Access
Computational load reduction of fuzzy duplicate detection in large amounts of information
Author(s) -
Ekaterina Sharapova
Publication year - 2020
Publication title -
IOP Conference Series: Materials Science and Engineering
Language(s) - English
Resource type - Journals
eISSN - 1757-899X
pISSN - 1757-8981
DOI - 10.1088/1757-899x/734/1/012119
Subject(s) - completeness (order theory) , computer science , data mining , reduction (mathematics) , signature (topology) , fuzzy logic , set (abstract data type) , context (archaeology) , fuzzy set , information retrieval , artificial intelligence , mathematics , programming language , mathematical analysis , paleontology , geometry , biology
The paper addresses the detection of fuzzy duplicates of documents in large amounts of information at low computational cost. Existing methods give either low search completeness at low computational cost, or acceptable completeness at very high computational cost. A combined method of detecting fuzzy duplicates is proposed. First, the whole set of documents is searched for similar texts with the help of signatures; then the texts found in this way are compared in detail using context analysis methods. Specifically, the method first performs an approximate search for similar documents using a signature of description words with a low match threshold. A detailed search for matches among the previously found documents is then performed using the shingles method.
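The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the abstract does not say how the signature is built or which thresholds are used, so the signature definition (the most frequent words longer than three characters), the Jaccard similarity measure, the word-level shingle length, and both threshold values are assumptions chosen for illustration. All function names are hypothetical. For simplicity the first stage still compares every pair of signatures; a large collection would more likely use an index over signature words.

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def signature(text, size=5):
    """Assumed signature: the `size` most frequent words longer than 3 characters.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(w for w in tokenize(text) if len(w) > 3)
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return frozenset(ranked[:size])


def shingles(text, k=3):
    """Set of overlapping word k-grams (shingles) of the text."""
    words = tokenize(text)
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_fuzzy_duplicates(docs, sig_threshold=0.2, shingle_threshold=0.5, k=3):
    """Return (i, j, score) for document pairs that pass both stages.

    Stage 1: cheap signature comparison with a low threshold (assumed 0.2)
             selects candidate pairs.
    Stage 2: shingle comparison is run only on those candidates.
    """
    sigs = [signature(d) for d in docs]
    candidates = [
        (i, j)
        for i, j in combinations(range(len(docs)), 2)
        if jaccard(sigs[i], sigs[j]) >= sig_threshold
    ]
    result = []
    for i, j in candidates:
        score = jaccard(shingles(docs[i], k), shingles(docs[j], k))
        if score >= shingle_threshold:
            result.append((i, j, score))
    return result
```

The low first-stage threshold keeps completeness acceptable by letting loosely similar documents through, while the more expensive shingle comparison is limited to the candidate pairs.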