Methodology for the Assessment of the Text Similarity of Documents in the CORE Open Access Data Set of Scholarly Documents
Author(s) - Ivan Kovačič, David Bajs, Milan Ojsteršek
Publication year - 2021
Language(s) - English
Resource type - Conference proceedings
DOI - 10.18690/978-961-286-516-0.12
Subject(s) - computer science, substring, metadata, information retrieval, set (abstract data type), hash function, pairwise comparison, similarity (geometry), core (optical fiber), cosine similarity, data set, data mining, cluster analysis, artificial intelligence, world wide web, programming language, image (mathematics), telecommunications
This paper describes the methodology for data preparation and text similarity analysis required for plagiarism detection on the CORE data set. In the first step, we used the CrossREF API and the Microsoft Academic Graph data set to enrich metadata and eliminate duplicate documents from the CORE 2018 data set. In the second step, we extracted 4-gram sequences of words from every document and transformed them into SHA-256 hash values. The features retrieved using the hashing algorithm are compared, and the result is a list of documents and the percentages of coverage between the features of each pair of documents. In the third step, called pairwise feature-based exhaustive analysis, pairs of documents are checked using the longest common substring.
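
The second and third steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the lowercase whitespace tokenization, the definition of coverage as the share of one document's hashed 4-grams that also occur in the other, and the use of a word-level (rather than character-level) longest common substring are all assumptions made for illustration.

```python
import hashlib


def word_ngram_hashes(text, n=4):
    """Return the set of SHA-256 hex digests of all word n-grams in text.

    Assumption: words are produced by lowercasing and splitting on whitespace.
    """
    words = text.lower().split()
    return {
        hashlib.sha256(" ".join(words[i:i + n]).encode("utf-8")).hexdigest()
        for i in range(len(words) - n + 1)
    }


def coverage(features_a, features_b):
    """Percentage of features in A that also appear in B (assumed definition)."""
    if not features_a:
        return 0.0
    return 100.0 * len(features_a & features_b) / len(features_a)


def longest_common_word_run(text_a, text_b):
    """Longest run of consecutive words shared by both texts.

    Classic dynamic-programming longest common substring, applied to word
    sequences instead of characters (an assumption of this sketch).
    """
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    best_len, best_end = 0, 0
    prev = [0] * (len(words_b) + 1)
    for i in range(1, len(words_a) + 1):
        curr = [0] * (len(words_b) + 1)
        for j in range(1, len(words_b) + 1):
            if words_a[i - 1] == words_b[j - 1]:
                # Extend the common run ending at the previous word pair.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return " ".join(words_a[best_end - best_len:best_end])
```

In a pipeline of this shape, the inexpensive hash-set comparison would presumably serve to shortlist candidate pairs, so that the quadratic-time substring check runs only on pairs whose coverage is high enough to be of interest.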
