z-logo
open-access-imgOpen Access
Arabic Text Copy Detection using Full, Reduced and Unique Syntactical Structures
Author(s) -
Mohamed Taybe
Publication year - 2016
Publication title -
international journal of computer applications
Language(s) - English
Resource type - Journals
ISSN - 0975-8887
DOI - 10.5120/ijca2016912088
Subject(s) - computer science , arabic , natural language processing , artificial intelligence , information retrieval , linguistics , philosophy
This paper reports on work performed to investigate the use of a combined Part of Speech (POS) tagging and a minimum edit operations algorithm to determine the level of similarity between pairs of Arabic text documents. The level of similarity can be used as an indication of duplication in full or in part of the document's content. Text is first converted into POS tags that are then fed to the string similarity algorithm to determine the similarity of pairs of documents. A normalized score is calculated and used to rank documents. Documents ranked higher than some selected threshold are considered similar and can be near or complete duplicate. The performed experiments compare results based on the use of a set of selected common subsequences that are the results of translation of text into a sequence of syntactical units. The strings are first produced using full-text (FULL). These are further refined to produce a REDUCED; where repeated consecutive characters are reduced to a single character and a number, and more refined to produce a UNIQUE string; where all repeating characters are replaced by a single character. Syntactical features of the text were used as a structural representation of the documents' content. Results obtained from the experiments using the

The content you want is available to Zendy users.

Already have an account? Click here to sign in.
Having issues? You can contact us here
Accelerating Research

Address

John Eccles House
Robert Robinson Avenue,
Oxford Science Park, Oxford
OX4 4GP, United Kingdom