Measuring Short Text Reuse for the Urdu Language

Sara Sameen, Muhammad Sharjeel, Rao Muhammad Adeel Nawab, Paul Rayson, Iqra Muneer


Open Access



Publication year - 2018

Publication title - IEEE Access

Language(s) - English

Resource type - Journals

SCImago Journal Rank - 0.587

H-Index - 127

ISSN - 2169-3536

DOI - 10.1109/access.2017.2776842

Subject(s) - aerospace; bioengineering; communication, networking and broadcast technologies; components, circuits, devices and systems; computing and processing; engineered materials, dielectrics and plasmas; engineering profession; fields, waves and electromagnetics; general topics for engineers; geoscience; nuclear engineering; photonics and electrooptics; power, energy and industry applications; robotics and control systems; signal processing and analysis; transportation

Text reuse occurs when one borrows text (either verbatim or paraphrased) from an earlier written text. A large and increasing amount of digital text is readily available, making it simpler to reuse but more difficult to detect. As a result, automatic detection of text reuse has attracted the attention of the research community due to the wide variety of applications associated with it. To develop and evaluate automatic methods for text reuse detection, standard evaluation resources are required. In this paper, we propose one such resource for a significantly under-resourced language, Urdu, which is widely used in day-to-day communication and has a large digital footprint, particularly in the Indian subcontinent. Our proposed Urdu short text reuse corpus contains 2684 short Urdu text pairs, manually labeled as verbatim (496), paraphrased (1329), and independently written (859). In addition, we describe an evaluation of the corpus using various state-of-the-art text reuse detection methods in binary and multi-class classification settings with a set of four classifiers. The results show that character n-gram overlap with the J48 classifier outperforms the other methods on the Urdu short text reuse detection task.
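To make the notion of character n-gram overlap concrete, the following is a minimal Python sketch of one common way to score a pair of short texts. The abstract does not say which overlap measure, which n-gram length, or which preprocessing the authors used, so the function names, the default n = 3, the whitespace normalisation, and the choice of Jaccard and containment scores are all assumptions made for illustration. The Urdu sentences are made-up examples, not items from the corpus.

```python
def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of `text`.

    Runs of whitespace are collapsed to a single space first
    (an assumed preprocessing step, not taken from the paper).
    """
    normalised = " ".join(text.split())
    return {normalised[i:i + n] for i in range(len(normalised) - n + 1)}


def jaccard_overlap(a: str, b: str, n: int = 3) -> float:
    """Symmetric overlap: shared n-grams divided by all distinct n-grams."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0  # a text shorter than n characters has no n-grams
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def containment(source: str, derived: str, n: int = 3) -> float:
    """Asymmetric overlap: fraction of the derived text's n-grams found in the source."""
    grams_s, grams_d = char_ngrams(source, n), char_ngrams(derived, n)
    if not grams_d:
        return 0.0
    return len(grams_s & grams_d) / len(grams_d)


if __name__ == "__main__":
    # Illustrative sentences only ("this is a book" / "this is a new book").
    source = "یہ ایک کتاب ہے"
    derived = "یہ ایک نئی کتاب ہے"
    print("Jaccard:", jaccard_overlap(source, derived))
    print("Containment:", containment(source, derived))
```

Python strings are sequences of Unicode code points, so the same slicing works for Urdu script as for Latin text. In a setup like the one the abstract describes, scores of this kind would typically serve as features for a classifier such as J48. How the authors built their features is not stated here, so that step is left out.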
