Open Access

A Novel Approach for Near-Duplicate Detection of Web Pages using TDW Matrix

Author(s) - Midhun P Mathew, Shine N Das, T R Lakshmi Narayanan, Pramod K. Vijayaraghavan

Publication year - 2011

Publication title - International Journal of Computer Applications

Language(s) - English

Resource type - Journals

ISSN - 0975-8887

DOI - 10.5120/2374-3128

Subject(s) - computer science, matrix (chemical analysis), information retrieval, world wide web, chemistry, chromatography

The voluminous amount of web documents has weakened the performance and reliability of web search engines. The existence of near-duplicate data is an issue that accompanies the growing need to incorporate heterogeneous data. Web content mining faces serious problems because of duplicate and near-duplicate web pages. These pages either increase the index storage space or increase the serving costs, thereby irritating users. Near-duplicate detection has been recognized as an important task in plagiarism detection, spam detection and focused web crawling. Here we propose a novel idea for finding near-duplicates of an input web page in a huge repository. We propose a TDW matrix based algorithm with three phases: rendering, filtering and verification. The algorithm receives an input web page and a threshold in the rendering phase, applies prefix filtering and positional filtering in the filtering phase to reduce the size of the record set, and returns an optimal set of near-duplicate web pages in the verification phase after calculating their similarity. The experimental results show that our algorithm performs better on two benchmark measures, precision and recall, and reduces the size of the competing record set.
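
The filtering and verification phases described above can be illustrated with a minimal sketch. The abstract does not specify how the TDW matrix is built or which similarity measure is used, so this sketch makes its own assumptions: each page is reduced to a plain set of tokens, similarity is Jaccard similarity, and tokens are ordered by document frequency. All function names (`make_order`, `prefix_length`, `find_near_duplicates`) are hypothetical and are not taken from the paper.

```python
import math

EPS = 1e-9  # guards against floating-point error inside math.ceil


def make_order(records):
    """Return a sort key that puts rare tokens first (assumed global order)."""
    df = {}
    for tokens in records:
        for tok in set(tokens):
            df[tok] = df.get(tok, 0) + 1
    return lambda tok: (df.get(tok, 0), tok)


def prefix_length(size, threshold):
    """Prefix length for a Jaccard threshold: two records that meet the
    threshold must share at least one token within these prefixes."""
    return size - math.ceil(threshold * size - EPS) + 1


def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def find_near_duplicates(query, repository, threshold):
    """query: list of tokens; repository: dict mapping name -> list of tokens.
    Returns (name, similarity) pairs with similarity >= threshold."""
    key = make_order([query] + list(repository.values()))
    x = sorted(set(query), key=key)
    results = []
    if not x:
        return results
    px = prefix_length(len(x), threshold)
    for name, tokens in repository.items():
        y = sorted(set(tokens), key=key)
        if not y:
            continue
        py = prefix_length(len(y), threshold)
        # Prefix filtering: no shared prefix token means no match is possible.
        shared = set(x[:px]) & set(y[:py])
        if not shared:
            continue
        # Positional filtering: bound the overlap from the position of the
        # first shared token and compare it with the overlap required.
        first = min(shared, key=key)
        i, j = x.index(first), y.index(first)
        required = math.ceil(threshold / (1 + threshold) * (len(x) + len(y)) - EPS)
        if 1 + min(len(x) - i - 1, len(y) - j - 1) < required:
            continue
        # Verification: compute the exact similarity for surviving candidates.
        sim = jaccard(x, y)
        if sim >= threshold:
            results.append((name, sim))
    return sorted(results, key=lambda pair: -pair[1])
```

Calling `find_near_duplicates(query_tokens, repository, 0.7)` returns the surviving pages ordered by decreasing similarity. Only candidates that pass both filters reach the exact similarity computation, which is the sense in which filtering reduces the size of the competing record set.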
