
DeDuSERP: De-duplication in search engine result page
Author(s) -
Naresh Sharma,
Priti Dimri
Publication year - 2018
Publication title -
international journal of engineering and technology
Language(s) - English
Resource type - Journals
ISSN - 2227-524X
DOI - 10.14419/ijet.v7i2.8.10475
Subject(s) - computer science , information retrieval , search engine , timestamp , web page , world wide web , filter (signal processing) , web search engine , static web page , string (physics) , database , web search query , web development , mathematics , computer security , mathematical physics , computer vision
Web offers a new way of service provision by arranging different resources over the web. The most critical and prominent is web searches. The purpose of this research is to identify a subtype of De-Duplication. DeDuSERP is de-duplication in search engine result page. It restricts the showcasing of urls with duplicate or similar data and hence enhances the search result experience of any client. By duplicate results we mean different links containing the same content or information. To solve this problem, we have designed a filter between Search engine result page and indexed-ranked pages which we get from the search engine in response to the query of the searcher. This filter eliminates the duplicate links idiosyncratically and displays the unique results on the SERP for the searcher. We have performed the string to string comparison of web pages and if the content is 90% similar then we adjudge them as duplicates and then check their inventiveness of these duplicate links on the basis of timestamp. By this we mean then the web page crawled earlier is original. The process of comparison and timestamp matching is done using an open source apache API Commons IO 2.4.