z-logo
open-access-imgOpen Access
An Approach for Indexing Web Data Sources
Author(s) -
Saidi Imene,
Nait Bahloul Safia
Publication year - 2014
Publication title -
international journal of information technology and computer science
Language(s) - English
Resource type - Journals
eISSN - 2074-9015
pISSN - 2074-9007
DOI - 10.5815/ijitcs.2014.09.07
Subject(s) - computer science , search engine indexing , information retrieval , web application , world wide web
Web information sources such as forums, blogs, and news articles are becoming increasingly large and diverse. Even if advances in technology are helping to improve techniques for dealing with the large amounts of the generated data, such data sources are heterogeneous in structure (semi structured or unstructured sources) and nature (texts or images). Implementation of software solutions is then necessary to prepare data and access these sources in a homogenous way. In this paper we present an approach for indexing heterogeneous data sources. Our objective is to offer techniques for efficient indexing of web sources by storing only the necessary information. We propose automatic indexing for semi structured or unstructured sources (e.g., xml files, html files) and annotation for other sources (e.g., images, videos that exist within a page). We present our algorithms of indexing and propose the use of MapReduce model to build a scalable inverted index. Experiments on a real-world corpus show that our approach achieves a good performance.

The content you want is available to Zendy users.

Already have an account? Click here to sign in.
Having issues? You can contact us here
Accelerating Research

Address

John Eccles House
Robert Robinson Avenue,
Oxford Science Park, Oxford
OX4 4GP, United Kingdom