A random walk approach to sampling hidden databases
Author(s) -
Arjun Dasgupta,
Gautam Das,
Heikki Mannila
Publication year - 2007
Publication title -
uta researchcommons (university of texas arlington)
Language(s) - English
Resource type - Conference proceedings
DOI - 10.1145/1247480.1247550
Subject(s) - computer science , probabilistic logic , random walk , data mining , interface (matter) , sample (material) , probabilistic database , sampling (signal processing) , database , space (punctuation) , information retrieval , database theory , artificial intelligence , relational database , mathematics , statistics , chemistry , bubble , chromatography , filter (signal processing) , maximum bubble pressure method , parallel computing , computer vision , operating system
A large part of the data on the World Wide Web is hidden behind form-like interfaces. These interfaces interact with a hidden back-end database to provide answers to user queries. Generating a uniform random sample of this hidden database by using only the publicly available interface gives us access to the underlying data distribution. In this paper, we propose a random walk scheme over the query space provided by the interface to sample such databases. We discuss variants where the query space is visualized as a fixed and random ordering of attributes. We also propose techniques to further improve the sample quality by using a probabilistic rejection based approach. We conduct extensive experiments to illustrate the accuracy and efficiency of our techniques.
Accelerating Research
Robert Robinson Avenue,
Oxford Science Park, Oxford
OX4 4GP, United Kingdom
Address
John Eccles HouseRobert Robinson Avenue,
Oxford Science Park, Oxford
OX4 4GP, United Kingdom