AN EFFICIENT REPRESENTATION MODEL OF DISTANCE DISTRIBUTION BETWEEN UNCERTAIN OBJECTS
Author(s) -
Hung Edward,
Xiao Lurong,
Hung Regant Y.S.
Publication year - 2012
Publication title -
Computational Intelligence
Language(s) - English
Resource type - Journals
SCImago Journal Rank - 0.353
H-Index - 52
eISSN - 1467-8640
pISSN - 0824-7935
DOI - 10.1111/j.1467-8640.2012.00440.x
Subject(s) - range (aeronautics), computer science, outlier, uncertain data, cluster analysis, data mining, computation, pairwise comparison, object (grammar), distance measures, probability distribution, range query (database), variance (accounting), gaussian, algorithm, pattern recognition (psychology), artificial intelligence, mathematics, statistics, search engine, information retrieval, physics, quantum mechanics, materials science, accounting, sargable, web search query, business, composite material
In this paper, we consider the problem of efficiently computing the distance between uncertain objects. In many real-life applications, data such as sensor readings and weather forecasts are uncertain when they are collected or produced. An uncertain object has a probability distribution function (PDF) that represents the probability that it is actually located at a particular location. Fast and accurate distance computation between uncertain objects is important to many uncertain query evaluation tasks (e.g., range queries and nearest-neighbor queries) and uncertain data mining tasks (e.g., classification, clustering, and outlier detection). However, existing approaches involve distance computations between samples of two objects, which is computationally intensive. On one hand, it is expensive to calculate and store the actual distribution of the possible distance values between two uncertain objects. On the other hand, the expected distance (the weighted average of the pairwise distances among samples of two uncertain objects) provides very limited information and also restricts the definitions and usefulness of queries and mining tasks. In this paper, we propose several approaches to calculate the mean of the actual distance distribution and approximate its variance. Based on these, we suggest that the actual distance distribution can be approximated using a standard distribution such as a Gaussian or Gamma distribution. Experiments on real and synthetic data show that our approach produces an approximation in a very short time with acceptable accuracy (about 90%). We suggest that it is practical for the research community to define and develop more powerful queries and data mining tasks based on the distance distribution instead of the expected distance.
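To make the quantities in the abstract concrete, the following is a minimal sketch of the sample-based baseline the abstract describes as expensive, not the authors' proposed method. It assumes each uncertain object is given as a list of weighted samples, computes the expected distance and the variance of the distance distribution by enumerating all sample pairs, and then uses a Gaussian with those two moments to approximate the probability that the distance is within a range `r`. The function names (`distance_moments`, `gaussian_cdf`, `exact_cdf`) and the sample representation are assumptions introduced for illustration.

```python
import math
from itertools import product


def distance_moments(samples_a, samples_b):
    """Mean and variance of the distance between two uncertain objects.

    Each object is a list of (point, probability) pairs whose
    probabilities sum to 1 (an assumed representation).
    This brute-force pass visits every sample pair.
    """
    mean = 0.0
    second_moment = 0.0
    for (p, wp), (q, wq) in product(samples_a, samples_b):
        d = math.dist(p, q)   # Euclidean distance between two samples
        w = wp * wq           # joint probability, assuming independence
        mean += w * d
        second_moment += w * d * d
    # Guard against tiny negative values caused by rounding.
    variance = max(second_moment - mean * mean, 0.0)
    return mean, variance


def gaussian_cdf(x, mean, variance):
    """P(D <= x) under a Gaussian with the given mean and variance."""
    if variance == 0.0:
        return 1.0 if x >= mean else 0.0
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * variance)))


def exact_cdf(x, samples_a, samples_b):
    """P(D <= x) computed directly from all sample pairs, for comparison."""
    return sum(
        wp * wq
        for (p, wp), (q, wq) in product(samples_a, samples_b)
        if math.dist(p, q) <= x
    )


if __name__ == "__main__":
    # Two hypothetical uncertain objects with two equally likely samples each.
    obj_a = [((0.0, 0.0), 0.5), ((0.0, 1.0), 0.5)]
    obj_b = [((3.0, 0.0), 0.5), ((3.0, 1.0), 0.5)]

    mean, variance = distance_moments(obj_a, obj_b)
    r = 3.1
    print("expected distance:", mean)
    print("variance:", variance)
    print("Gaussian estimate of P(D <= r):", gaussian_cdf(r, mean, variance))
    print("sample-based P(D <= r):", exact_cdf(r, obj_a, obj_b))
```

The expected distance alone collapses the distribution to a single number, whereas the mean and variance together allow a probabilistic range query such as P(D <= r) to be estimated from two stored values per object pair. The pairwise loop costs time proportional to the product of the two sample counts, which is the cost the abstract's approaches aim to avoid.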