Probabilistic Similarity Join on Uncertain Data

Kriegel, Hans-Peter; Kunath, Peter; Pfeifle, Martin; Renz, Matthias

2006, Lecture Notes in Computer Science

Abstract

An important database primitive for commonly used feature databases is the similarity join. It combines two datasets, based on some similarity predicate, into one set that contains pairs of objects from the two original sets. In many application areas, such as sensor databases, location-based services, and face recognition systems, distances between objects have to be computed from vague and uncertain data. In this paper, we propose to express the similarity between two uncertain objects by probability density functions that assign a probability value to each possible distance value. Integrating these probabilistic distance functions directly into the join algorithms exploits the full information they provide. The resulting probabilistic similarity join assigns to each object pair a probability value indicating the likelihood that the pair belongs to the result set. As the computation of these probability values is very expensive, we introduce an efficient join processing strategy, using the distance-range join as an example. In a detailed experimental evaluation, we demonstrate the benefits of our probabilistic similarity join. The experiments show that we can achieve high-quality join results at rather low computational cost.

2 Related Work

In the past decade, much work has been done in the field of similarity join processing. Recently, some researchers have focused on query processing of uncertain data. However, to the best of our knowledge, no work has been done on join processing of uncertain data. In the following, we present related work on both topics: similarity join processing and query processing of uncertain data.
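The probabilistic distance-range join described in the abstract can be illustrated with a small sketch. It assumes that each uncertain object is represented by a finite list of equally weighted sample points, and it estimates the probability that two objects lie within distance eps of each other as the fraction of sample pairs that do. The function names, the threshold parameter tau, and the brute-force nested loop are illustrative assumptions, not the efficient join strategy proposed in the paper.

```python
from itertools import product
from math import dist


def match_probability(samples_a, samples_b, eps):
    """Estimate P(d(a, b) <= eps) from equally weighted samples (assumption)."""
    pairs = list(product(samples_a, samples_b))
    hits = sum(1 for p, q in pairs if dist(p, q) <= eps)
    return hits / len(pairs)


def probabilistic_range_join(R, S, eps, tau=0.0):
    """Return (i, j, prob) for every pair whose estimated probability exceeds tau."""
    result = []
    for i, a in enumerate(R):
        for j, b in enumerate(S):
            prob = match_probability(a, b, eps)
            if prob > tau:
                result.append((i, j, prob))
    return result


# Hypothetical toy data: each object is a list of 2-D sample points.
R = [[(0.0, 0.0), (1.0, 0.0)]]
S = [[(0.0, 0.5), (3.0, 0.0)], [(10.0, 10.0), (11.0, 10.0)]]
joined = probabilistic_range_join(R, S, eps=1.0)
print(joined)  # [(0, 0, 0.25)]
```

Only one of the four sample pairs between the first objects of R and S lies within distance 1.0, so that pair is reported with probability 0.25, while the far-away second object of S is not reported at all.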

Figures (7)

Fig. 3. Distance-range join based on the expected distance.

Fig. 4. Database Integration of Uncertain Data.

Fig. 7. Runtime performance for a varying number of sample clusters.

Xiang Lian

Proceedings of the VLDB Endowment, 2010

Set similarity join has played an important role in many real-world applications such as data cleaning, near-duplicate detection, and data integration. In these applications, set data often contains noise and is thus uncertain and imprecise. In this paper, we model such probabilistic set data on two uncertainty levels, namely the set level and the element level. Based on them, we investigate the problem of probabilistic set similarity join (PS²J) over two probabilistic set databases, under the possible worlds semantics. To process the PS²J operator efficiently, we first reduce the problem by condensing the possible worlds, and then propose effective pruning techniques, including Jaccard distance pruning, probability upper bound pruning, and aggregate pruning, which filter out false alarms among probabilistic set pairs with the help of indexes and our designed synopses. We demonstrate the PS²J processing performance through extensive experiments on both real and synthetic data.

$J_{dist}(r, s) \ge J_{dist}(piv_{r_i}, s) - J_{dist}(r, piv_{r_i}) \ge J_{dist}(piv_{r_i}, s) - L(r, piv_{r_i})$
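The first inequality in the bound above follows from the triangle inequality, since the Jaccard distance is a metric. The sketch below shows it for ordinary, deterministic sets only: a pivot set yields a lower bound on the distance between r and s, and a pair can be pruned when that bound already exceeds a distance threshold. The example sets, the threshold gamma, and the function names are hypothetical. The probabilistic part of the technique (possible worlds and the bound L) is not modelled here.

```python
def jaccard_distance(a, b):
    """Jaccard distance 1 - |a ∩ b| / |a ∪ b| between two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def pivot_lower_bound(dist_pivot_s, dist_r_pivot):
    """Triangle-inequality lower bound on the distance between r and s."""
    return dist_pivot_s - dist_r_pivot


# Hypothetical example sets and pivot.
r = {1, 2, 3, 4}
pivot = {1, 2, 3, 5}
s = {7, 8, 9}

lower = pivot_lower_bound(jaccard_distance(pivot, s), jaccard_distance(r, pivot))
gamma = 0.5  # assumed distance threshold of the join predicate
can_prune = lower > gamma  # r and s cannot be within gamma of each other
print(lower, can_prune)
```

In a join, the distances from each set to the pivot would typically be precomputed once, so the bound can be checked without touching the sets r and s together.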
