2012, arXiv preprint arXiv:1211.0176
Abstract: In this paper we introduce and experimentally compare alternative algorithms to join uncertain relations. The algorithms are based on different principles, e.g., sorting, indexing, or building intermediate relational tables so that traditional approaches can be applied. As a consequence, their performance is affected by different features of the input data, and each algorithm is shown to be more efficient than the others in specific cases. In this way, statistics explicitly representing the amount and kind of uncertainty in the input uncertain relations ...
Proceedings of the 15th ACM international conference on Information and knowledge management - CIKM '06, 2006
In many applications data values are inherently uncertain; examples include moving-object, sensor, and biological databases. There has been recent interest in the development of database management systems that can handle uncertain data. Some proposals for such systems include attribute values that are uncertain. In particular, an attribute value can be modeled as a range of possible values, associated with a probability density function. Previous efforts for this type of data have only addressed simple queries such as range and nearest-neighbor queries. Queries that join multiple relations have not been addressed in earlier work despite the significance of joins in databases. In this paper we address join queries over uncertain data. We propose a semantics for the join operation, define probabilistic operators over uncertain data, and propose join algorithms that provide efficient execution of probabilistic joins. The paper focuses on an important class of joins termed probabilistic threshold joins that avoid some of the semantic complexities of dealing with uncertain data. For this class of joins we develop three sets of optimization techniques: item-level, page-level, and index-level pruning. These techniques facilitate pruning with little space and time overhead, and are easily adapted to most join algorithms. We verify the performance of these techniques experimentally.
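A minimal sketch of the threshold-join idea described above, assuming each uncertain attribute is a uniform pdf over an interval; the Monte Carlo estimate and the interval-overlap prune (a crude stand-in for the paper's item-level pruning) are illustrative, not the paper's algorithms:

```python
import random

def p_within(a_lo, a_hi, b_lo, b_hi, c, samples=20000):
    """Monte Carlo estimate of P(|X - Y| <= c) for X ~ U[a_lo, a_hi] and
    Y ~ U[b_lo, b_hi]. A real system would integrate the pdfs exactly."""
    hits = sum(
        abs(random.uniform(a_lo, a_hi) - random.uniform(b_lo, b_hi)) <= c
        for _ in range(samples)
    )
    return hits / samples

def threshold_join(R, S, c, tau):
    """Nested-loop probabilistic threshold join: keep pairs whose match
    probability is at least tau. Each tuple is (id, lo, hi)."""
    out = []
    for rid, rlo, rhi in R:
        for sid, slo, shi in S:
            # crude item-level prune: if the intervals are farther apart
            # than c, the match probability is exactly 0
            if rlo - c > shi or slo - c > rhi:
                continue
            p = p_within(rlo, rhi, slo, shi, c)
            if p >= tau:
                out.append((rid, sid, p))
    return out

R = [(1, 0.0, 2.0), (2, 5.0, 6.0)]
S = [(10, 1.5, 3.0), (11, 9.0, 9.5)]
print(threshold_join(R, S, c=1.0, tau=0.3))  # pair (1, 10), p ~ 0.375
```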
2009
In uncertain and probabilistic databases, confidence values (or probabilities) are associated with each data item. Confidence values are assigned to query results based on combining confidences from the input data. Users may wish to apply a threshold on result confidence values, ask for the "top-k" results by confidence, or obtain results sorted by confidence. Efficient algorithms for these types of queries can be devised by exploiting properties of the input data and the combining functions for result confidences. Previous algorithms for these problems assumed sufficient memory was available for processing. In this paper, we address the problem of processing all three types of queries when sufficient memory is not available, minimizing retrieval cost. We present algorithms, theoretical guarantees, and experimental evaluation.
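A toy illustration of the early-termination idea such algorithms exploit, assuming confidence-sorted inputs and a product combining function; the nested-loop structure and names are ours, not the paper's:

```python
def threshold_join_confidences(R, S, tau):
    """R and S are lists of (tuple_id, confidence), each sorted in
    descending confidence order. The result confidence of a joined pair
    is the product of the input confidences (independence assumed).
    Because the combining function is monotone, scanning can stop as
    soon as the best still-possible product falls below tau."""
    out = []
    for rid, rc in R:
        if rc * S[0][1] < tau:     # even the best partner is too weak,
            break                  # and every later r is weaker still
        for sid, sc in S:
            if rc * sc < tau:
                break              # later s have lower confidence
            out.append((rid, sid, rc * sc))
    return out

R = [("r1", 0.9), ("r2", 0.6), ("r3", 0.2)]
S = [("s1", 0.8), ("s2", 0.5), ("s3", 0.1)]
print(threshold_join_confidences(R, S, tau=0.4))
# [('r1', 's1', 0.72), ('r1', 's2', 0.45), ('r2', 's1', 0.48)]
```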
Lecture Notes in Computer Science, 2006
An important database primitive for commonly used feature databases is the similarity join. It combines two datasets based on some similarity predicate into one set such that the new set contains pairs of objects of the two original sets. In many application areas, e.g. sensor databases, location-based services, or face recognition systems, distances between objects have to be computed based on vague and uncertain data. In this paper, we propose to express the similarity between two uncertain objects by probability density functions which assign a probability value to each possible distance value. By integrating these probabilistic distance functions directly into the join algorithms, the full information provided by these functions is exploited. The resulting probabilistic similarity join assigns to each object pair a probability value indicating the likelihood that the object pair belongs to the result set. As the computation of these probability values is very expensive, we introduce an efficient join processing strategy, using the distance-range join as an example. In a detailed experimental evaluation, we demonstrate the benefits of our probabilistic similarity join. The experiments show that we can achieve high-quality join results with rather low computational cost.

2 Related Work
In the past decade, a lot of work has been done in the field of similarity join processing. Recently, some researchers have focused on query processing over uncertain data. However, to the best of our knowledge, no work has been done in the area of join processing of uncertain data. In the following, we present related work on both topics: similarity join processing and query processing of uncertain data.
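The following sketch discretizes each uncertain object into weighted samples and evaluates a distance-range join probabilistically; this sampling-based distance function is an illustrative stand-in for the paper's probabilistic distance functions:

```python
import math

def p_dist_within(obj_a, obj_b, eps):
    """obj_a, obj_b: lists of ((x, y), prob) samples approximating each
    uncertain object's pdf. Returns P(dist(A, B) <= eps), a discrete
    stand-in for a continuous probabilistic distance function."""
    p = 0.0
    for (ax, ay), pa in obj_a:
        for (bx, by), pb in obj_b:
            if math.hypot(ax - bx, ay - by) <= eps:
                p += pa * pb
    return p

def prob_distance_range_join(A, B, eps, tau):
    """Keep object pairs whose probability of being within eps is >= tau."""
    return [(i, j, p)
            for i, a in enumerate(A)
            for j, b in enumerate(B)
            if (p := p_dist_within(a, b, eps)) >= tau]

a = [((0.0, 0.0), 0.5), ((1.0, 0.0), 0.5)]
b = [((0.5, 0.0), 0.7), ((5.0, 5.0), 0.3)]
print(prob_distance_range_join([a], [b], eps=1.0, tau=0.5))  # [(0, 0, 0.7)]
```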
Lecture Notes in Computer Science, 2006
In many applications, uncertainty and ignorance go hand in hand. Therefore, to deliver database support for effective decision making, an integrated view of uncertainty and ignorance should be taken. So far, most efforts have attempted to capture uncertainty and ignorance with probability theory. In this paper, we discuss the weaknesses of probability theory in capturing ignorance, and propose an approach inspired by the Dempster-Shafer theory to capture both uncertainty and ignorance. Then, we present a rule to combine dependent data that are represented in different relations. Such a rule is required to perform joins in a consistent way. We illustrate that our rule is able to solve the so-called problem of information loss, which had so far been considered an open problem.
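For concreteness, here is classical Dempster combination of two mass functions, where ignorance is mass assigned to the whole frame; the paper argues this rule mishandles dependent data and proposes its own combination rule, so this sketch shows only the baseline being improved upon:

```python
def dempster_combine(m1, m2):
    """Dempster's rule of combination for two mass functions whose focal
    elements are frozensets. Mass assigned to empty intersections is
    conflict and is normalized away."""
    combined, conflict = {}, 0.0
    for a, pa in m1.items():
        for b, pb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + pa * pb
            else:
                conflict += pa * pb
    if conflict >= 1.0:
        raise ValueError("total conflict: sources are incompatible")
    return {s: v / (1.0 - conflict) for s, v in combined.items()}

# ignorance is expressed by putting mass on the whole frame {a, b, c}
frame = frozenset({"a", "b", "c"})
m1 = {frozenset({"a"}): 0.6, frame: 0.4}          # partially ignorant source
m2 = {frozenset({"a", "b"}): 0.7, frame: 0.3}
print(dempster_combine(m1, m2))
```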
2008
Abstract: This paper introduces U-relations, a succinct and purely relational representation system for uncertain databases. U-relations support attribute-level uncertainty using vertical partitioning. If we consider positive relational algebra extended by an operation for computing possible answers, a query on the logical level can be translated into, and evaluated as, a single relational algebra query on the U-relational representation.
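A toy sketch (not the paper's actual encoding or syntax) of why conditioned tuples compose under positive relational algebra: joining tuples merges their conditions, and inconsistent conditions simply produce no output row:

```python
def u_join(u1, u2):
    """Natural join on the 'key' attribute over condition-annotated tuples.
    Each entry is (condition, row) where condition maps discrete variables
    to values; a tuple exists exactly in the worlds consistent with its
    condition. Two tuples join only if their conditions agree, and the
    result carries the union of the conditions."""
    out = []
    for c1, row1 in u1:
        for c2, row2 in u2:
            if row1["key"] != row2["key"]:
                continue
            merged = dict(c1)
            if any(merged.get(v, x) != x for v, x in c2.items()):
                continue                  # inconsistent conditions: no output
            merged.update(c2)
            out.append((merged, {**row1, **row2}))
    return out

# variable x chooses between two alternative addresses for the same key
R = [({"x": 1}, {"key": 7, "addr": "Main St"}),
     ({"x": 2}, {"key": 7, "addr": "High St"})]
S = [({}, {"key": 7, "name": "Ann"})]
print(u_join(R, S))
```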
Journal of Computer Science and Cybernetics
In this paper, we propose a new probabilistic relational database model, denoted PRDB, as an extension of the classical relational database model, in which the uncertainty of relational attribute values and tuples is represented by finite sets and probability intervals, respectively. A probabilistic interpretation of binary relations on finite sets is proposed for the computation of their probability measures. Combination strategies on probability intervals are employed to combine attribute values and compute uncertain membership degrees of tuples in a relation. The fundamental concepts of the classical relational database model are extended and generalized for PRDB, and the probabilistic relational algebraic operations are then formally defined accordingly. In addition, a set of properties of the algebraic operations in this new model is also formulated and proven.
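A small illustration of combination strategies on probability intervals; the independence and Fréchet-bound rules below are standard in the interval-probability literature and are shown as plausible examples, since the paper defines its own strategies:

```python
def conj_independence(i1, i2):
    """Conjunction of two probability intervals assuming independence."""
    (l1, u1), (l2, u2) = i1, i2
    return (l1 * l2, u1 * u2)

def conj_ignorance(i1, i2):
    """Conjunction with unknown dependency (Fréchet bounds): the widest
    interval consistent with both marginals."""
    (l1, u1), (l2, u2) = i1, i2
    return (max(0.0, l1 + l2 - 1.0), min(u1, u2))

def disj_ignorance(i1, i2):
    """Disjunction with unknown dependency (Fréchet bounds)."""
    (l1, u1), (l2, u2) = i1, i2
    return (max(l1, l2), min(1.0, u1 + u2))

# membership interval of a tuple built from two uncertain attribute values
v1, v2 = (0.6, 0.8), (0.5, 0.9)
print(conj_independence(v1, v2))   # (0.30, 0.72)
print(conj_ignorance(v1, v2))      # (0.10, 0.80)
print(disj_ignorance(v1, v2))      # (0.60, 1.00)
```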
2011
The skyline of a relation is the set of tuples that are not dominated by any other tuple in the same relation, where tuple u dominates tuple v if u is no worse than v on all the attributes of interest and strictly better on at least one attribute. Previous attempts to extend skyline queries to probabilistic databases have proposed either a weaker form of domination, which is unsuitable to univocally define the skyline, or a definition that implies algorithms with exponential complexity. In this paper we demonstrate how, given a semantics for linearly ranking probabilistic tuples, the skyline of a probabilistic relation can be univocally defined. Our approach preserves the three fundamental properties of skyline: 1) it equals the union of all top-1 results of monotone scoring functions, 2) it requires no additional parameter to be specified, and 3) it is insensitive to actual attribute scales. We also detail efficient sequential and index-based algorithms.
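The domination definition above translates directly into code; a quadratic skyline sketch, assuming smaller values are better on every attribute:

```python
def dominates(u, v):
    """u dominates v: no worse on every attribute (smaller is better here)
    and strictly better on at least one."""
    return (all(a <= b for a, b in zip(u, v))
            and any(a < b for a, b in zip(u, v)))

def skyline(tuples):
    """Quadratic skyline: keep exactly the tuples not dominated by any
    other tuple. The paper's contribution is making this well defined on
    probabilistic relations once a linear ranking semantics is fixed."""
    return [u for u in tuples
            if not any(dominates(v, u) for v in tuples if v is not u)]

# e.g. (price, distance), both to be minimized
hotels = [(100, 3.0), (80, 5.0), (120, 1.0), (110, 4.0), (100, 5.0)]
print(skyline(hotels))  # [(100, 3.0), (80, 5.0), (120, 1.0)]
```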
Computer and Information Science, 2015
In the last years, uncertainty management became an important aspect as the presence of uncertain data increased rapidly. Due to the several advanced technologies that have been developed to record large quantity of data continuously, resulting is a data that contain errors or may be partially complete. Instead of dealing with data uncertainty by removing it, we must deal with it as a source of information. To deal with this data, database management system should have special features to handle uncertain data. The aim of this paper is twofold: on one hand, to introduce some main concepts of uncertainty in database by focusing on different data management issues in uncertain databases such as join and query processing, database integration, indexing uncertain data, security and information leakage and representation formalisms. On the other hand, to provide a survey of the current database management systems dealing with uncertain data, presenting their features and comparing them.
2008
The Trio project at Stanford for managing data, uncertainty, and lineage is developed on top of a conventional DBMS. Uncertain data with lineage is encoded in relational tables, and Trio queries are translated to SQL queries on the encoding. Such a layered approach reaps significant benefits in terms of architectural simplicity, and the ability to use an off-the-shelf query processing engine. In this paper, we present special-purpose indexes and statistics that complement the layered approach to further enhance its performance. First, we identify a well-defined structure of Trio queries, relations, and their encoding that can be exploited by the underlying query optimizer to improve the performance using Trio's layered approach. We propose several mechanisms for indexing Trio's uncertain relations and study when these indexes are useful. We then present an interesting order, and an associated operator, which are especially useful to consider when composing query plans. The decision of which query plan to use for a Trio query is dictated by various statistical properties of the input data. We identify the statistical data that can guide the underlying optimizer, and design histograms that enable estimating the statistics accurately.
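A toy sketch of the layered idea: uncertain data encoded in an ordinary table and queried with ordinary operations. The column names and the Saw example are illustrative, not Trio's actual schema:

```python
# Alternatives of one x-tuple share an xid; at most one alternative of an
# x-tuple is present in any possible instance, with the given confidence.
saw = [
    # (xid, alt, witness, color, conf)
    (1, 1, "Amy",  "red",  0.6),
    (1, 2, "Amy",  "blue", 0.4),   # mutually exclusive with alt 1
    (2, 1, "Bill", "red",  0.9),
]

def select_color(rows, color):
    """A selection over the encoding is plain filtering; confidences ride
    along unchanged, which is what lets a layered system push Trio queries
    down to an off-the-shelf SQL engine."""
    return [r for r in rows if r[3] == color]

print(select_color(saw, "red"))
# [(1, 1, 'Amy', 'red', 0.6), (2, 1, 'Bill', 'red', 0.9)]
```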
Information Sciences, 2003
Information from which knowledge can be discovered is frequently distributed due to having been recorded at different times or to having arisen from different sources. Such information is often subject to both imprecision and uncertainty. The Dempster-Shafer representation of evidence offers a way of representing uncertainty in the presence of imprecision, and may therefore be used to provide a mechanism for storing imprecise and uncertain information in databases. We consider an extended relational data model that allows the imprecision and uncertainty associated with attribute values to be quantified using a mass function distribution. When a query is executed, it may be necessary to combine imprecise and uncertain data from distributed sources in order to answer that query. A mechanism is therefore required both for combining the data and for generating measures of uncertainty to be attached to the (imprecise) combined data. In this paper we provide such a mechanism based on aggregation of evidence. We show first how this mechanism can be used to resolve inconsistencies and hence provide an essential database capability to perform the operations necessary to respond to queries on imprecise and uncertain data. We go on to exploit the aggregation operator in an attribute-driven approach to provide information on properties of and patterns in the data. This is fundamental to rule discovery, and hence such an aggregation operator provides a facility that is a central requirement in providing a distributed information system with the capability to perform the operations necessary for Knowledge Discovery.
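One simple aggregation operator over mass functions, weighted averaging, which reconciles rather than rejects conflicting evidence; this is an illustrative stand-in for the paper's aggregation mechanism, not a reproduction of it:

```python
def aggregate_masses(sources):
    """Aggregate mass functions from distributed sources by weighted
    averaging. Unlike Dempster's rule, averaging never discards
    conflicting mass, so inconsistent sources are resolved rather than
    rejected. sources: list of (weight, mass_function) pairs, where a
    mass function maps frozenset focal elements to mass."""
    total = sum(w for w, _ in sources)
    agg = {}
    for w, m in sources:
        for focal, p in m.items():
            agg[focal] = agg.get(focal, 0.0) + (w / total) * p
    return agg

# two sites report imprecise values for the same attribute;
# weights could be, e.g., the number of records behind each report
site1 = {frozenset({"flu"}): 0.7, frozenset({"flu", "cold"}): 0.3}
site2 = {frozenset({"cold"}): 0.5, frozenset({"flu", "cold"}): 0.5}
print(aggregate_masses([(100, site1), (50, site2)]))
```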
The Vldb Journal, 2009
In general terms, an uncertain relation encodes a set of possible certain relations. There are many ways to represent uncertainty, ranging from alternative values for attributes to rich constraint languages. Among the possible models for uncertain data, there is a tension between simple and intuitive models, which tend to be incomplete, and complete models, which tend to be nonintuitive and more complex than necessary for many applications. We present a space of models for representing uncertain data based on a variety of uncertainty constructs and tuple-existence constraints. We explore a number of properties and results for these models. We study completeness of the models, as well as closure under relational operations, and we give results relating closure and completeness. We then examine whether different models guarantee unique representations of uncertain data, and for those models that do not, we provide complexity results and algorithms for testing equivalence of representations. The next problem we consider is that of minimizing the size of representation of models, showing that minimizing the number of tuples also minimizes the size of constraints. We show that minimization is intractable in general and study the more restricted problem of maintaining minimality incrementally when performing operations. Finally, we present several results on the problem of approximating uncertain data in an insufficiently expressive model.
2009
Abstract: The ability to flexibly compose confidence computation with the operations of relational algebra is an important feature of probabilistic database query languages. Computing confidences is computationally hard, however, and has to be approximated in practice.
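A naive Monte Carlo sketch of confidence approximation for a result tuple whose lineage is a DNF over independent base tuples; the representation and names here are ours, chosen only to make the hardness-and-approximation point concrete:

```python
import random

def approx_confidence(lineage, probs, trials=100_000):
    """Estimate a result tuple's confidence by sampling worlds. lineage is
    a DNF over base-tuple ids (a list of conjunctions, each a list of ids);
    probs maps each id to its independent probability. Exact confidence
    computation is #P-hard in general, hence the approximation."""
    hits = 0
    ids = list(probs)
    for _ in range(trials):
        world = {t: random.random() < probs[t] for t in ids}
        if any(all(world[t] for t in conj) for conj in lineage):
            hits += 1
    return hits / trials

# result tuple derived either from base tuples 1 and 2, or from tuple 3
lineage = [[1, 2], [3]]
probs = {1: 0.8, 2: 0.5, 3: 0.3}
print(approx_confidence(lineage, probs))  # exact: 1-(1-0.4)*(1-0.3) = 0.58
```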
ACM Transactions on Database Systems, 1996
Although the relational model for databases provides a great range of advantages over other data models, it lacks a comprehensive way to handle incomplete and uncertain data. Uncertainty in data values, however, is pervasive in all real-world environments and has received much attention in the literature. Several methods have been proposed for incorporating uncertain data into relational databases. However, the current approaches have many shortcomings and have not established an acceptable extension of the relational model. In this paper, we propose a consistent extension of the relational model. We present a revised relational structure and extend the relational algebra. The extended algebra is shown to be closed, a consistent extension of the conventional relational algebra, and reducible to the latter.
Proceedings of the VLDB Endowment, 2010
Set similarity join has played an important role in many real-world applications such as data cleaning, near-duplicate detection, data integration, and so on. In these applications, set data often contain noise and are thus uncertain and imprecise. In this paper, we model such probabilistic set data on two uncertainty levels, that is, the set and element levels. Based on them, we investigate the problem of probabilistic set similarity join (PS²J) over two probabilistic set databases, under the possible-worlds semantics. To efficiently process the PS²J operator, we first reduce our problem by condensing the possible worlds, and then propose effective pruning techniques, including Jaccard distance pruning, probability upper bound pruning, and aggregate pruning, which can filter out false alarms of probabilistic set pairs, with the help of indexes and our designed synopses. We demonstrate through extensive experiments the PS²J processing performance on both real and synthetic data. The pivot-based pruning bound, reconstructed from the garbled text: $Jdist(r, s) \ge Jdist(piv_i^r, s) - Jdist(r, piv_i^r) \ge Jdist(piv_i^r, s) - L(r, piv_i^r)$.
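A small sketch of the Jaccard-distance pruning idea using the reconstructed pivot bound above; the pivot machinery and the precomputed upper bound are illustrative stand-ins for the paper's synopses:

```python
def jdist(r, s):
    """Jaccard distance between two non-empty sets: 1 - |r & s| / |r | s|.
    It is a metric, so the triangle inequality below is valid."""
    return 1.0 - len(r & s) / len(r | s)

def pivot_prune(s, pivot, ub_r_pivot, eps):
    """Triangle-inequality pruning: Jdist(r, s) >= Jdist(pivot, s) - U,
    where U is any precomputed upper bound on Jdist(r, pivot). If this
    lower bound already exceeds eps, pair (r, s) cannot join and r itself
    never has to be touched. Returns True when the pair is safely pruned."""
    return jdist(pivot, s) - ub_r_pivot > eps

r = {1, 2, 3}
s = {7, 8, 9}
pivot = {1, 2, 4}
# here the bound is computed exactly; a real system would store it
print(pivot_prune(s, pivot, ub_r_pivot=jdist(r, pivot), eps=0.4))  # True
```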
2007
After the Twente Data Management workshop on Uncertainty in Databases, held at the University of Twente in June 2006, the speakers and participants expressed their wish for a workshop on the same topic co-located with a large international conference. This Management of Uncertain Data workshop, co-located with the international conference on Very Large Data Bases (VLDB), is the result of that wish. We received 9 submissions from all over the world. Each submission was reviewed by at least 3 different reviewers, resulting in 6 accepted papers for the workshop. In addition, we have 2 invited talks: the first, Combining Tuple and Attribute Uncertainty in Probabilistic Databases, by Lise Getoor from the University of Maryland, and the second, Supporting Probabilistic Data in Relational Databases, by Sunil Prabhakar from Purdue University. We would like to thank the PC members for their effort in reviewing the papers and, of course, the authors of all submitted papers for their work. We would also like to thank the Centre for Telematics and Information Technology (CTIT) for sponsoring the proceedings. Last, but not least, we would like to thank the VLDB organizers for their support in organizing this workshop.
In this extended abstract we apply the notion of skyline to the case of probabilistic relations including correlation among tuples. In particular, we consider the relevant case of the x-relation model, consisting of a set of generation rules specifying the mutual exclusion of tuples. We show how our definitions apply to different ranking semantics and analyze the time complexity for the resolution of skyline queries.
bvicam.ac.in
Databases today are deterministic: an item is either in the database or not, and likewise a tuple is either in the query result or not. Yet the process of mapping the real world into a database inherently includes ambiguities and uncertainties and is seldom perfect. In today's data-driven, competitive world, a wide range of applications has emerged that must handle very large, imprecise data sets with inherent uncertainties. Uncertain data is natural in many important real-world applications such as environmental surveillance, market analysis, and quantitative economic research. The data uncertainty innate in these applications generally results from factors such as data randomness and incompleteness, misaligned schemas, limitations of measuring equipment, delayed data updates, imprecise queries, etc. Due to the importance of these applications and the rapidly increasing amount of uncertain data collected and accumulated, analyzing large collections of uncertain data has become an important task and has attracted growing interest from the database community.

Probabilistic databases hold the promise of being a viable means for the large-scale uncertainty management increasingly required in many real-world application domains. A probabilistic database is an uncertain database in which the possible worlds have associated probabilities: an item's membership in the database is a probabilistic event, with either tuple-existence uncertainty or attribute-value uncertainty, and a tuple's appearance as an answer to a query is again a probabilistic event. An important aspect of research and development on uncertain data processing is query answering over uncertain and probabilistic data. Query processing in probabilistic databases remains a computational challenge, as it is fundamentally more complex than in other data models. There exists a rich collection of powerful, non-trivial techniques and results, some old, some very recent, that could lead to practical management techniques for probabilistic databases. However, all such techniques suffer from the uncertainty inherent in the result of a query. Hence, there is a need for a general probabilistic model that tackles this uncertainty at its root.

The basic tool for dealing with this uncertainty is probability, defined for an event as the proportion of times the event would occur in repetitions of essentially identical situations. Although useful and successful in many applications, probability theory is in fact appropriate for dealing with only a very special type of uncertainty in measuring information. Probabilistic databases are all the more susceptible to uncertainty in query results, being exclusively dependent on the assigned probabilities, with uncertainty inherent in the evaluation of those probabilities. Thus they become a potential area where this fundamental problem can be addressed and a suitable correction made to the probabilities evaluated.
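To make the possible-worlds semantics concrete, here is a brute-force sketch for a tuple-independent database; it is illustrative only, and its exponential cost is exactly why practical query processing is hard:

```python
from itertools import product

def answer_probability(tuples, query):
    """Enumerate all possible worlds of a tuple-independent probabilistic
    database and sum the probabilities of the worlds where the query holds.
    tuples: list of (tuple_id, prob); query: predicate over the set of
    present tuple ids. Exponential in the number of tuples."""
    total = 0.0
    for presence in product([False, True], repeat=len(tuples)):
        world = {t for (t, _), here in zip(tuples, presence) if here}
        p = 1.0
        for (t, pt), here in zip(tuples, presence):
            p *= pt if here else (1.0 - pt)
        if query(world):
            total += p
    return total

db = [("t1", 0.8), ("t2", 0.5)]
# probability that at least one of t1, t2 is in the database
print(answer_probability(db, lambda w: "t1" in w or "t2" in w))  # 0.9
```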
2007
Abstract: There has been a recent surge of work in probabilistic databases, propelled in large part by the huge increase in noisy data sources: sensor data, experimental data, data from uncurated sources, and many others. There is a growing need to flexibly represent the uncertainties in the data and to efficiently query the data. Building on existing probabilistic database work, we present a unifying framework which allows a flexible representation of correlated tuple- and attribute-level uncertainties.