2008
The Trio project at Stanford for managing data, uncertainty, and lineage is developed on top of a conventional DBMS. Uncertain data with lineage is encoded in relational tables, and Trio queries are translated to SQL queries on the encoding. Such a layered approach reaps significant benefits in terms of architectural simplicity, and the ability to use an off-the-shelf query processing engine. In this paper, we present special-purpose indexes and statistics that complement the layered approach to further enhance its performance. First, we identify a well-defined structure of Trio queries, relations, and their encoding that can be exploited by the underlying query optimizer to improve the performance using Trio's layered approach. We propose several mechanisms for indexing Trio's uncertain relations and study when these indexes are useful. We then present an interesting order, and an associated operator, which are especially useful to consider when composing query plans. The decision of which query plan to use for a Trio query is dictated by various statistical properties of the input data. We identify the statistical data that can guide the underlying optimizer, and design histograms that enable estimating the statistics accurately.
2007
Trio is a new kind of database system that supports data, uncertainty, and lineage in a fully integrated manner. The first Trio prototype, dubbed Trio-One, is built on top of a conventional DBMS using data and query translation techniques together with a small number of stored procedures. This paper describes Trio-One's translation scheme and system architecture, showing how it efficiently and easily supports the Trio data model and query language.
bvicam.ac.in
Databases today are deterministic; that is, an item is either in the database or not. Similarly, a tuple is either in the query result or not. This process of mapping the real world inherently includes ambiguities and uncertainties and is seldom perfect. In today's data-driven competitive world, a wide range of applications have emerged that need to handle very large, imprecise data sets with inherent uncertainties. Uncertain data is natural in many important real-world applications such as environmental surveillance, market analysis, and quantitative economic research. The data uncertainty innate in these applications is generally the result of factors such as data randomness and incompleteness, misaligned schemas, limitations of measuring equipment, delayed data updates, and imprecise queries. Due to the importance of these applications and the rapidly increasing amount of uncertain data collected and accumulated, analyzing large collections of uncertain data has become an important task and has attracted more and more interest from the database community. Probabilistic databases hold the promise of being a viable means for large-scale uncertainty management, which is increasingly required in a large number of real-world application domains. A probabilistic database is an uncertain database in which the possible worlds have associated probabilities; that is, whether an item belongs to the database is a probabilistic event, with either tuple-existence uncertainty or attribute-value uncertainty. Likewise, a tuple as an answer to a query is again a probabilistic event. An important aspect of research and development on uncertain data processing is query answering over uncertain and probabilistic data. Query processing in probabilistic databases remains a computational challenge, as it is fundamentally more complex than in other data models.
There exists a rich collection of powerful, non-trivial techniques and results, some old, some very recent, that could lead to practical management techniques for probabilistic databases. However, all such techniques are limited by the uncertainty inherent in the query result. Hence, there is a need for a general probabilistic model that tackles this uncertainty at the grass-roots level. The basic tool for dealing with this uncertainty is probability, which is defined for an event as the proportion of times that the event would occur in repetitions of essentially identical situations. Although useful and successful in many applications, probability theory is, in fact, appropriate for dealing with only a very special type of uncertainty for measuring information. Probabilistic databases are all the more susceptible to uncertainty in query results, because those results depend exclusively on assigned probabilities whose evaluation is itself uncertain. This makes them a promising area in which to address this fundamental problem and to apply a suitable correction to the evaluated probabilities.
2007
Top-k processing in uncertain databases is semantically and computationally different from traditional top-k processing. The interplay between query scores and data uncertainty makes traditional techniques inapplicable. We introduce URank, a system that processes new probabilistic formulations of top-k queries in uncertain databases. The new formulations are based on marriage of traditional top-k semantics with possible worlds semantics. URank encapsulates a new processing framework that leverages existing query processing capabilities, and implements efficient search strategies that integrate ranking on scores with ranking on probabilities, to obtain meaningful answers for top-k queries.
International Journal of Innovative Research in Science, Engineering and Technology, 2012
Many real-world applications need a database that stores probabilistic and uncertain data. Trio is a robust prototype built to store and retrieve uncertain data and lineage. It also supports some features of a relational DBMS. ULDB is an extension of relational databases with expressive constructs for representing and manipulating both lineage and uncertainty. The ULDB representation is complete, and it permits straightforward implementation of many relational operations. Currently Trio performs only select-project-join queries and some set operations. Queries are expressed using the TriQL query language. This paper highlights how multiple aggregations can be handled in the select clause in the Trio system for uncertain and probabilistic data. It also highlights how the distinct clause can be used along with aggregation functions. The results of implementing the minus and intersect all clauses in the Trio system are discussed. These operations allow users to use Trio system in a more flexib...
2004
It is infeasible for a sensor database to contain the exact value of each sensor at all points in time. This uncertainty is inherent in these systems due to measurement and sampling errors, and resource limitations. In order to avoid drawing erroneous conclusions based upon stale data, the use of uncertainty intervals that model each data item as a range and associated probability density function (pdf) rather than a single value has recently been proposed. Querying these uncertain data introduces imprecision into answers, in the form of probability values that specify the likelihood that the answer satisfies the query. These queries are more expensive to evaluate than their traditional counterparts but are guaranteed to be correct and more informative due to the probabilities accompanying the answers. Although the answer probabilities are useful, for many applications their precise value is less critical. In particular, for many queries it is only necessary to know whether the probability exceeds a given threshold; we term these Probabilistic Threshold Queries (PTQ). In this paper we address the efficient computation of these types of queries.
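To make the idea of a probabilistic threshold query concrete, here is a minimal sketch of a threshold range query. It assumes each sensor reading is uniformly distributed over its uncertainty interval, which is an illustrative simplification rather than the paper's model; the names `interval_probability` and `threshold_range_query` are hypothetical, and the threshold is treated as "at least tau".

```python
def interval_probability(lo, hi, a, b):
    """P(a <= X <= b) for X uniform on [lo, hi] (assumed pdf, for illustration)."""
    if hi <= lo:
        raise ValueError("uncertainty interval must have positive width")
    # Length of the overlap between the uncertainty interval and the query range.
    overlap = max(0.0, min(hi, b) - max(lo, a))
    return overlap / (hi - lo)


def threshold_range_query(items, a, b, tau):
    """Return ids whose probability of lying in [a, b] is at least tau."""
    return [
        item_id
        for item_id, (lo, hi) in items.items()
        if interval_probability(lo, hi, a, b) >= tau
    ]


# Hypothetical sensor readings: id -> (lower bound, upper bound).
sensors = {"s1": (10.0, 20.0), "s2": (18.0, 30.0), "s3": (0.0, 5.0)}
answer = threshold_range_query(sensors, a=15.0, b=25.0, tau=0.5)
```

Because only the comparison against `tau` matters, an implementation can stop as soon as it can bound the probability above or below the threshold, without computing its exact value.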
2010
Recently, many new applications, such as sensor data monitoring and mobile device tracking, have raised the issue of uncertain data management. Compared to "certain" data, the data in an uncertain database are not exact points; instead, they often reside within a region. In this paper, we study ranked queries over uncertain data. Ranked queries have been studied extensively in the traditional database literature due to their popularity in many applications, such as decision making, recommendation raising, and data mining tasks. Many proposals have been made to improve the efficiency of answering ranked queries. However, the existing approaches are all based on the assumption that the underlying data are exact (or certain). Due to the intrinsic differences between uncertain and certain data, these methods are designed only for ranked queries in certain databases and cannot be applied to the uncertain case directly. Motivated by this, we propose novel solutions to speed up the probabilistic ranked query (PRank) with monotonic preference functions over an uncertain database. Specifically, we introduce two effective pruning methods, spatial and probabilistic pruning, to help reduce the PRank search space. A special case of PRank with linear preference functions is also studied. Then, we seamlessly integrate these pruning heuristics into the PRank query procedure. Furthermore, we propose and tackle PRank query processing over the join of two distinct uncertain databases. Extensive experiments have demonstrated the efficiency and effectiveness of our proposed approaches in answering PRank queries, in terms of both wall clock time and the number of candidates to be refined.
IEEE Data(base) Engineering Bulletin, 2006
We describe a system that supports arbitrarily complex SQL queries with "uncertain" predicates. The query semantics is based on a probabilistic model and the results are ranked, much like in Information Retrieval. Our main focus is query evaluation. We describe an optimization algorithm that can compute most queries efficiently. We show, however, that the data complexity of some
Proceedings of the 2021 International Conference on Management of Data, 2021
Incomplete and probabilistic database techniques are principled methods for coping with uncertainty in data. Unfortunately, the class of queries that can be answered efficiently over such databases is severely limited, even when advanced approximation techniques are employed. We introduce attribute-annotated uncertain databases (AU-DBs), an uncertain data model that annotates tuples and attribute values with bounds to compactly approximate an incomplete database. AU-DBs are closed under relational algebra with aggregation using an efficient evaluation semantics. Using optimizations that trade accuracy for performance, our approach scales to complex queries and large datasets, and produces accurate results.
2006
trio
2. Proposed a new scheme called ULDBs. ULDBs extend the relational model with simple forms of uncertainty that, when combined with lineage, yield nice properties and strong expressiveness [1].
3. Proposed a SQL-based query language for ULDBs called TriQL (pronounced "treacle"). TriQL modifies the semantics of SQL to take uncertainty and lineage into account, and introduces new constructs to query uncertainty and lineage directly [2].
4. Implemented a first working prototype of our model and language by building on top of a conventional DBMS [2].
2009
The ability to flexibly compose confidence computation with the operations of relational algebra is an important feature of probabilistic database query languages. Computing confidences is computationally hard, however, and has to be approximated in practice.
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, 2010
The probabilistic threshold query (PTQ) is one of the most common queries in uncertain databases, where all results satisfying the query with probabilities that meet the threshold requirement are returned. PTQ is used widely in nearest-neighbor queries, range queries, ranking queries, etc. In this paper, we investigate the general PTQ for arbitrary SQL queries that involve selections, projections and joins. The uncertain database model that we use is one that combines both attribute and tuple uncertainty as well as correlations between arbitrary attribute sets. We address the PTQ optimization problem that aims at improving the efficiency of PTQ query execution by enabling alternative query plan enumeration for optimization. We propose general optimization rules as well as rules specifically for selections, projections and joins. We introduce a threshold operator (τ-operator) to the query plan and show it is generally desirable to push down the τ-operator as much as possible. Our PTQ optimizations are evaluated in a real uncertain database management system. Our experiments on both real and synthetic data sets show that the optimizations improve the PTQ query processing time.
2007
Top-k processing in uncertain databases is semantically and computationally different from traditional top-k processing. The interplay between score and uncertainty makes traditional techniques inapplicable. We introduce new probabilistic formulations for top-k queries. Our formulations are based on "marriage" of traditional top-k semantics and possible worlds semantics. In the light of these formulations, we construct a framework that encapsulates a state space model and efficient query processing techniques to tackle the challenges of uncertain data settings. We prove that our techniques are optimal in terms of the number of accessed tuples and materialized search states. Our experiments show the efficiency of our techniques under different data distributions with orders of magnitude improvement over naïve materialization of possible worlds.
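The naive baseline mentioned above, materializing every possible world, can be sketched as follows. This is only an illustration under stated assumptions: tuples are taken to be independent, and the quantity computed (the probability that each tuple appears among the top-k of a random world) is one simple possible-worlds formulation chosen here for clarity, not necessarily the formulation the paper proposes. The function name `topk_membership_probabilities` is hypothetical.

```python
from itertools import product


def topk_membership_probabilities(tuples, k):
    """tuples: list of (tuple_id, score, existence_probability), assumed independent.

    Returns {tuple_id: probability that the tuple is among the k highest-scored
    tuples of a possible world}. Enumerates all 2^n worlds, so it is exponential.
    """
    result = {tid: 0.0 for tid, _, _ in tuples}
    for world in product([False, True], repeat=len(tuples)):
        world_prob = 1.0
        present = []
        for included, (tid, score, p) in zip(world, tuples):
            # Independence: multiply p if the tuple exists, (1 - p) otherwise.
            world_prob *= p if included else 1.0 - p
            if included:
                present.append((score, tid))
        present.sort(reverse=True)  # highest score first
        for _, tid in present[:k]:
            result[tid] += world_prob
    return result


# Hypothetical data: (id, score, probability).
data = [("a", 10, 0.5), ("b", 8, 1.0), ("c", 5, 0.8)]
top1 = topk_membership_probabilities(data, k=1)
top2 = topk_membership_probabilities(data, k=2)
```

The number of worlds doubles with every uncertain tuple, which is why avoiding this enumeration matters.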
Lecture Notes in Computer Science, 2011
Top-k queries allow end-users to focus on the most important (top-k) answers amongst those which satisfy the query. In traditional databases, a user-defined score function assigns a score value to each tuple, and a top-k query returns the k tuples with the highest scores. In an uncertain database, the top-k answer depends not only on the scores but also on the membership probabilities of tuples. Several top-k definitions covering different aspects of the score-probability interplay have been proposed in the recent past [10], [4], [2], [8]. Most of the existing work in this research field is focused on developing efficient algorithms for answering top-k queries on static uncertain data. Any change (insertion or deletion of a tuple, or a change in the membership probability or score of a tuple) in the underlying data forces re-computation of query answers. Such re-computations are not practical considering the dynamic nature of data in many applications. In this paper, we propose a fully dynamic data structure that uses the ranking function PRF^e(α) proposed by Li et al. [8] under the generally adopted model of x-relations [11]. PRF^e can effectively approximate various other top-k definitions on uncertain data based on the value of the parameter α. An x-relation consists of a number of x-tuples, where an x-tuple is a set of mutually exclusive tuples (up to a constant number) called alternatives. Each x-tuple in a relation randomly instantiates into one tuple from its alternatives. For an uncertain relation with N tuples, our structure can answer top-k queries in O(k log N) time, handles an update in O(log N) time, and takes O(N) space. Finally, we evaluate the practical efficiency of our structure on both synthetic and real data.
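As a rough illustration of ranking by PRF^e(α), the sketch below recomputes all ranking values from scratch. It rests on two assumptions that are not stated in the abstract: that the ranking value of a tuple t is the sum over ranks i of α^i · Pr(t is at rank i), and that every x-tuple has a single alternative, so tuples are independent. It is a static O(N log N) recomputation, not the dynamic structure the paper proposes, and the names `prf_e_scores` and `top_k` are hypothetical.

```python
def prf_e_scores(tuples, alpha):
    """tuples: list of (tuple_id, score, probability), assumed independent.

    With tuples sorted by descending score, the assumed ranking value of the
    i-th tuple is alpha * p_i * prod_{j < i} (1 - p_j + p_j * alpha).
    """
    ordered = sorted(tuples, key=lambda t: t[1], reverse=True)
    prefix = 1.0  # running product over higher-scored tuples
    values = {}
    for tid, _, p in ordered:
        values[tid] = prefix * p * alpha
        # A higher-scored tuple shifts the rank by one (factor alpha) if it
        # exists, and leaves it unchanged if it does not.
        prefix *= (1.0 - p) + p * alpha
    return values


def top_k(tuples, alpha, k):
    """Ids of the k tuples with the largest ranking value (real alpha in (0, 1])."""
    values = prf_e_scores(tuples, alpha)
    return sorted(values, key=values.get, reverse=True)[:k]


# Hypothetical data: (id, score, probability).
data = [("a", 10, 0.5), ("b", 8, 1.0), ("c", 5, 0.8)]
values = prf_e_scores(data, alpha=0.5)
best_two = top_k(data, alpha=0.5, k=2)
```

Under these assumptions, setting α = 1 reduces the ranking value to the tuple's existence probability, while smaller α puts more weight on tuples likely to appear at the highest ranks.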
2008 IEEE 24th International Conference on Data Engineering, 2008
This paper introduces U-relations, a succinct and purely relational representation system for uncertain databases. U-relations support attribute-level uncertainty using vertical partitioning. If we consider positive relational algebra extended by an operation for computing possible answers, a query on the logical level can be translated into, and evaluated as, a single relational algebra query on the U-relation representation. The translation scheme essentially preserves the size of the query in terms of number of operations and, in particular, number of joins. Standard techniques employed in off-the-shelf relational database management systems are effective for optimizing and processing queries on U-relations. In our experiments we show that query evaluation on U-relations scales to large amounts of data with high degrees of uncertainty.
Distributed and Parallel Databases, 1993
In heterogeneous database systems, partial values have been used to resolve some schema integration problems. Performing operations on partial values may produce maybe tuples in the query result which cannot be compared. Thus, users have no way to distinguish which maybe tuple is the most likely answer. In this paper, the concept of partial values is generalized to probabilistic partial values. We propose an approach to resolve the schema integration problems using probabilistic partial values and develop a full set of extended relational operators for manipulating relations containing probabilistic partial values. With this approach, the uncertain answer tuples of a query are associated with degrees of uncertainty (represented by probabilities). This gives users a way to compare maybe tuples and a better understanding of the query results. Besides, extended selection and join are generalized to α-selection and α-join, respectively, which can be used to filter out maybe tuples with low probabilities, that is, those with probabilities smaller than α.
Dynamic querying is a technique which has been used successfully to enable novice users to gain access to and insight into data in databases. Some multimedia archives (such as archives of African art) contain data which have vague locations in time and space; that is, although there is some idea of when and where the entity originated, the precise information is unknown. This uncertainty creates problems with the display and querying of the data, and so the data is generally not accessible to novice users.
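The filtering behaviour of a selection with a probability cutoff can be sketched as below. The representation is an assumption made for illustration: a probabilistic partial value is modelled as a dictionary from candidate values to probabilities, and a tuple's probability of satisfying the predicate is the total probability of its qualifying candidates. The function name `alpha_select` is hypothetical and does not reproduce the paper's operator definitions.

```python
def alpha_select(relation, attribute, predicate, alpha):
    """Keep tuples whose probability of satisfying `predicate` is at least alpha.

    relation: list of dict rows; row[attribute] is {candidate_value: probability}.
    Returns a list of (row, probability) pairs.
    """
    selected = []
    for row in relation:
        distribution = row[attribute]
        # Probability mass of the candidate values that satisfy the predicate.
        p = sum(prob for value, prob in distribution.items() if predicate(value))
        if p > 0 and p >= alpha:
            selected.append((row, p))
    return selected


# Hypothetical integrated relation with an uncertain "city" attribute.
people = [
    {"name": "x", "city": {"Paris": 0.7, "Lyon": 0.3}},
    {"name": "y", "city": {"Paris": 0.2, "Nice": 0.8}},
]
in_paris = alpha_select(people, "city", lambda v: v == "Paris", alpha=0.5)
```

With a cutoff of 0.5 only the first row survives; lowering the cutoff to 0.2 would also return the second row as a lower-probability maybe tuple.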
Proceedings of the 13th International Conference on Extending Database Technology - EDBT '10, 2010
There are two broad approaches to query evaluation over probabilistic databases: (1) Intensional Methods proceed by manipulating expressions over symbolic events associated with uncertain tuples. This approach is very general and can be applied to any query, but requires an expensive postprocessing phase, which involves some general-purpose probabilistic inference.
Proceedings of the 15th ACM international conference on Information and knowledge management - CIKM '06, 2006
In many applications data values are inherently uncertain. This includes moving-objects, sensors and biological databases. There has been recent interest in the development of database management systems that can handle uncertain data. Some proposals for such systems include attribute values that are uncertain. In particular, an attribute value can be modeled as a range of possible values, associated with a probability density function. Previous efforts for this type of data have only addressed simple queries such as range and nearest-neighbor queries. Queries that join multiple relations have not been addressed in earlier work despite the significance of joins in databases. In this paper we address join queries over uncertain data. We propose a semantics for the join operation, define probabilistic operators over uncertain data, and propose join algorithms that provide efficient execution of probabilistic joins. The paper focuses on an important class of joins termed probabilistic threshold joins that avoid some of the semantic complexities of dealing with uncertain data. For this class of joins we develop three sets of optimization techniques: item-level, page-level, and index-level pruning. These techniques facilitate pruning with little space and time overhead, and are easily adapted to most join algorithms. We verify the performance of these techniques experimentally.
Computer and Information Science, 2015
In recent years, uncertainty management has become an important concern as the presence of uncertain data has increased rapidly. Several advanced technologies have been developed to record large quantities of data continuously, and the resulting data may contain errors or be only partially complete. Instead of dealing with data uncertainty by removing it, we must treat it as a source of information. To deal with such data, a database management system should have special features for handling uncertain data. The aim of this paper is twofold: on one hand, to introduce some main concepts of uncertainty in databases by focusing on different data management issues in uncertain databases, such as join and query processing, database integration, indexing uncertain data, security and information leakage, and representation formalisms; on the other hand, to provide a survey of the current database management systems dealing with uncertain data, presenting their features and comparing them.
2007
Uncertainty in categorical data is commonplace in many applications, including data cleaning, database integration, and biological annotation. In such domains, the correct value of an attribute is often unknown, but may be selected from a reasonable number of alternatives. Current database management systems do not provide a convenient means for representing or manipulating this type of uncertainty. In this paper we extend traditional systems to explicitly handle uncertainty in data values. We propose two index structures for efficiently searching uncertain categorical data, one based on the R-tree and another based on an inverted index structure. Using these structures, we provide a detailed description of the probabilistic equality queries they support. Experimental results using real and synthetic datasets demonstrate how these index structures can effectively improve the performance of queries through the use of internal probabilistic information.