Academia.edu no longer supports Internet Explorer.
To browse Academia.edu and the wider internet faster and more securely, please take a few seconds to upgrade your browser.
2007, 2007 IEEE 23rd International Conference on Data Engineering
Modern enterprise applications are forced to deal with unreliable, inconsistent and imprecise information. Probabilistic databases can model such data naturally, but SQL query evaluation on probabilistic databases is difficult: previous approaches have either restricted the SQL queries, or computed approximate probabilities, or did not scale, and it was shown recently that precise query evaluation is theoretically hard. In this paper we describe a novel approach, which computes and ranks efficiently the top-k answers to a SQL query on a probabilistic database. The restriction to top-k answers is natural, since imprecisions in the data often lead to a large number of answers of low quality, and users are interested only in the answers with the highest probabilities. The idea in our algorithm is to run in parallel several Monte-Carlo simulations, one for each candidate answer, and approximate each probability only to the extent needed to compute correctly the top-k answers.
Lecture Notes in Computer Science, 2012
There has been much interest in answering top-k queries on probabilistic data in various applications such as market analysis, personalised services, and decision making. In relation to probabilistic data, the most common problem in answering top-k queries is selecting the semantics of results according to their scores and top-k probabilities. In this paper, we propose a novel top-k best probability query to obtain results which are not only the best top-k scores but also the best topk probabilities. We also introduce an efficient algorithm for top-k best probability queries without requiring the user's defined threshold. Then, the top-k best probability answer is analysed, which satisfies the semantic ranking properties of queries [3, 18] on uncertain data. The experimental studies are tested with both the real data to verify the effectiveness of the top-k best probability queries and the efficiency of our algorithm.
There has been much interest in answering top-k queries on probabilistic data in various applications such as market analysis, personalised services, and decision making. In probabilistic relational databases, the most common problem in answering top-k queries (ranking queries) is selecting the top-k result based on scores and top-k probabilities. In this paper, we firstly propose novel answers to top-k best probability queries by selecting the probabilistic tuples which have not only the best top-k scores but also the best top-k probabilities. An efficient algorithm for top-k best probability queries is introduced without requiring users to define a threshold. The top-k best probability approach is a more efficient and effective than the probability threshold approach (PT-k) [1, 2]. Second, we add the "k-best ranking score" into the set of semantic properties for ranking queries on uncertain data proposed by . Then, our proposed method is analysed, which meets the semantic ranking properties on uncertain data. In addition, it proves that the answers to the top-k best probability queries overcome drawbacks of previous definitions of the top-k queries on probabilistic data in terms of semantic ranking properties. Lastly, we conduct an extensive experimental study verifying the effectiveness of answers to the top-k best probability queries compared to PT-k queries on uncertain data and the efficiency of our algorithm against the state-of-the-art execution of the PT-k algorithm using both real and synthetic data sets.
The VLDB Journal, 2007
We describe a system that supports arbitrarily complex SQL queries with "uncertain" predicates. The query semantics is based on a probabilistic model and the results are ranked, much like in Information Retrieval. Our main focus is query evaluation. We describe an optimization algorithm that can compute efficiently most queries. We show, however, that the data complexity of some queries is #Pcomplete, which implies that these queries do not admit any efficient evaluation methods. For these queries we describe both an approximation algorithm and a Monte-Carlo simulation algorithm.
IEEE Data(base) Engineering Bulletin, 2006
We describe a system that supports arbitrarily complex SQL queries with "uncertain" predicates. The query semantics is based on a probabilis- tic model and the results are ranked, much like in Information Retrieval. Our main focus is query evaluation. We describe an optimization algo- rithm that can compute eciently most queries. We show, however, that the data complexity of some
ACM Transactions on Database Systems, 2008
Ranking and aggregation queries are widely used in data exploration, data analysis, and decision-making scenarios. While most of the currently proposed ranking and aggregation techniques focus on deterministic data, several emerging applications involve data that is unclean or uncertain. Ranking and aggregating uncertain (probabilistic) data raises new challenges in query semantics and processing, making conventional methods inapplicable. Furthermore, uncertainty imposes probability as a new ranking dimension that does not exist in the traditional settings. In this article we introduce new probabilistic formulations for top- k and ranking-aggregate queries in probabilistic databases. Our formulations are based on marriage of traditional top- k semantics with possible worlds semantics. In the light of these formulations, we construct a generic processing framework supporting both query types, and leveraging existing query processing and indexing capabilities in current RDBMSs. The fr...
2007
Top-k processing in uncertain databases is semantically and computationally different from traditional top-k processing. The interplay between query scores and data uncertainty makes traditional techniques inapplicable. We introduce URank, a system that processes new probabilistic formulations of top-k queries in uncertain databases. The new formulations are based on marriage of traditional top-k semantics with possible worlds semantics. URank encapsulates a new processing framework that leverages existing query processing capabilities, and implements efficient search strategies that integrate ranking on scores with ranking on probabilities, to obtain meaningful answers for top-k queries.
Proceedings of the 11th international conference on Extending database technology Advances in database technology - EDBT '08, 2008
Recently, many new applications, such as sensor data monitoring and mobile device tracking, raise up the issue of uncertain data management. Compared to "certain" data, the data in the uncertain database are not exact points, which, instead, often locate within a region. In this paper, we study the ranked queries over uncertain data. In fact, ranked queries have been studied extensively in traditional database literature due to their popularity in many applications, such as decision making, recommendation raising, and data mining tasks. Many proposals have been made in order to improve the efficiency in answering ranked queries. However, the existing approaches are all based on the assumption that the underlying data are exact (or certain). Due to the intrinsic differences between uncertain and certain data, these methods are designed only for ranked queries in certain databases and cannot be applied to uncertain case directly. Motivated by this, we propose novel solutions to speed up the probabilistic ranked query (PRank) over the uncertain database. Specifically, we introduce two effective pruning methods, spatial and probabilistic, to help reduce the PRank search space. Then, we seamlessly integrate these pruning heuristics into the PRank query procedure. Extensive experiments have demonstrated the efficiency and effectiveness of our proposed approach in answering PRank queries, in terms of both wall clock time and the number of candidates to be refined.
Lecture Notes in Computer Science, 2013
Over the past decade, the two research areas of probabilistic databases and probabilistic programming have intensively studied the problem of making structured probabilistic inference scalable, but-so far-both areas developed almost independently of one another. While probabilistic databases have focused on describing tractable query classes based on the structure of query plans and data lineage, probabilistic programming has contributed sophisticated inference techniques based on knowledge compilation and lifted (first-order) inference. Both fields have developed their own variants of-both exact and approximate-top-k algorithms for query evaluation, and both investigate query optimization techniques known from SQL, Datalog, and Prolog, which all calls for a more intensive study of the commonalities and integration of the two fields. Moreover, we believe that natural-language processing and information extraction will remain a driving factor and in fact a longstanding challenge for developing expressive representation models which can be combined with structured probabilistic inference-also for the next decades to come.
Foundations and Trends® in Databases
Probabilistic data is motivated by the need to model uncertainty in large databases. Over the last twenty years or so, both the Database community and the AI community have studied various aspects of probabilistic relational data. This survey presents the main approaches developed in the literature, reconciling concepts developed in parallel by the two research communities. The survey starts with an extensive discussion of the main probabilistic data models and their relationships, followed by a brief overview of model counting and its relationship to probabilistic data. After that, the survey discusses lifted probabilistic inference, which are a suite of techniques developed in parallel by the Database and AI communities for probabilistic query evaluation. Then, it gives a short summary of query compilation, presenting some theoretical results highlighting limitations of various query evaluation techniques on probabilistic data. The survey ends with a very brief discussion of some popular probabilistic data sets, systems, and applications that build on this technology.
bvicam.ac.in
Databases today are deterministic, that is, an item is either in the database or not. Similarly, a tuple is either in the query result or not. This process of mapping the real world inherently includes ambiguities and uncertainties and is seldom perfect. In today's data-driven competitive world a wide range of applications have emerged that needs to handle very large, imprecise data sets with inherent uncertainties. Uncertain data is natural in many important real world applications like environmental surveillance, market analysis and quantitative economic research. Data uncertainty innate in these important real world applications is generally the result of factors like data randomness and incompleteness, misaligned schemas, limitations of measuring equipment, delayed data update, imprecise queries etc . Due to the importance of these applications and the rapidly increasing amount of uncertain data collected and accumulated, analyzing large collections of uncertain data has become an important task and has attracted more and more interest from the database community. Probabilistic Databases hold the promise of being a viable means for large-scale uncertainty management, increasingly being required in a large number of real world application domains . A probabilistic database is an uncertain database in which the possible worlds have associated probabilities, that is, an item belongs to the database is a probabilistic event either with tuple-existence uncertainty or with attribute-value uncertainty. However, a tuple as an answer to query is again a probabilistic event. An important aspect in tackling the research and development on uncertain data processing is the query answering techniques on uncertain and probabilistic data. Query processing in probabilistic databases remains a computational challenge as it is fundamentally more complex than other data models. There exists a rich collection of powerful, non-trivial techniques and results, some old, some very recent, that could lead to practical management techniques for probabilistic databases. However, all such techniques suffer from limitations of uncertainty inherent in result of the query. Hence, there is a need for a general probabilistic model that tackles this uncertainty at the grass root level. The basic tool for dealing with this uncertainty is probability which is defined for an event as the proportion of times that the event would occur in repetitions of essentially identical situations. Although useful and successful in many applications, probability theory is, in fact, appropriate for dealing with only a very special type of uncertainty for measuring information. Probabilistic databases are all the more susceptible to uncertainties in query results being exclusively dependent on the probabilities assigned with inherent uncertainty in the evaluation of probabilities. Thus it becomes a potential area where this fundamental problem can be addressed and a suitable correction can be made to probabilities evaluated thereof.
2006
This talk describes research done at the University of Washington on the SQL query evaluation problem on probabilistic databases. The motivation comes from managing imprecisions in data: fuzzy object matching, information extracted from text, constraint violations. There are three dimensions to the query evaluation problem: the probabilistic data model, the complexity of the SQL queries, and whether output probabilities are exact or approximated. In the simplest probabilistic data model every tuple t is an independent probabilistic event, whose probability p represents the probability that t belongs to the database. For example, in information extraction every fact (tuple t) extracted from the text has a probability p of being correct, and for any two tuples t, t their probabilities are independent. Single block SQL queries without duplicate elimination can be evaluated simply by multiplying probabilities during join operations. But when duplicate elimination or other forms of aggregations are present, then the story is more complex. For some queries we can find a query plan such that independence still holds at each projection/duplicate-elimination operator, and thus evaluate the query efficiently. But other queries are #P-hard, and it is unlikely that they can be evaluated efficiently, and there is a simple criterion to distinguish between these two kinds of queries. Moving to a slightly more complex data model, we consider the case when tuples are either independent or exclusive (disjoint). For example, in fuzzy object matching an object ''Washington U.'' in one database matches both ''University of Washington'' with probability 0.4 and ''Washington University in St. Louis'' with probability 0.3 in a second database. This can be represented by two tuples t, t with probabilities 0.4 and 0.3, which are exclusive events. Here, too, there is a crisp separation of queries that can be evaluated efficiently and those that are #P-hard. Finally, we considered a slightly different query semantics: rank the query's answers by their probabilities, and return only the top k answers. Thus, the exact output probabilities are not important, only their ranking, and only for the top k answers. This is justified in applications of imprecise data, where probabilities have little semantics and only the top answers are meaningful. We have found that a combination of Monte Carlo simulation with in-engine SQL query evaluation scales both with the data size and the query complexity.
Information Sciences, 2013
Due to the existence of uncertain data in a wide spectrum of real applications, uncertain query processing has become increasingly important, which dramatically differs from handling certain data in a traditional database. In this paper, we formulate and tackle an important query, namely probabilistic top-k dominating (PTD) query, in the uncertain database. In particular, a PTD query retrieves k uncertain objects that are expected to dynamically dominate the largest number of uncertain objects. We propose an effective pruning approach to reduce the PTD search space, and present an efficient query procedure to answer PTD queries. Furthermore, approximate PTD query processing and the case where the PTD query is issued from an uncertain query object are also discussed. Extensive experiments have demonstrated the efficiency and effectiveness of our proposed PTD query processing approaches.
2007
Top-k processing in uncertain databases is semantically and computationally different from traditional top-k processing. The interplay between score and uncertainty makes traditional techniques inapplicable. We introduce new probabilistic formulations for top-k queries. Our formulations are based on "marriage" of traditional top-k semantics and possible worlds semantics. In the light of these formulations, we construct a framework that encapsulates a state space model and efficient query processing techniques to tackle the challenges of uncertain data settings. We prove that our techniques are optimal in terms of the number of accessed tuples and materialized search states. Our experiments show the efficiency of our techniques under different data distributions with orders of magnitude improvement over naïve materialization of possible worlds.
Lecture Notes in Computer Science, 2007
We study the evaluation of positive conjunctive queries with Boolean aggregate tests (similar to HAVING queries in SQL) on probabilistic databases. Our motivation is to handle aggregate queries over imprecise data resulting from information integration or information extraction. More precisely, we study conjunctive queries with predicate aggregates using MIN, MAX, COUNT, SUM, AVG or COUNT(DISTINCT) on probabilistic databases. Computing the precise output probabilities for positive conjunctive queries (without HAVING) is P-hard, but is in P for a restricted class of queries called safe queries. Further, for queries without self-joins either a query is safe or its data complexity is P-Hard, which shows that safe queries exactly capture tractable queries without self-joins. In this paper, for each aggregate above, we find a class of queries that exactly capture efficient evaluation for HAVING queries without self-joins. Our algorithms use a novel technique to compute the marginal distributions of elements in a semiring, which may be of independent interest.
Proceedings of the 2003 ACM SIGMOD international conference on on Management of data - SIGMOD '03, 2003
Many applications employ sensors for monitoring entities such as temperature and wind speed. A centralized database tracks these entities to enable query processing. Due to continuous changes in these values and limited resources (e.g., network bandwidth and battery power), it is often infeasible to store the exact values at all times. A similar situation exists for moving object environments that track the constantly changing locations of objects. In this environment, it is possible for database queries to produce incorrect or invalid results based upon old data. However, if the degree of error (or uncertainty) between the actual value and the database value is controlled, one can place more confidence in the answers to queries. More generally, query answers can be augmented with probabilistic estimates of the validity of the answers. In this paper we study probabilistic query evaluation based upon uncertain data. A classification of queries is made based upon the nature of the result set. For each class, we develop algorithms for computing probabilistic answers. We address the important issue of measuring the quality of the answers to these queries, and provide algorithms for efficiently pulling data from relevant sensors or moving objects in order to improve the quality of the executing queries. Extensive experiments are performed to examine the effectiveness of several data update policies.
2008 IEEE 24th International Conference on Data Engineering, 2008
In this paper, we propose a novel type of probabilistic threshold top-k queries on uncertain data, and give an exact algorithm. More details can be found in [4].
2010
Recently, many new applications, such as sensor data monitoring and mobile device tracking, raise up the issue of uncertain data management. Compared to "certain" data, the data in the uncertain database are not exact points, which, instead, often reside within a region. In this paper, we study the ranked queries over uncertain data. In fact, ranked queries have been studied extensively in traditional database literature due to their popularity in many applications, such as decision making, recommendation raising, and data mining tasks. Many proposals have been made in order to improve the efficiency in answering ranked queries. However, the existing approaches are all based on the assumption that the underlying data are exact (or certain). Due to the intrinsic differences between uncertain and certain data, these methods are designed only for ranked queries in certain databases and cannot be applied to uncertain case directly. Motivated by this, we propose novel solutions to speed up the probabilistic ranked query (PRank) with monotonic preference functions over the uncertain database. Specifically, we introduce two effective pruning methods, spatial and probabilistic pruning, to help reduce the PRank search space. A special case of PRank with linear preference functions is also studied. Then, we seamlessly integrate these pruning heuristics into the PRank query procedure. Furthermore, we propose and tackle the PRank query processing over the join of two distinct uncertain databases. Extensive experiments have demonstrated the efficiency and effectiveness of our proposed approaches in answering PRank queries, in terms of both wall clock time and the number of candidates to be refined.
2005
The paper studies two probabilistic query evaluation problems. The general setting is that we are given a probability distribution on all possible database instances and have to compute the probability of a tuple belonging to the query’s answer. In the deterministic view problem, we are given a set of view instances and are asked to determine the probability of a tuple belonging to a query’s answer in presence of data statistics and common world knowledge. This is related to the open world assumption in query answering using views. We show that the data complexity is NP-complete and identify important cases when it becomes PTIME, and when the query can be answered by a datalog program. In the second problem we consider, the views themselves are probabilistic: with uncertainties associated with the tuples in the views. It is unclear a priori what probability distribution on instances to assume here: we argue that a certain entropy-maximization distribution is the ”right one”, and sho...
Proceedings of the 13th International Conference on Extending Database Technology - EDBT '10, 2010
There are two broad approaches to query evaluation over probabilistic databases: (1) Intensional Methods proceed by manipulating expressions over symbolic events associated with uncertain tuples. This approach is very general and can be applied to any query, but requires an expensive postprocessing phase, which involves some general-purpose probabilistic inference.
Synthesis Lectures on Data Management is edited by Tamer Özsu of the University of Waterloo. The series will publish 50-to 125 page publications on topics pertaining to data management. The scope will largely follow the purview of premier information and computer science conferences, such as ACM SIGMOD, VLDB, ICDE, PODS, ICDT, and ACM KDD. Potential topics include, but not are limited to: query languages, database system architectures, transaction management, data warehousing, XML and databases, data stream systems, wide scale data distribution, multimedia data management, data mining, and related subjects.
Loading Preview
Sorry, preview is currently unavailable. You can download the paper by clicking the button above.