1995
We consider an unreliable database as a random variable defined from a relational database with various probabilistic models. For a given query Q, we define its reliability on a database DB, p_Q(DB), as the probability that the answer to Q on an unreliable random instance coincides with the answer to Q on DB. We investigate the computational complexity of computing p_Q(DB) when Q is defined in various logic-based languages. We show that p_Q(DB) is computable in polynomial time when Q is defined in first-order logic and that p_Q(DB) is #P-computable when Q is defined in Datalog. We then discuss possible ways of estimating the reliability for natural distributions.
1 Introduction. Consider a relational database built by many transactions. As with most other physical processes, transactions succeed only with high probability, and after enough time there is a discrepancy between the expected database DB and the real one DB'. Most users know that their database contains wrong data and that some correct data is missing. This situation may not influence the answer to some query Q, but the answer to some other query Q' may be wrong. We address the question of what is an error model for databases and what is the reliability of a query.
The VLDB Journal, 2009
We study the evaluation of positive conjunctive queries with Boolean aggregate tests (similar to HAVING in SQL) on probabilistic databases. More precisely, we study conjunctive queries with predicate aggregates on probabilistic databases where the aggregation function is one of MIN, MAX, EXISTS, COUNT, SUM, AVG, or COUNT(DISTINCT) and the comparison function is one of =, ≠, ≥, >, ≤, or <. The complexity of evaluating a HAVING query depends on the aggregation function, α, and the comparison function, θ. In this paper, we establish a set of trichotomy results for conjunctive queries with HAVING predicates parametrized by (α, θ). For such queries (without self-joins), one of the following three statements is true: (1) The exact evaluation problem has P-time data complexity. In this case, we call the query safe.
2014
This paper proposes a novel inference task for probabilistic databases: the most probable database (MPD) problem. The MPD is the most probable deterministic database where a given query or constraint is true. We highlight two distinctive applications, in database repair of key and dependency constraints, and in finding most probable explanations in statistical relational learning. The MPD problem raises new theoretical questions, such as the possibility of a dichotomy theorem for MPD, classifying queries as being either PTIME or NP-hard. We show that such a dichotomy would diverge from dichotomies for other inference tasks. We then prove a dichotomy for queries that represent unary functional dependency constraints. Finally, we discuss symmetric probabilities and the opportunities for lifted inference.
Journal of Computer and System Sciences, 2014
We address the issue of incorporating a particular yet expressive form of integrity constraints (namely, denial constraints) into probabilistic databases. To this aim, we move away from the common way of giving semantics to probabilistic databases, which relies on considering a unique interpretation of the data, and address two fundamental problems: consistency checking and query evaluation. The former consists in verifying whether there is an interpretation which conforms to both the marginal probabilities of the tuples and the integrity constraints. The latter is the problem of answering queries under a "cautious" paradigm, taking into account all interpretations of the data in accordance with the constraints. In this setting, we investigate the complexity of the above-mentioned problems, and identify several tractable cases of practical relevance.
IEEE Data(base) Engineering Bulletin, 2006
We describe a system that supports arbitrarily complex SQL queries with "uncertain" predicates. The query semantics is based on a probabilistic model and the results are ranked, much like in Information Retrieval. Our main focus is query evaluation. We describe an optimization algorithm that can compute efficiently most queries. We show, however, that the data complexity of some queries is #P-complete.
… Twente Data Management Workshop TDM'06, 2006
This talk describes research done at the University of Washington on the SQL query evaluation problem on probabilistic databases. The motivation comes from managing imprecisions in data: fuzzy object matching, information extracted from text, constraint violations. There are three dimensions to the query evaluation problem: the probabilistic data model, the complexity of the SQL queries, and whether output probabilities are exact or approximated. In the simplest probabilistic data model every tuple t is an independent probabilistic event, whose probability p represents the probability that t belongs to the database. For example, in information extraction every fact (tuple t) extracted from the text has a probability p of being correct, and for any two tuples t, t′ their probabilities are independent. Single-block SQL queries without duplicate elimination can be evaluated simply by multiplying probabilities during join operations. But when duplicate elimination or other forms of aggregation are present, the story is more complex. For some queries we can find a query plan such that independence still holds at each projection/duplicate-elimination operator, and thus evaluate the query efficiently. But other queries are #P-hard, and it is unlikely that they can be evaluated efficiently; there is a simple criterion to distinguish between these two kinds of queries. Moving to a slightly more complex data model, we consider the case when tuples are either independent or exclusive (disjoint). For example, in fuzzy object matching an object "Washington U." in one database matches both "University of Washington" with probability 0.4 and "Washington University in St. Louis" with probability 0.3 in a second database. This can be represented by two tuples t, t′ with probabilities 0.4 and 0.3, which are exclusive events. Here, too, there is a crisp separation of queries that can be evaluated efficiently and those that are #P-hard.
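The "multiply probabilities during join operations" idea for single-block queries on tuple-independent databases can be sketched in a few lines. The relations, data, and function name below are hypothetical conveniences, not the system described in the talk:

```python
# Sketch: extensional evaluation of a single-block join on a
# tuple-independent probabilistic database. Each tuple carries an
# independent marginal probability; a join result's probability is
# the product of its constituent tuples' probabilities.

# Hypothetical relations R(a, b) and S(b, c) with per-tuple probabilities
R = {("a1", "b1"): 0.9, ("a2", "b1"): 0.5}
S = {("b1", "c1"): 0.8}

def join_probabilities(R, S):
    """Join R and S on the shared attribute b; multiply probabilities."""
    out = {}
    for (a, b), p in R.items():
        for (b2, c), q in S.items():
            if b == b2:
                # tuple independence justifies the simple product
                out[(a, b, c)] = p * q
    return out

print(join_probabilities(R, S))
# ≈ {('a1', 'b1', 'c1'): 0.72, ('a2', 'b1', 'c1'): 0.4}
```

Once a projection with duplicate elimination is added, this naive product rule breaks down unless independence still holds at each operator, which is exactly the distinction between safe and #P-hard queries made above.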
Finally, we considered a slightly different query semantics: rank the query's answers by their probabilities, and return only the top k answers. Thus, the exact output probabilities are not important, only their ranking, and only for the top k answers. This is justified in applications of imprecise data, where probabilities have little semantics and only the top answers are meaningful. We have found that a combination of Monte Carlo simulation with in-engine SQL query evaluation scales both with the data size and the query complexity.
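The top-k semantics above can be illustrated with a rough Monte Carlo sketch (the encoding of answers by witness sets is a made-up simplification, not the actual in-engine system): sample possible worlds, count how often each answer holds, and rank by the estimates.

```python
import random

def estimate_probs(tuples, answers, trials=5000, seed=1):
    """tuples: {tuple_id: prob}; answers: {answer: set of tuple_ids that
    must all be present for the answer to hold}. Returns an estimated
    probability per answer by sampling possible worlds."""
    rng = random.Random(seed)
    hits = dict.fromkeys(answers, 0)
    for _ in range(trials):
        # sample one possible world: keep each tuple with its probability
        world = {t for t, p in tuples.items() if rng.random() < p}
        for a, witness in answers.items():
            if witness <= world:
                hits[a] += 1
    return {a: h / trials for a, h in hits.items()}

def top_k(probs, k):
    """Rank answers by estimated probability; only the order matters."""
    return sorted(probs, key=probs.get, reverse=True)[:k]

est = estimate_probs({1: 0.9, 2: 0.3}, {"x": {1}, "y": {2}})
print(top_k(est, 1))  # ['x']
```

As the abstract notes, the ranking only needs the estimates to be accurate enough to order the top answers, which is why Monte Carlo scales well here.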
Lecture Notes in Computer Science, 2007
We study the evaluation of positive conjunctive queries with Boolean aggregate tests (similar to HAVING queries in SQL) on probabilistic databases. Our motivation is to handle aggregate queries over imprecise data resulting from information integration or information extraction. More precisely, we study conjunctive queries with predicate aggregates using MIN, MAX, COUNT, SUM, AVG or COUNT(DISTINCT) on probabilistic databases. Computing the precise output probabilities for positive conjunctive queries (without HAVING) is #P-hard, but is in P for a restricted class of queries called safe queries. Further, for queries without self-joins either a query is safe or its data complexity is #P-hard, which shows that safe queries exactly capture tractable queries without self-joins. In this paper, for each aggregate above, we find a class of queries that exactly capture efficient evaluation for HAVING queries without self-joins. Our algorithms use a novel technique to compute the marginal distributions of elements in a semiring, which may be of independent interest.
Proceedings of the VLDB Endowment, 2014
A database is called uncertain if two or more tuples of the same relation are allowed to agree on their primary key. Intuitively, such tuples act as alternatives for each other. A repair (or possible world) of such an uncertain database is obtained by selecting a maximal number of tuples without ever selecting two tuples of the same relation that agree on their primary key. For a Boolean query q, the problem CERTAINTY(q) takes as input an uncertain database db and asks whether q evaluates to true on every repair of db. In recent years, the complexity of CERTAINTY(q) has been studied under different restrictions on q. These complexity studies have assumed no restrictions on the uncertain databases that are input to CERTAINTY(q). In practice, however, it may be known that these input databases are partially consistent, in the sense that they satisfy some dependencies (e.g., functional dependencies). In this article, we introduce the problem CERTAINTY(q) in the presence of a set...
2010
Abstract Probabilistic databases hold promise of being a viable means for large-scale uncertainty management, increasingly needed in a number of real-world application domains. However, query evaluation in probabilistic databases remains a computational challenge. Prior work on efficient exact query evaluation in probabilistic databases has largely concentrated on query-centric formulations (e.g., safe plans, hierarchical queries), in that they only consider characteristics of the query and not the data in the database.
2010
A natural way of capturing uncertainty in the relational data model is to allow relations that violate their primary key. A repair of such a relation is obtained by selecting a maximal number of tuples without ever selecting two tuples that agree on their primary key. Given a Boolean query q, CERTAINTY(q) is the problem that takes as input a relational database and asks whether q evaluates to true on every repair of that database. In recent years, CERTAINTY(q) has been studied primarily for conjunctive queries. Conditions have been determined under which CERTAINTY(q) is coNP-complete, first-order expressible, or not first-order expressible. A remaining open question was whether there exist conjunctive queries q without self-joins such that CERTAINTY(q) is in PTIME but not first-order expressible. We answer this question affirmatively.
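The repair semantics behind CERTAINTY(q) is easy to make concrete with a brute-force sketch (exponential in the number of key blocks, so purely illustrative; the relation, key convention, and query below are hypothetical):

```python
from itertools import product

def certain(relation, q):
    """relation: list of tuples whose key is the first attribute.
    Tuples sharing a key are alternatives; a repair picks exactly one
    tuple per key block. Returns True iff the Boolean query q holds in
    every repair."""
    blocks = {}
    for t in relation:
        blocks.setdefault(t[0], []).append(t)
    # enumerate every repair: one tuple chosen from each key block
    for choice in product(*blocks.values()):
        if not q(set(choice)):
            return False
    return True

# Query: "some tuple has value 1 in its second column".
rel = [("k1", 1), ("k1", 2), ("k2", 3)]
print(certain(rel, lambda r: any(v == 1 for _, v in r)))  # False
# The repair {("k1", 2), ("k2", 3)} is a counterexample; without the
# conflicting tuple ("k1", 2) the answer becomes certain:
print(certain([("k1", 1), ("k2", 3)],
              lambda r: any(v == 1 for _, v in r)))       # True
```

The complexity results surveyed above are precisely about when this exponential enumeration can be avoided, and when it provably cannot (unless PTIME = coNP).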
2008
Abstract We study complexity and approximation of queries in an expressive query language for probabilistic databases. The language studied supports the compositional use of confidence computation. It allows for a wide range of new use cases, such as the computation of conditional probabilities and of selections based on predicates that involve marginal and conditional probabilities. These features have important applications in areas such as data cleaning and the processing of sensor data.
2007
We show that for every conjunctive query, the complexity of evaluating it on a probabilistic database is either PTIME or #P-complete, and we give an algorithm for deciding whether a given conjunctive query is PTIME or #P-complete. The dichotomy property is a fundamental result on query evaluation on probabilistic databases and it gives a complete classification of the complexity of conjunctive queries.
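For the special case of Boolean conjunctive queries without self-joins, the tractable side of this dichotomy is characterized by the well-known hierarchical property: for every pair of variables, the sets of atoms containing them must be nested or disjoint. A small checker for that property (the query encoding is a made-up convenience, not the paper's notation) might look like:

```python
def hierarchical(query):
    """query: list of atoms, each (relation_name, tuple_of_variables),
    read as a Boolean conjunctive query with all variables existential.
    Returns True iff the query is hierarchical: for every two variables
    x, y, at(x) and at(y) are nested or disjoint."""
    at = {}  # variable -> set of indices of atoms containing it
    for i, (_, vars_) in enumerate(query):
        for v in vars_:
            at.setdefault(v, set()).add(i)
    vs = list(at)
    for i in range(len(vs)):
        for j in range(i + 1, len(vs)):
            a, b = at[vs[i]], at[vs[j]]
            if not (a <= b or b <= a or not (a & b)):
                return False
    return True

# R(x), S(x, y) is hierarchical (PTIME); R(x), S(x, y), T(y) is not
# (its evaluation is #P-complete on tuple-independent databases).
print(hierarchical([("R", ("x",)), ("S", ("x", "y"))]))                 # True
print(hierarchical([("R", ("x",)), ("S", ("x", "y")), ("T", ("y",))]))  # False
```

The general dichotomy of the abstract covers queries with self-joins as well, where the decision procedure is considerably more involved than this simple nesting test.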
Journal of Computer and System Sciences, 2011
We review in this paper some recent yet fundamental results on evaluating queries over probabilistic databases. While one can see this problem as a special instance of general-purpose probabilistic inference, we describe in this paper two key database-specific techniques that significantly reduce the complexity of query evaluation on probabilistic databases. The first is the separation of the query and the data: we show here that by doing so, one can identify queries whose data complexity is #P-hard, and queries whose data complexity is in PTIME. The second is the aggressive use of previously computed query results (materialized views): in particular, by rewriting a query in terms of views, one can reduce its complexity from #P-complete to PTIME. We describe a notion of a partial representation for views, show how to validate it based on the view definition, then show how to use it during query evaluation.
The VLDB Journal, 2007
We describe a system that supports arbitrarily complex SQL queries with "uncertain" predicates. The query semantics is based on a probabilistic model and the results are ranked, much like in Information Retrieval. Our main focus is query evaluation. We describe an optimization algorithm that can compute efficiently most queries. We show, however, that the data complexity of some queries is #P-complete, which implies that these queries do not admit any efficient evaluation methods. For these queries we describe both an approximation algorithm and a Monte-Carlo simulation algorithm.
We propose an extension of possibilistic databases that also includes provenance. The introduction of provenance makes our model closed under selection with equalities, projection, and join. In addition, query computation with possibilities is polynomial, in contrast with current models that combine provenance with probabilities and have #P complexity.
2012
Abstract While extensive work has been done on evaluating queries over tuple-independent probabilistic databases, query evaluation over correlated data has received much less attention, even though support for correlations is essential for many natural applications of probabilistic databases, e.g., information extraction, data integration, computer vision, etc.
2010
Queries over probabilistic databases are either safe, in which case they can be evaluated entirely in a relational database engine, or unsafe, in which case they need to be evaluated with a general-purpose inference engine at a high cost. This paper proposes a new approach by which every query is evaluated like a safe query inside the database engine, by using a new method called dissociation. A dissociated query is obtained by adding extraneous variables to some atoms until the query becomes safe. We show that the probability of the original query and that of the dissociated query correspond to two well-known scoring functions on graphs, namely graph reliability (which is #P-hard), and the propagation score (which is related to PageRank and is in PTIME): When restricted to graphs, standard query probability is graph reliability, while the dissociated probability is the propagation score. We define a propagation score for conjunctive queries without self-joins and prove (i) that it is always an upper bound for query reliability, and (ii) that both scores coincide for all safe queries. Given the widespread and successful use of graph propagation methods in practice, we argue for the dissociation method as a good and efficient way to rank probabilistic query results, especially for those queries which are highly intractable for exact probabilistic inference.
2010
Abstract This paper introduces a deterministic approximation algorithm with error guarantees for computing the probability of propositional formulas over discrete random variables. The algorithm is based on an incremental compilation of formulas into decision diagrams using three types of decompositions: Shannon expansion, independence partitioning, and product factorization.
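The first of the three decompositions, Shannon expansion, computes P(F) = p_x · P(F|x=1) + (1 − p_x) · P(F|x=0) for a chosen variable x. A minimal sketch of that recursion over independent Boolean variables (the nested-tuple formula encoding is hypothetical, and no decision diagrams or error guarantees are involved here) is:

```python
def variables(f):
    """Free variables of a formula: 'T', 'F', a variable name, or
    ('and'|'or'|'not', subformulas...)."""
    if f in ("T", "F"):
        return set()
    if isinstance(f, str):
        return {f}
    return set().union(*(variables(g) for g in f[1:]))

def cond(f, x, val):
    """Substitute x := val and simplify constants away."""
    if f in ("T", "F"):
        return f
    if isinstance(f, str):
        return ("T" if val else "F") if f == x else f
    op, *args = f
    args = [cond(g, x, val) for g in args]
    if op == "not":
        return {"T": "F", "F": "T"}.get(args[0], ("not", args[0]))
    if op == "and":
        if "F" in args:
            return "F"
        args = [g for g in args if g != "T"]
        return "T" if not args else args[0] if len(args) == 1 else ("and", *args)
    if op == "or":
        if "T" in args:
            return "T"
        args = [g for g in args if g != "F"]
        return "F" if not args else args[0] if len(args) == 1 else ("or", *args)

def prob(f, probs):
    """Probability of formula f under independent marginals probs,
    by repeated Shannon expansion."""
    if f == "T":
        return 1.0
    if f == "F":
        return 0.0
    x = min(variables(f))  # expand on some variable
    return probs[x] * prob(cond(f, x, True), probs) \
        + (1 - probs[x]) * prob(cond(f, x, False), probs)

# P(x or y) with independent P(x) = P(y) = 0.5
print(prob(("or", "x", "y"), {"x": 0.5, "y": 0.5}))  # 0.75
```

The algorithm in the abstract improves on this naive recursion by compiling into decision diagrams and by adding independence partitioning and product factorization, which keep the expansion from blowing up on formulas with independent or factorable parts.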