1995
We consider an unreliable database as a random variable defined from a relational database with various probabilistic models. For a given query Q, we define its reliability on a database DB, p_Q(DB), as the probability that the answer to Q on an unreliable random instance coincides with the answer to Q on DB. We investigate the computational complexity of computing p_Q(DB) when Q is defined in various logic-based languages. We show that p_Q(DB) is computable in polynomial time when Q is defined in first-order logic and that p_Q(DB) is #P-computable when Q is defined in Datalog. We then discuss possible ways of estimating the reliability for natural distributions.
1 Introduction. Consider a relational database built by many transactions. As with most other physical processes, transactions succeed only with high probability, and after enough time there is a discrepancy between the expected database DB and the real one DB'. Most users know that their database contains wrong data and that some correct data is missing. This situation may not influence the answer to some query Q, but the answer to some other query Q' may be wrong. We address the question of what is an error model for databases and what is the reliability of a query.
The VLDB Journal, 2009
We study the evaluation of positive conjunctive queries with Boolean aggregate tests (similar to HAVING in SQL) on probabilistic databases. More precisely, we study conjunctive queries with predicate aggregates on probabilistic databases where the aggregation function is one of MIN, MAX, EXISTS, COUNT, SUM, AVG, or COUNT(DISTINCT) and the comparison function is one of =, ≠, ≥, >, ≤, or <. The complexity of evaluating a HAVING query depends on the aggregation function, α, and the comparison function, θ. In this paper, we establish a set of trichotomy results for conjunctive queries with HAVING predicates parametrized by (α, θ). For such queries (without self-joins), one of the following three statements is true: (1) The exact evaluation problem has P-time data complexity. In this case, we call the query safe.
2014
This paper proposes a novel inference task for probabilistic databases: the most probable database (MPD) problem. The MPD is the most probable deterministic database where a given query or constraint is true. We highlight two distinctive applications, in database repair of key and dependency constraints, and in finding most probable explanations in statistical relational learning. The MPD problem raises new theoretical questions, such as the possibility of a dichotomy theorem for MPD, classifying queries as being either PTIME or NP-hard. We show that such a dichotomy would diverge from dichotomies for other inference tasks. We then prove a dichotomy for queries that represent unary functional dependency constraints. Finally, we discuss symmetric probabilities and the opportunities for lifted inference.
Journal of Computer and System Sciences, 2014
We address the issue of incorporating a particular yet expressive form of integrity constraints (namely, denial constraints) into probabilistic databases. To this aim, we move away from the common way of giving semantics to probabilistic databases, which relies on considering a unique interpretation of the data, and address two fundamental problems: consistency checking and query evaluation. The former consists in verifying whether there is an interpretation which conforms to both the marginal probabilities of the tuples and the integrity constraints. The latter is the problem of answering queries under a "cautious" paradigm, taking into account all interpretations of the data in accordance with the constraints. In this setting, we investigate the complexity of the above-mentioned problems, and identify several tractable cases of practical relevance.
IEEE Data(base) Engineering Bulletin, 2006
We describe a system that supports arbitrarily complex SQL queries with "uncertain" predicates. The query semantics is based on a probabilistic model and the results are ranked, much like in Information Retrieval. Our main focus is query evaluation. We describe an optimization algorithm that can compute efficiently most queries. We show, however, that the data complexity of some queries is #P-complete.
… Twente Data Management Workshop TDM'06, 2006
This talk describes research done at the University of Washington on the SQL query evaluation problem on probabilistic databases. The motivation comes from managing imprecisions in data: fuzzy object matching, information extracted from text, constraint violations. There are three dimensions to the query evaluation problem: the probabilistic data model, the complexity of the SQL queries, and whether output probabilities are exact or approximated. In the simplest probabilistic data model every tuple t is an independent probabilistic event, whose probability p represents the probability that t belongs to the database. For example, in information extraction every fact (tuple t) extracted from the text has a probability p of being correct, and for any two tuples t, t′ their probabilities are independent. Single-block SQL queries without duplicate elimination can be evaluated simply by multiplying probabilities during join operations. But when duplicate elimination or other forms of aggregation are present, the story is more complex. For some queries we can find a query plan such that independence still holds at each projection/duplicate-elimination operator, and thus evaluate the query efficiently. But other queries are #P-hard, and it is unlikely that they can be evaluated efficiently; there is a simple criterion to distinguish between these two kinds of queries. Moving to a slightly more complex data model, we consider the case when tuples are either independent or exclusive (disjoint). For example, in fuzzy object matching an object "Washington U." in one database matches both "University of Washington" with probability 0.4 and "Washington University in St. Louis" with probability 0.3 in a second database. This can be represented by two tuples t, t′ with probabilities 0.4 and 0.3, which are exclusive events. Here, too, there is a crisp separation of queries that can be evaluated efficiently and those that are #P-hard.
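The "multiply probabilities during join operations" idea for single-block queries on tuple-independent databases can be sketched in a few lines. The relations, data, and function name below are hypothetical conveniences, not the system described in the talk:

```python
# Sketch: extensional evaluation of a single-block join on a
# tuple-independent probabilistic database. Each tuple carries an
# independent marginal probability; a join result's probability is
# the product of its constituent tuples' probabilities.

# Hypothetical relations R(a, b) and S(b, c) with per-tuple probabilities
R = {("a1", "b1"): 0.9, ("a2", "b1"): 0.5}
S = {("b1", "c1"): 0.8}

def join_probabilities(R, S):
    """Join R and S on the shared attribute b; multiply probabilities."""
    out = {}
    for (a, b), p in R.items():
        for (b2, c), q in S.items():
            if b == b2:
                # tuple independence justifies the simple product
                out[(a, b, c)] = p * q
    return out

print(join_probabilities(R, S))
# ≈ {('a1', 'b1', 'c1'): 0.72, ('a2', 'b1', 'c1'): 0.4}
```

Once a projection with duplicate elimination is added, this naive product rule breaks down unless independence still holds at each operator, which is exactly the distinction between safe and #P-hard queries made above.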
Finally, we considered a slightly different query semantics: rank the query's answers by their probabilities, and return only the top k answers. Thus, the exact output probabilities are not important, only their ranking, and only for the top k answers. This is justified in applications of imprecise data, where probabilities have little semantics and only the top answers are meaningful. We have found that a combination of Monte Carlo simulation with in-engine SQL query evaluation scales both with the data size and the query complexity.
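The top-k semantics above can be illustrated with a rough Monte Carlo sketch (the encoding of answers by witness sets is a made-up simplification, not the actual in-engine system): sample possible worlds, count how often each answer holds, and rank by the estimates.

```python
import random

def estimate_probs(tuples, answers, trials=5000, seed=1):
    """tuples: {tuple_id: prob}; answers: {answer: set of tuple_ids that
    must all be present for the answer to hold}. Returns an estimated
    probability per answer by sampling possible worlds."""
    rng = random.Random(seed)
    hits = dict.fromkeys(answers, 0)
    for _ in range(trials):
        # sample one possible world: keep each tuple with its probability
        world = {t for t, p in tuples.items() if rng.random() < p}
        for a, witness in answers.items():
            if witness <= world:
                hits[a] += 1
    return {a: h / trials for a, h in hits.items()}

def top_k(probs, k):
    """Rank answers by estimated probability; only the order matters."""
    return sorted(probs, key=probs.get, reverse=True)[:k]

est = estimate_probs({1: 0.9, 2: 0.3}, {"x": {1}, "y": {2}})
print(top_k(est, 1))  # ['x']
```

As the abstract notes, the ranking only needs the estimates to be accurate enough to order the top answers, which is why Monte Carlo scales well here.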
Lecture Notes in Computer Science, 2007
We study the evaluation of positive conjunctive queries with Boolean aggregate tests (similar to HAVING queries in SQL) on probabilistic databases. Our motivation is to handle aggregate queries over imprecise data resulting from information integration or information extraction. More precisely, we study conjunctive queries with predicate aggregates using MIN, MAX, COUNT, SUM, AVG or COUNT(DISTINCT) on probabilistic databases. Computing the precise output probabilities for positive conjunctive queries (without HAVING) is #P-hard, but is in P for a restricted class of queries called safe queries. Further, for queries without self-joins either a query is safe or its data complexity is #P-hard, which shows that safe queries exactly capture tractable queries without self-joins. In this paper, for each aggregate above, we find a class of queries that exactly capture efficient evaluation for HAVING queries without self-joins. Our algorithms use a novel technique to compute the marginal distributions of elements in a semiring, which may be of independent interest.
Proceedings of the VLDB Endowment, 2014
A database is called uncertain if two or more tuples of the same relation are allowed to agree on their primary key. Intuitively, such tuples act as alternatives for each other. A repair (or possible world) of such an uncertain database is obtained by selecting a maximal number of tuples without ever selecting two tuples of the same relation that agree on their primary key. For a Boolean query q, the problem CERTAINTY(q) takes as input an uncertain database db and asks whether q evaluates to true on every repair of db. In recent years, the complexity of CERTAINTY(q) has been studied under different restrictions on q. These complexity studies have assumed no restrictions on the uncertain databases that are input to CERTAINTY(q). In practice, however, it may be known that these input databases are partially consistent, in the sense that they satisfy some dependencies (e.g., functional dependencies). In this article, we introduce the problem CERTAINTY(q) in the presence of a set...
2010
Abstract Probabilistic databases hold promise of being a viable means for large-scale uncertainty management, increasingly needed in a number of real-world application domains. However, query evaluation in probabilistic databases remains a computational challenge. Prior work on efficient exact query evaluation in probabilistic databases has largely concentrated on query-centric formulations (e.g., safe plans, hierarchical queries), in that they only consider characteristics of the query and not the data in the database.
2010
A natural way of capturing uncertainty in the relational data model is to allow relations that violate their primary key. A repair of such a relation is obtained by selecting a maximal number of tuples without ever selecting two tuples that agree on their primary key. Given a Boolean query q, CERTAINTY(q) is the problem that takes as input a relational database and asks whether q evaluates to true on every repair of that database. In recent years, CERTAINTY(q) has been studied primarily for conjunctive queries. Conditions have been determined under which CERTAINTY(q) is coNP-complete, first-order expressible, or not first-order expressible. A remaining open question was whether there exist conjunctive queries q without self-joins such that CERTAINTY(q) is in PTIME but not first-order expressible. We answer this question affirmatively.
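The repair semantics behind CERTAINTY(q) is easy to make concrete with a brute-force sketch (exponential in the number of key blocks, so purely illustrative; the relation, key convention, and query below are hypothetical):

```python
from itertools import product

def certain(relation, q):
    """relation: list of tuples whose key is the first attribute.
    Tuples sharing a key are alternatives; a repair picks exactly one
    tuple per key block. Returns True iff the Boolean query q holds in
    every repair."""
    blocks = {}
    for t in relation:
        blocks.setdefault(t[0], []).append(t)
    # enumerate every repair: one tuple chosen from each key block
    for choice in product(*blocks.values()):
        if not q(set(choice)):
            return False
    return True

# Query: "some tuple has value 1 in its second column".
rel = [("k1", 1), ("k1", 2), ("k2", 3)]
print(certain(rel, lambda r: any(v == 1 for _, v in r)))  # False
# The repair {("k1", 2), ("k2", 3)} is a counterexample; without the
# conflicting tuple ("k1", 2) the answer becomes certain:
print(certain([("k1", 1), ("k2", 3)],
              lambda r: any(v == 1 for _, v in r)))       # True
```

The complexity results surveyed above are precisely about when this exponential enumeration can be avoided, and when it provably cannot (unless PTIME = coNP).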
2008
Abstract We study complexity and approximation of queries in an expressive query language for probabilistic databases. The language studied supports the compositional use of confidence computation. It allows for a wide range of new use cases, such as the computation of conditional probabilities and of selections based on predicates that involve marginal and conditional probabilities. These features have important applications in areas such as data cleaning and the processing of sensor data.
2007
We show that for every conjunctive query, the complexity of evaluating it on a probabilistic database is either PTIME or #P-complete, and we give an algorithm for deciding whether a given conjunctive query is PTIME or #P-complete. The dichotomy property is a fundamental result on query evaluation on probabilistic databases and it gives a complete classification of the complexity of conjunctive queries.
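For the special case of Boolean conjunctive queries without self-joins, the tractable side of this dichotomy is characterized by the well-known hierarchical property: for every pair of variables, the sets of atoms containing them must be nested or disjoint. A small checker for that property (the query encoding is a made-up convenience, not the paper's notation) might look like:

```python
def hierarchical(query):
    """query: list of atoms, each (relation_name, tuple_of_variables),
    read as a Boolean conjunctive query with all variables existential.
    Returns True iff the query is hierarchical: for every two variables
    x, y, at(x) and at(y) are nested or disjoint."""
    at = {}  # variable -> set of indices of atoms containing it
    for i, (_, vars_) in enumerate(query):
        for v in vars_:
            at.setdefault(v, set()).add(i)
    vs = list(at)
    for i in range(len(vs)):
        for j in range(i + 1, len(vs)):
            a, b = at[vs[i]], at[vs[j]]
            if not (a <= b or b <= a or not (a & b)):
                return False
    return True

# R(x), S(x, y) is hierarchical (PTIME); R(x), S(x, y), T(y) is not
# (its evaluation is #P-complete on tuple-independent databases).
print(hierarchical([("R", ("x",)), ("S", ("x", "y"))]))                 # True
print(hierarchical([("R", ("x",)), ("S", ("x", "y")), ("T", ("y",))]))  # False
```

The general dichotomy of the abstract covers queries with self-joins as well, where the decision procedure is considerably more involved than this simple nesting test.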
Journal of Computer and System Sciences, 2011
We review in this paper some recent yet fundamental results on evaluating queries over probabilistic databases. While one can see this problem as a special instance of general-purpose probabilistic inference, we describe in this paper two key database-specific techniques that significantly reduce the complexity of query evaluation on probabilistic databases. The first is the separation of the query and the data: we show here that by doing so, one can identify queries whose data complexity is #P-hard, and queries whose data complexity is in PTIME. The second is the aggressive use of previously computed query results (materialized views): in particular, by rewriting a query in terms of views, one can reduce its complexity from #P-complete to PTIME. We describe a notion of a partial representation for views, show how to validate it based on the view definition, then show how to use it during query evaluation.
The VLDB Journal, 2007
We describe a system that supports arbitrarily complex SQL queries with "uncertain" predicates. The query semantics is based on a probabilistic model and the results are ranked, much like in Information Retrieval. Our main focus is query evaluation. We describe an optimization algorithm that can compute efficiently most queries. We show, however, that the data complexity of some queries is #P-complete, which implies that these queries do not admit any efficient evaluation methods. For these queries we describe both an approximation algorithm and a Monte-Carlo simulation algorithm.
We propose an extension of possibilistic databases that also includes provenance. The introduction of provenance makes our model closed under selection with equalities, projection, and join. In addition, query computation with possibilities is polynomial, in contrast with current models that combine provenance with probabilities and have #P complexity.
2012
Abstract While extensive work has been done on evaluating queries over tuple-independent probabilistic databases, query evaluation over correlated data has received much less attention, even though support for correlations is essential for many natural applications of probabilistic databases, e.g., information extraction, data integration, computer vision, etc.
2010
Queries over probabilistic databases are either safe, in which case they can be evaluated entirely in a relational database engine, or unsafe, in which case they need to be evaluated with a general-purpose inference engine at a high cost. This paper proposes a new approach by which every query is evaluated like a safe query inside the database engine, by using a new method called dissociation. A dissociated query is obtained by adding extraneous variables to some atoms until the query becomes safe. We show that the probability of the original query and that of the dissociated query correspond to two well-known scoring functions on graphs, namely graph reliability (which is #P-hard), and the propagation score (which is related to PageRank and is in PTIME): When restricted to graphs, standard query probability is graph reliability, while the dissociated probability is the propagation score. We define a propagation score for conjunctive queries without self-joins and prove (i) that it is always an upper bound for query reliability, and (ii) that both scores coincide for all safe queries. Given the widespread and successful use of graph propagation methods in practice, we argue for the dissociation method as a good and efficient way to rank probabilistic query results, especially for those queries which are highly intractable for exact probabilistic inference.
2010
Abstract This paper introduces a deterministic approximation algorithm with error guarantees for computing the probability of propositional formulas over discrete random variables. The algorithm is based on an incremental compilation of formulas into decision diagrams using three types of decompositions: Shannon expansion, independence partitioning, and product factorization.
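The first of the three decompositions, Shannon expansion, computes P(F) = p_x · P(F|x=1) + (1 − p_x) · P(F|x=0) for a chosen variable x. A minimal sketch of that recursion over independent Boolean variables (the nested-tuple formula encoding is hypothetical, and no decision diagrams or error guarantees are involved here) is:

```python
def variables(f):
    """Free variables of a formula: 'T', 'F', a variable name, or
    ('and'|'or'|'not', subformulas...)."""
    if f in ("T", "F"):
        return set()
    if isinstance(f, str):
        return {f}
    return set().union(*(variables(g) for g in f[1:]))

def cond(f, x, val):
    """Substitute x := val and simplify constants away."""
    if f in ("T", "F"):
        return f
    if isinstance(f, str):
        return ("T" if val else "F") if f == x else f
    op, *args = f
    args = [cond(g, x, val) for g in args]
    if op == "not":
        return {"T": "F", "F": "T"}.get(args[0], ("not", args[0]))
    if op == "and":
        if "F" in args:
            return "F"
        args = [g for g in args if g != "T"]
        return "T" if not args else args[0] if len(args) == 1 else ("and", *args)
    if op == "or":
        if "T" in args:
            return "T"
        args = [g for g in args if g != "F"]
        return "F" if not args else args[0] if len(args) == 1 else ("or", *args)

def prob(f, probs):
    """Probability of formula f under independent marginals probs,
    by repeated Shannon expansion."""
    if f == "T":
        return 1.0
    if f == "F":
        return 0.0
    x = min(variables(f))  # expand on some variable
    return probs[x] * prob(cond(f, x, True), probs) \
        + (1 - probs[x]) * prob(cond(f, x, False), probs)

# P(x or y) with independent P(x) = P(y) = 0.5
print(prob(("or", "x", "y"), {"x": 0.5, "y": 0.5}))  # 0.75
```

The algorithm in the abstract improves on this naive recursion by compiling into decision diagrams and by adding independence partitioning and product factorization, which keep the expansion from blowing up on formulas with independent or factorable parts.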