The most probable database problem

Dan Suciu

2014

Abstract

This paper proposes a novel inference task for probabilistic databases: the most probable database (MPD) problem. The MPD is the most probable deterministic database where a given query or constraint is true. We highlight two distinctive applications, in database repair of key and dependency constraints, and in finding most probable explanations in statistical relational learning. The MPD problem raises new theoretical questions, such as the possibility of a dichotomy theorem for MPD, classifying queries as being either PTIME or NP-hard. We show that such a dichotomy would diverge from dichotomies for other inference tasks. We then prove a dichotomy for queries that represent unary functional dependency constraints. Finally, we discuss symmetric probabilities and the opportunities for lifted inference.
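To make the definition concrete, here is a minimal brute-force sketch, assuming a tuple-independent probabilistic database represented as a dictionary from tuples to probabilities. The relation `Lives(person, city)`, its probabilities, and the function names are hypothetical illustrations and do not come from the paper. The sketch enumerates all possible worlds, so it only shows what the MPD is; it is not an efficient method.

```python
from itertools import product

def world_probability(world, probs):
    """Probability of one deterministic world, assuming independent tuples."""
    p = 1.0
    for t, pt in probs.items():
        p *= pt if t in world else 1.0 - pt
    return p

def most_probable_database(probs, query):
    """Enumerate all 2^n worlds and keep the most probable one where query holds."""
    tuples = list(probs)
    best_world, best_p = None, -1.0
    for mask in product([False, True], repeat=len(tuples)):
        world = frozenset(t for t, keep in zip(tuples, mask) if keep)
        if not query(world):
            continue
        p = world_probability(world, probs)
        if p > best_p:
            best_world, best_p = world, p
    return best_world, best_p

# Hypothetical relation Lives(person, city) with made-up tuple probabilities.
probs = {
    ("alice", "paris"): 0.9,
    ("alice", "rome"): 0.6,
    ("bob", "oslo"): 0.3,
}

def key_constraint(world):
    """Constraint Q: person is a key, so each person has at most one city."""
    persons = [person for person, _ in world]
    return len(persons) == len(set(persons))

world, p = most_probable_database(probs, key_constraint)
```

Without the constraint, the most probable world would keep both `alice` tuples; the constraint forces the search to drop one of them, which is the sense in which the MPD is conditioned on the query being true.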

Key takeaways

Moreover, we show that data repair and cleaning problems [5,16] on probabilistic databases are natural instances of the general MPD task.
Asymmetric MPD: This is the most general setting, where the probabilistic database can contain arbitrary tuple probabilities.
We now highlight two applications of MPD, one in probabilistic databases, and one in statistical relational learning.
Given a probabilistic database that expresses the confidence we have in the correctness of each tuple, we can compute the most probable database that adheres to the data constraints Q. MPD thus provides a principled probabilistic framework for repair.
We will show that computing the MPD for query Q_match is tractable with any probabilistic database.
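The repair takeaway above can be illustrated for a single key constraint. The following is a sketch under the assumption of independent tuples, not the paper's algorithm: because the probability of a world factorizes over groups of tuples sharing a key, each group can be repaired separately by comparing "keep nothing" against "keep exactly one tuple". The function name `repair_key_constraint`, the `key_of` parameter, and the sample data are hypothetical.

```python
from collections import defaultdict
from math import prod

def repair_key_constraint(probs, key_of):
    """Most probable world with at most one tuple per key (independent tuples)."""
    groups = defaultdict(list)
    for t in probs:
        groups[key_of(t)].append(t)

    repaired, total = set(), 1.0
    for members in groups.values():
        # Option 1: keep no tuple from this key group.
        best_choice = None
        best_p = prod(1.0 - probs[s] for s in members)
        # Option 2: keep exactly one tuple t and drop the others.
        for t in members:
            p = probs[t] * prod(1.0 - probs[s] for s in members if s != t)
            if p > best_p:
                best_choice, best_p = t, p
        if best_choice is not None:
            repaired.add(best_choice)
        total *= best_p
    return repaired, total

# Hypothetical relation Lives(person, city); person is the key.
probs = {
    ("alice", "paris"): 0.9,
    ("alice", "rome"): 0.6,
    ("bob", "oslo"): 0.3,
}
repaired, total = repair_key_constraint(probs, key_of=lambda t: t[0])
```

In this toy instance the repair keeps the more confident `alice` tuple and drops the `bob` tuple, since a tuple with probability below 0.5 is more likely absent than present.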

Dan Suciu

2012

We study the complexity of computing a query on a probabilistic database. We consider unions of conjunctive queries, UCQ, which are equivalent to positive, existential First Order Logic sentences, and also to non-recursive datalog programs. The tuples in the database are independent random events. We prove the following dichotomy theorem. For every UCQ query, either its probability can be computed in polynomial time in the size of the database, or it is #P-hard. Our result also has applications to the problem of computing the probability of positive, Boolean expressions, and establishes a dichotomy for such classes based on their structure. For the tractable case, we give a very simple algorithm that alternates between two steps: applying the inclusion/exclusion formula, and removing one existential variable. A key and novel feature of this algorithm is that it avoids computing terms that cancel out in the inclusion/exclusion formula; in other words, it only computes those terms whose Möbius function in an appropriate lattice is non-zero. We show that this simple feature is a key ingredient needed to ensure completeness. For the hardness proof, we give a reduction from the counting problem for positive, partitioned 2CNF, which is known to be #P-complete. The hardness proof is non-trivial, and combines techniques from logic, classical algebra, and analysis.