2019, Information Systems
Certain answers are a widely accepted semantics of query answering over incomplete databases. As their computation is a coNP-hard problem, recent research has focused on developing (polynomial time) evaluation algorithms with correctness guarantees, that is, techniques computing a sound but possibly incomplete set of certain answers. The aim is to make the computation of certain answers feasible in practice, settling for under-approximations. In this paper, we present novel evaluation algorithms with correctness guarantees, which provide better approximations than current techniques, while retaining polynomial time data complexity. The central tools of our approach are conditional tables and the conditional evaluation of queries. We propose different strategies to evaluate conditions, leading to different approximation algorithms: more accurate evaluation strategies have higher running times, but they pay off with more certain answers being returned. Thus, our approach offers a suite of approximation algorithms enabling users to choose the technique that best meets their needs in terms of balance between efficiency and quality of the results.
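To make the conditional-evaluation idea concrete, here is a minimal sketch in the spirit of the abstract (our toy encoding, not the authors' algorithms): a conditional table pairs each tuple with a condition, a selection over a null keeps the tuple but records the equality it would have to satisfy, and a cheap evaluation strategy then reports as certain only the tuples whose conditions are trivially true. All names and data are invented.

```python
# Toy conditional table: a list of (tuple, condition) pairs. Nulls are
# strings starting with "_"; a condition is a list of (lhs, op, rhs) atoms.

def is_null(v):
    return isinstance(v, str) and v.startswith("_")

def select(ctable, attr, value):
    """Conditional selection sigma_{attr = value}: instead of discarding a
    tuple whose attribute is a null, keep it and record the equality as a
    condition to be evaluated later."""
    out = []
    for tup, cond in ctable:
        v = tup[attr]
        if not is_null(v):
            if v == value:
                out.append((tup, cond))
        else:
            out.append((tup, cond + [(v, "=", value)]))
    return out

def certain(ctable):
    """Cheapest evaluation strategy: report a tuple as certain only if it
    contains no nulls and its condition is empty (trivially true)."""
    return [t for t, cond in ctable if not cond and not any(is_null(v) for v in t)]

# R(name, dept): the second tuple has an unknown department "_d1".
R = [(("alice", "cs"), []), (("bob", "_d1"), [])]
print(certain(select(R, 1, "cs")))   # [('alice', 'cs')] is certainly in the answer
```

More accurate strategies would additionally try to prove conditions that hold under every valuation of the nulls, at a higher cost, which is exactly the efficiency/accuracy trade-off the abstract describes.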
2017
Certain answers are a widely accepted semantics of query answering over incomplete databases. Since their computation is a coNP-hard problem, recent research has focused on developing evaluation algorithms with correctness guarantees, that is, techniques computing a sound but possibly incomplete set of certain answers. In this paper, we show how novel evaluation algorithms with correctness guarantees can be developed leveraging conditional tables and the conditional evaluation of queries, while retaining polynomial time data complexity.
Proceedings of the 22nd International Database Engineering & Applications Symposium, 2018
Incomplete information arises in many database applications, such as data integration, data exchange, inconsistency management, data cleaning, ontological reasoning, and many others. A principled way of answering queries over incomplete databases is to compute certain answers, which are query answers that can be obtained from every complete database represented by an incomplete one. For databases containing (labeled) nulls, certain answers to positive queries can be easily computed in polynomial time, but for more general queries with negation the problem becomes coNP-hard. To make query answering feasible in practice, one might resort to SQL's evaluation, but unfortunately, the way SQL behaves in the presence of nulls may result in wrong answers. Thus, on the one hand, SQL's evaluation is efficient but flawed; on the other hand, certain answers are a principled semantics but with high complexity. To deal with this issue, recent research has focused on developing polynomial time evaluation algorithms with correctness guarantees, that is, techniques computing a sound but possibly incomplete set of certain answers.
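The soundness problem with SQL's null handling mentioned above can be seen with a two-line difference query. The following self-contained snippet (using Python's built-in sqlite3; the tables and data are invented for the example) shows SQL returning an answer that is not certain:

```python
# SQL's EXCEPT can return answers that are not certain.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE R(a)")
con.execute("CREATE TABLE S(a)")
con.execute("INSERT INTO R VALUES (1)")
con.execute("INSERT INTO S VALUES (NULL)")   # unknown value: could be 1

# R - S: SQL returns {1}, but 1 is NOT a certain answer, because the null
# in S may stand for 1, in which case the true answer is empty.
print(con.execute("SELECT a FROM R EXCEPT SELECT a FROM S").fetchall())  # [(1,)]
```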
2019
Many database applications face the problem of querying incomplete data. In such scenarios, certain answers are a principled semantics of query answering. Unfortunately, the computation of certain query answers is a coNP-hard problem. To make query answering feasible in practice, recent research has focused on developing polynomial time algorithms computing a sound (but possibly incomplete) set of certain answers. In this paper we present a system prototype implementing a suite of algorithms to compute sound sets of certain answers. The central tools used by our system are conditional tables and the conditional evaluation of relational algebra. Different evaluation strategies can be applied, with more accurate ones having higher complexity, but returning more certain answers, thereby enabling users to choose the technique that best meets their needs in terms of balance between efficiency and quality of the results.
Proceedings of the VLDB Endowment, 2013
We present a system that computes, for a query that may be incomplete, complete approximations from above and from below. We assume a setting where queries are posed over a partially complete database, that is, a database that is generally incomplete, but is known to contain complete information about specific aspects of its application domain. Which parts are complete is described by a set of so-called table-completeness statements. Previous work led to a theoretical framework and an implementation that allowed one to determine whether in such a scenario a given conjunctive query is guaranteed to return a complete set of answers or not. With the present demonstrator we show how to reformulate the original query in such a way that answers are guaranteed to be complete. If there exists a more general complete query, there is a unique most specific one, which we find. If there exists a more specific complete query, there may even be infinitely many. In this case, we find the least specific specializations whose size is bounded by a threshold provided by the user. Generalizations are computed by a fixpoint iteration, employing an answer set programming engine. Specializations are found leveraging unification from logic programming.
Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems
Queries with aggregation and arithmetic operations, as well as incomplete data, are common in real-world databases, but we lack a good understanding of how they should interact. On the one hand, systems based on SQL provide ad-hoc rules for numerical nulls; on the other hand, theoretical research largely concentrates on the standard notions of certain and possible answers. In the presence of numerical attributes and aggregates, however, these answers are often meaningless, returning either too little or too much. Our goal is to define a principled framework for databases with numerical nulls and answering queries with arithmetic and aggregations over them. Towards this goal, we assume that missing values in numerical attributes are given by probability distributions associated with marked nulls. This yields a model of probabilistic bag databases in which tuples are not necessarily independent since nulls can repeat. We provide a general compositional framework for query answering and then concentrate on queries that resemble standard SQL with arithmetic and aggregation. We show that these queries are measurable, and their outputs have a finite representation. Moreover, since the classical forms of answers provide little information in the numerical setting, we look at the probability that numerical values in output tuples belong to specific intervals. Even though their exact computation is intractable, we give efficient approximation algorithms to compute such probabilities.
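The interval-probability question the abstract raises can at least be approached by naive sampling. The sketch below (our simplification, not the paper's algorithms; the schema, distributions, and helper names are all invented) draws one valuation of the marked nulls per trial, so repeated nulls stay correlated, and estimates the probability that a SUM falls in a given interval:

```python
# Estimate by sampling the probability that an aggregate over a table with
# marked numerical nulls falls in a given interval.
import random

# Salaries table: values are either numbers or marked nulls "_n1", "_n2".
# A marked null repeats, so the two "_n1" cells are perfectly correlated.
rows = [("alice", 50000), ("bob", "_n1"), ("carol", "_n1"), ("dan", "_n2")]
dist = {"_n1": lambda: random.gauss(45000, 5000),
        "_n2": lambda: random.uniform(40000, 60000)}

def prob_sum_in(rows, dist, lo, hi, trials=100_000):
    hits = 0
    for _ in range(trials):
        vals = {n: d() for n, d in dist.items()}     # one valuation per trial
        total = sum(v if not isinstance(v, str) else vals[v] for _, v in rows)
        hits += lo <= total <= hi
    return hits / trials

print(prob_sum_in(rows, dist, 180000, 200000))
```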
Proceedings of the 21st ACM international conference on Information and knowledge management - CIKM '12, 2012
Data completeness is an important aspect of data quality. We consider a setting where databases can be incomplete in two ways: records may be missing and records may contain null values. We (i) formalize when the answer set of a query is complete in spite of such incompleteness, and (ii) we introduce table completeness statements, by which one can express that certain parts of a database are complete. We then study how to deduce from a set of table-completeness statements that a query can be answered completely. Null values as used in SQL are ambiguous. They can indicate either that no attribute value exists or that a value exists, but is unknown. We study completeness reasoning for the different interpretations. We show that in the combined case it is necessary to syntactically distinguish between different kinds of null values and present an encoding for doing that in standard SQL databases. With this technique, any SQL DBMS evaluates complete queries correctly with respect to the different meanings that nulls can carry. We study the complexity of completeness reasoning and provide algorithms that in most cases agree with the worst-case lower bounds.
2014
As data collection continues to grow rapidly, the ability to efficiently carry out exploratory searches on the data is becoming more important. An exploratory search can be modeled as an approximate query in a database: retrieve all database elements which are similar to the query. Different forms of approximate queries (using domain-dependent notions of similarity) are already popular in many applications including data cleansing, pattern recognition, bioinformatics, address matching, and Internet search. Currently, the most popular approach for approximate query processing consists of a two-step (phase) process. The first phase is called the filter phase and consists of enumerating a set of q-grams or substrings in a database. The q-grams form the inverted index and the query will use the inverted index to prune those records that are unlikely to match the query. In the second refinement phase, all database records which passed through the first phase are validated to produce the final answer. Despite showing improvement over a full table scan, the two-phase approach for approximate querying is still not practical and is not part of any well-known database management system. This is partly because the index size can be very large - sometimes bigger than the size of the database. In this thesis, we propose an algorithmic approach to selecting q-grams which will constitute the inverted index. We model the q-gram selection problem as an optimization problem and explore several models including vertex cover and feedback vertex set and discuss their trade-offs. We will also evaluate several algorithm design patterns, like greedy, primal-dual, and LP relaxation, to solve the optimization problems. Our particular focus is on evaluating techniques not just for approximate guarantees but also on how easily (or gracefully) they can be implemented and integrated into a modern relational database management system. We will demonstrate that our approach results in an index size that is bounded above by the size of the database and provides no false dismissals and a low false-positive rate. We have implemented our approaches in a database management system and demonstrate how approximate queries can be posed using SQL and be efficiently processed by the system.
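The two-phase filter-and-refine scheme described above can be sketched in a few lines (illustration only: this indexes all q-grams rather than the selected subset the thesis proposes, and the similarity test in the refinement phase is a stand-in):

```python
# Filter-and-refine with a q-gram inverted index.
from collections import defaultdict
from difflib import SequenceMatcher

def qgrams(s, q=2):
    return {s[i:i+q] for i in range(len(s) - q + 1)}

db = ["jones", "johnson", "johansen", "smith"]
index = defaultdict(set)
for rid, rec in enumerate(db):
    for g in qgrams(rec):
        index[g].add(rid)

def approx_query(query, min_shared=2, min_ratio=0.7):
    # Filter phase: prune records sharing too few q-grams with the query.
    counts = defaultdict(int)
    for g in qgrams(query):
        for rid in index[g]:
            counts[rid] += 1
    candidates = [rid for rid, c in counts.items() if c >= min_shared]
    # Refinement phase: validate each surviving record exactly.
    return [db[rid] for rid in candidates
            if SequenceMatcher(None, query, db[rid]).ratio() >= min_ratio]

print(approx_query("johnsen"))   # e.g. ['johnson', 'johansen']
```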
arXiv (Cornell University), 2023
To answer database queries over incomplete data, the gold standard is finding certain answers: those that are true regardless of how incomplete data is interpreted. Such answers can be found efficiently for conjunctive queries and their unions, even in the presence of constraints. With negation added, however, the problem becomes intractable. We concentrate on the complexity of certain answers under constraints, and on efficiently answering queries outside the usual classes of (unions of) conjunctive queries by means of rewriting as Datalog and first-order queries. We first notice that there are three different ways in which query answering can be cast as a decision problem. We complete the existing picture and provide precise complexity bounds on all versions of the decision problem, for certain and best answers. We then study a well-behaved class of queries that extends unions of conjunctive queries with a mild form of negation. We show that for them, certain answers can be expressed in Datalog with negation, even in the presence of functional dependencies, thus making them tractable in data complexity. We show that in general Datalog cannot be replaced by first-order logic, but without constraints such a rewriting can be done in first-order. The paper is under consideration in Theory and Practice of Logic Programming (TPLP).
ACM Transactions on Database Systems, 2014
The term naïve evaluation refers to evaluating queries over incomplete databases as if nulls were usual data values, i.e., to using the standard database query evaluation engine. Since the semantics of query answering over incomplete databases is that of certain answers, we would like to know when naïve evaluation computes them: i.e., when certain answers can be found without inventing new specialized algorithms. For relational databases it is well known that unions of conjunctive queries possess this desirable property, and results on preservation of formulae under homomorphisms tell us that within relational calculus, this class cannot be extended under the open-world assumption. Our goal here is twofold. First, we develop a general framework that allows us to determine, for a given semantics of incompleteness, classes of queries for which naïve evaluation computes certain answers. Second, we apply this approach to a variety of semantics, showing that for many classes of queries beyond unions of conjunctive queries, naïve evaluation makes perfect sense under assumptions different from open-world. Our key observations are: (1) naïve evaluation is equivalent to monotonicity of queries with respect to a semantics-induced ordering, and (2) for most reasonable semantics of incompleteness, such monotonicity is captured by preservation under various types of homomorphisms. Using these results we find classes of queries for which naïve evaluation works, e.g., positive first-order formulae for the closed-world semantics. Even more, we introduce a general relation-based framework for defining semantics of incompleteness, show how it can be used to capture many known semantics and to introduce new ones, and describe classes of first-order queries for which naïve evaluation works under such semantics.
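A minimal illustration of naive evaluation as characterized above (our toy encoding; the relations and the query are invented): labeled nulls are treated as ordinary constants during a join, and for certain answers one keeps the null-free results. Because the query is a conjunctive query, hence monotone, this is sound:

```python
# Naive evaluation: run the query as if labeled nulls were ordinary
# constants, then keep the null-free tuples as certain answers.
def is_null(v): return isinstance(v, str) and v.startswith("_")

R = [("alice", "cs"), ("bob", "_d1")]        # Emp(name, dept)
S = [("cs", "turing"), ("_d1", "hopper")]    # Dept(dept, head)

# Conjunctive query Q(name, head) :- Emp(name, d), Dept(d, head),
# evaluated naively: nulls join like constants ("_d1" matches "_d1").
naive = [(n, h) for (n, d) in R for (d2, h) in S if d == d2]
certain = [t for t in naive if not any(is_null(v) for v in t)]
print(naive)    # [('alice', 'turing'), ('bob', 'hopper')]
print(certain)  # both tuples are null-free, hence certain answers
```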
SIAM Journal on Computing, 2014
When finding exact answers to a query over a large database is infeasible, it is natural to approximate the query by a more efficient one that comes from a class with good bounds on the complexity of query evaluation. In this paper we study such approximations for conjunctive queries. These queries are of special importance in databases, and we have a very good understanding of the classes that admit fast query evaluation, such as acyclic, or bounded (hyper)treewidth queries. We define approximations of a given query Q as queries from one of those classes that disagree with Q as little as possible. We concentrate on approximations that are guaranteed to return correct answers. We prove that for the above classes of tractable conjunctive queries, approximations always exist, and are at most polynomial in the size of the original query. This follows from general results we establish that relate closure properties of classes of conjunctive queries to the existence of approximations. We also show that in many cases, the size of approximations is bounded by the size of the query they approximate. We establish a number of results showing how combinatorial properties of queries affect properties of their approximations, study bounds on the number of approximations, as well as the complexity of finding and identifying approximations. The technical toolkit of the paper comes from the theory of graph homomorphisms, as we mainly work with tableaux of queries and characterize approximations via preorders based on the existence of homomorphisms. In particular, most of our results can be also interpreted as approximation or complexity results for directed graphs.
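Since the paper's technical toolkit is homomorphisms between tableaux, a brute-force homomorphism test conveys the flavor (exponential, restricted to one binary relation, and purely illustrative; not the paper's machinery):

```python
# Brute-force tableau homomorphism check. A tableau here is a list of atoms
# over one binary relation; uppercase single-letter terms are variables,
# lowercase terms are constants. h must map atoms of t1 into atoms of t2.
from itertools import product

def homomorphism_exists(t1, t2):
    vars1 = sorted({v for atom in t1 for v in atom if v.isupper()})
    terms2 = sorted({v for atom in t2 for v in atom})
    for imgs in product(terms2, repeat=len(vars1)):
        h = dict(zip(vars1, imgs))
        mapped = {tuple(h.get(v, v) for v in atom) for atom in t1}
        if mapped <= set(map(tuple, t2)):
            return True
    return False

# Q1: E(X,Y), E(Y,X)  maps into  Q2: E(Z,Z)
print(homomorphism_exists([("X", "Y"), ("Y", "X")], [("Z", "Z")]))  # True
```

For Boolean conjunctive queries, a homomorphism from the tableau of Q1 into the tableau of Q2 witnesses that Q2 implies Q1, which is the kind of preorder underlying the paper's notion of approximation.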
1999
Semistructured data occur in situations where information lacks a homogeneous structure and is incomplete. Yet, up to now the incompleteness of information has not been reflected by special features of query languages for semistructured data. Our goal is to investigate the principles of queries that allow for incomplete answers. We do not present, however, a concrete query language.
2018
Consistent query answering is a principled approach for querying inconsistent knowledge bases. It relies on the notion of a repair, that is, a maximal consistent subset of the facts in the knowledge base. One drawback of this approach is that entire facts are deleted to resolve inconsistency, even if they may still contain useful “reliable” information. To overcome this limitation, we propose a new notion of repair allowing values within facts to be updated for restoring consistency. This more fine-grained repair primitive allows us to preserve more information in the knowledge base. We also introduce the notion of a universal repair, which is a compact representation of all repairs. Then, we show that consistent query answering in our framework is intractable (coNP-complete). In light of this result, we develop a polynomial time approximation algorithm for computing a sound (but possibly incomplete) set of consistent query answers.
Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, 2015
In many applications including loosely coupled cloud databases, collaborative editing and network monitoring, data from multiple sources is regularly used for query answering. For reasons such as system failures, insufficient author knowledge or network issues, data may be temporarily unavailable or generally nonexistent. Hence, not all data needed for query answering may be available. In this paper, we propose a natural class of completeness patterns, expressed by selections on database tables, to specify complete parts of database tables. We then show how to adapt the operators of relational algebra so that they manipulate these completeness patterns to compute completeness patterns pertaining to query answers. Our proposed algebra is computationally sound and complete with respect to the information that the patterns provide. We show that stronger completeness patterns can be obtained by considering not only the schema but also the database instance and we extend the algebra to take into account this additional information. We develop novel techniques to efficiently implement the computation of completeness patterns on query answers and demonstrate their scalability on real data.
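As a rough illustration of pattern propagation through one operator (our own much-simplified rule, not the paper's algebra), a completeness pattern can be modeled as a set of attribute constraints, and a selection keeps exactly the patterns consistent with its condition:

```python
# A completeness pattern is a dict of attribute -> value, read as
# "the table is complete for the rows matching the pattern".
def select_pattern(patterns, attr, value):
    """Patterns for sigma_{attr = value}(R), given patterns for R: a pattern
    survives if it is consistent with the selection, and the selection
    condition is added to it."""
    out = []
    for p in patterns:
        if attr not in p or p[attr] == value:
            out.append({**p, attr: value})
    return out

# R is known complete for dept = 'cs' rows.
print(select_pattern([{"dept": "cs"}], "dept", "cs"))    # answer is complete
print(select_pattern([{"dept": "cs"}], "dept", "math"))  # [] : no guarantee
```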
Semistructured data occur in situations where information lacks a homogeneous structure and is incomplete. Yet, up to now the incompleteness of information has not been reflected by special features of query languages. Our goal is to investigate the principles of queries that allow for incomplete answers. We do not present, however, a concrete query language. Queries over classical structured data models contain a number of variables and constraints on these variables. An answer is a binding of the variables by elements of the database such that the constraints are satisfied. In the present paper, we loosen this concept in so far as we allow also answers that are partial, that is, not all variables in the query are bound by such an answer. Partial answers make it necessary to refine the model of query evaluation. The first modification relates to the satisfaction of constraints: under some circumstances we consider constraints involving unbound variables as satisfied. Second, in order to prevent a proliferation of answers, we only accept answers that are maximal in the sense that there are no assignments that bind more variables and satisfy the constraints of the query. Our model of query evaluation consists of two phases, a search phase and a filter phase. Semistructured databases are essentially labeled directed graphs. In the search phase, we use a query graph containing variables to match a maximal portion of the database graph. We investigate three different semantics for query graphs, which give rise to three variants of matching. For each variant, we provide algorithms and complexity results. In the filter phase, the maximal matchings resulting from the search phase are subjected to constraints, which may be weak or strong. Strong constraints require all their variables to be bound, while weak constraints do not. We describe a polynomial algorithm for evaluating a special type of queries with filter constraints, and assess the complexity of evaluating other queries for several kinds of constraints. In the final part, we investigate the containment problem for queries consisting only of search constraints under the different semantics.
Standard databases convey Reiter's closed-world assumption that an atom not in the database is false. This assumption is relaxed in locally closed databases that are sound but only partially complete about their domain. One of the consequences of the weakening of the closed-world assumption is that query answering in locally closed databases is undecidable. In this paper, we develop efficient approximate methods for query answering, based on fixpoint computations, and investigate conditions that assure the optimality of these methods. Our approach to approximate reasoning may be incorporated in different contexts where incompleteness plays a major role and efficient reasoning is imperative.
Standard databases convey Reiter's closed-world assumption that an atom not in the database is false. This assumption is relaxed in locally complete databases that are sound but only partially complete about their domain. One of the consequences of the weakening of the closed-world assumption is that query answering in locally closed databases is not tractable. In this paper we develop efficient approximate methods for query answering, based on fixpoint computations. We present preliminary results for a broad class of locally closed databases in which this method produces complete answers to queries.
2018
Consistent query answering is a principled approach for querying inconsistent knowledge bases. It relies on the central notion of repair, that is, a maximal consistent subset of the facts in the knowledge base. One drawback of this approach is that entire facts are deleted to resolve inconsistency, even if they may still contain useful "reliable" information. To overcome this limitation, we propose an inconsistency-tolerant semantics for query answering based on a new notion of repair, allowing values within facts to be updated for restoring consistency. This more fine-grained repair primitive allows us to preserve more information in the knowledge base. We also introduce the notion of a universal repair, which is a compact representation of all repairs and can be computed in polynomial time. Then, we show that consistent query answering in our framework is intractable (coNP-complete). In light of this result, we develop a polynomial time approximation algorithm for computing a sound (but possibly incomplete) set of consistent query answers.
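A rough sketch of the value-update repair primitive described above (a simplification in the spirit of the universal repair, not the paper's construction; the functional dependency and data are invented): where a dependency is violated, the disagreeing right-hand-side values are replaced by a fresh labeled null instead of deleting the facts.

```python
# Repair by value updates for the FD dept -> head: conflicting head values
# are replaced by a fresh labeled null, preserving the rest of each fact.
from collections import defaultdict

facts = [("cs", "turing"), ("cs", "church"), ("math", "noether")]

def repair_fd(facts):
    groups = defaultdict(list)
    for dept, head in facts:
        groups[dept].append(head)
    out, fresh = [], 0
    for dept, heads in groups.items():
        if len(set(heads)) > 1:                 # FD dept -> head violated
            fresh += 1
            out.append((dept, f"_n{fresh}"))    # keep dept, forget the conflict
        else:
            out.append((dept, heads[0]))
    return out

print(repair_fd(facts))   # [('cs', '_n1'), ('math', 'noether')]
```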
Proceedings of the VLDB Endowment, 2014
A database is called uncertain if two or more tuples of the same relation are allowed to agree on their primary key. Intuitively, such tuples act as alternatives for each other. A repair (or possible world) of such an uncertain database is obtained by selecting a maximal number of tuples without ever selecting two tuples of the same relation that agree on their primary key. For a Boolean query q, the problem CERTAINTY(q) takes as input an uncertain database db and asks whether q evaluates to true on every repair of db. In recent years, the complexity of CERTAINTY(q) has been studied under different restrictions on q. These complexity studies have assumed no restrictions on the uncertain databases that are input to CERTAINTY(q). In practice, however, it may be known that these input databases are partially consistent, in the sense that they satisfy some dependencies (e.g., functional dependencies). In this article, we introduce the problem CERTAINTY(q) in the presence of a set of dependencies satisfied by the input databases.
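To pin down the definition of CERTAINTY(q) given above, here is a brute-force check (exponential, illustration only) for a single relation whose primary key is the first attribute; the query and data are invented:

```python
# Enumerate all repairs of an uncertain table and test whether a Boolean
# query holds in every one of them.
from itertools import groupby, product

table = [("k1", "a"), ("k1", "b"), ("k2", "a")]   # key "k1" has two alternatives

def repairs(table):
    keyed = [list(g) for _, g in groupby(sorted(table), key=lambda t: t[0])]
    return [list(choice) for choice in product(*keyed)]

q = lambda db: any(v == "a" for _, v in db)       # Boolean query: some value is "a"
print(all(q(r) for r in repairs(table)))          # True: q is certain here
```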