2012, Proceedings of the 21st ACM international conference on Information and knowledge management - CIKM '12
Data completeness is an important aspect of data quality. We consider a setting where databases can be incomplete in two ways: records may be missing, and records may contain null values. We (i) formalize when the answer set of a query is complete in spite of such incompleteness, and (ii) introduce table completeness statements, by which one can express that certain parts of a database are complete. We then study how to deduce from a set of table completeness statements that a query can be answered completely. Null values as used in SQL are ambiguous: they can indicate either that no attribute value exists or that a value exists but is unknown. We study completeness reasoning for the different interpretations. We show that in the combined case it is necessary to syntactically distinguish between the different kinds of null values, and we present an encoding for doing so in standard SQL databases. With this technique, any SQL DBMS evaluates complete queries correctly with respect to the different meanings that nulls can carry. We study the complexity of completeness reasoning and provide algorithms that in most cases match the worst-case lower bounds.
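The encoding idea described in this abstract can be illustrated with a small sketch. The marker value and schema below are invented for illustration, not the paper's actual encoding: SQL NULL is reserved for "a value exists but is unknown", while a distinguished constant marks "no value exists at all", so that the two meanings stay syntactically distinct in an ordinary SQL database.

```python
import sqlite3

# Illustrative encoding (not necessarily the paper's): SQL NULL means
# "a value exists but is unknown"; the reserved constant NONEXISTENT
# means "no value exists at all".
NONEXISTENT = "@nonexistent"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, phone TEXT)")
conn.executemany("INSERT INTO person VALUES (?, ?)", [
    ("alice", "555-1234"),    # known value
    ("bob", None),            # a phone exists, but it is unknown
    ("carol", NONEXISTENT),   # carol certainly has no phone
])

# People who certainly have no phone: only the explicit marker matches;
# bob's NULL does not, because NULL = '@nonexistent' evaluates to UNKNOWN.
no_phone = [r[0] for r in conn.execute(
    "SELECT name FROM person WHERE phone = ?", (NONEXISTENT,))]

# People with a known phone: exclude both NULLs and the marker.
known = [r[0] for r in conn.execute(
    "SELECT name FROM person WHERE phone IS NOT NULL AND phone <> ?",
    (NONEXISTENT,))]
```

With this separation, a plain SQL engine can answer queries about both kinds of missing information without any change to its evaluation machinery.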
Lecture Notes in Computer Science, 2012
Data completeness is an essential aspect of data quality as in many scenarios it is crucial to guarantee the completeness of query answers. Data might be incomplete in two ways: records may be missing as a whole, or attribute values of a record may be absent, indicated by a null. We extend previous work by two of the authors [10] that dealt only with the first aspect, to cover both missing records and missing attribute values. To this end, we refine the formalization of incomplete databases and identify the important special case where values of key attributes are always known. We show that in the presence of nulls, completeness of queries can be defined in several ways. We also generalize a previous approach stating completeness of parts of a database, using so-called table completeness statements. With this formalization in place, we define the main inferences for completeness reasoning over incomplete databases and present first results.
ACM SIGMOD Record, 1988
Reiter has proposed an extended relational theory to formulate relational databases with null values and presented a query evaluation algorithm for such databases. However, due to the indefinite information introduced by null values, Reiter's algorithm is sound but not complete. In this paper, we first propose an extended relation to represent indefinite information in relational databases. Then, we define an extended relational algebra for extended relations. Based on Reiter's extended relational theory, our extended relations, and the extended relational algebra, we present a sound and complete query evaluation algorithm for relational databases with null values.
The assumption that a database includes a representation of every occurrence in the real-world environment that it models (the Closed World Assumption) is frequently unrealistic, because it is always made on the database as a whole. This paper introduces a new type of database information, called completeness information, to describe the subsets of the database for which this assumption is correct. With completeness information it is possible to determine whether each answer to a user query is complete, or whether any subsets of it are complete. To users, answers which are accompanied by a statement about their completeness are more meaningful. First, the principles of completeness information are defined formally, using an abstract data model. Then, specific methods are described for implementing completeness information in the relational model. With these methods, each relational algebra query can be accompanied with an instantaneous verdict on its completeness (or on the completeness of some of its subsets).
2010
The theoretical study of the relational model of data is ongoing and highly developed. Yet the vast majority of real databases include incomplete data, and the incomplete data is widely modelled using special flags called nulls. As noted many times by Date and others, the inclusion of nulls is not compatible with the relational model and invalidates many of the theoretical results, as well as requiring a three-valued logic for query support. In category-theoretic applications to computer science, partial functions are frequently modelled by using a special-value approach (the partial map classifier), or by explicit reference to the domain-of-definition subobject. In an earlier edition of the CATS conference, the first author and his colleague Rosebrugh proved a Morita equivalence theorem showing that for database modelling the two approaches are equivalent, provided the domain-of-definition subobject is complemented. In this paper we study the uncomplemented domain-of-definition approach...
Information Processing Letters, 1986
When considering using databases to represent incomplete information, the relationship between two facts where one may imply the other needs to be addressed. In relational databases, this question becomes whether null completion is assumed. That is, does a (possibly partially defined) tuple imply the existence of tuples that are 'less informative' than the original tuple. We show that no relational algebra that assumes equivalence under null completion can include set-theoretic operators that are compatible with ordinary set theory. Thus, the approach of x-relations is incompatible with the axioms of a Boolean algebra.
Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, 2015
In many applications including loosely coupled cloud databases, collaborative editing and network monitoring, data from multiple sources is regularly used for query answering. For reasons such as system failures, insufficient author knowledge or network issues, data may be temporarily unavailable or generally nonexistent. Hence, not all data needed for query answering may be available. In this paper, we propose a natural class of completeness patterns, expressed by selections on database tables, to specify complete parts of database tables. We then show how to adapt the operators of relational algebra so that they manipulate these completeness patterns to compute completeness patterns pertaining to query answers. Our proposed algebra is computationally sound and complete with respect to the information that the patterns provide. We show that stronger completeness patterns can be obtained by considering not only the schema but also the database instance and we extend the algebra to take into account this additional information. We develop novel techniques to efficiently implement the computation of completeness patterns on query answers and demonstrate their scalability on real data.
Information Processing & Management, 1988
This article discusses a query processor to deal with incomplete information in a database. We suggest using the relaxed database, which is an abstraction from the original database, as a basis for a front-end query processor. The purposes of the relaxed database are twofold: first, to restrict the number of the objects to be processed in a query, and second, to aid the interpretation of a query.
ACM Transactions on Database Systems, 2014
The term naïve evaluation refers to evaluating queries over incomplete databases as if nulls were usual data values, i.e., to using the standard database query evaluation engine. Since the semantics of query answering over incomplete databases is that of certain answers, we would like to know when naïve evaluation computes them: i.e., when certain answers can be found without inventing new specialized algorithms. For relational databases it is well known that unions of conjunctive queries possess this desirable property, and results on preservation of formulae under homomorphisms tell us that within relational calculus, this class cannot be extended under the open-world assumption. Our goal here is twofold. First, we develop a general framework that allows us to determine, for a given semantics of incompleteness, classes of queries for which naïve evaluation computes certain answers. Second, we apply this approach to a variety of semantics, showing that for many classes of queries beyond unions of conjunctive queries, naïve evaluation makes perfect sense under assumptions different from open-world. Our key observations are: (1) naïve evaluation is equivalent to monotonicity of queries with respect to a semantics-induced ordering, and (2) for most reasonable semantics of incompleteness, such monotonicity is captured by preservation under various types of homomorphisms. Using these results we find classes of queries for which naïve evaluation works, e.g., positive first-order formulae for the closed-world semantics. Even more, we introduce a general relation-based framework for defining semantics of incompleteness, show how it can be used to capture many known semantics and to introduce new ones, and describe classes of first-order queries for which naïve evaluation works under such semantics.
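The naive-evaluation idea for unions of conjunctive queries can be sketched as follows (an illustrative Python sketch, not the paper's formalism): treat each labeled null as just another constant, evaluate the query with the ordinary engine, and keep only the null-free answers. For unions of conjunctive queries this is known to yield exactly the certain answers.

```python
class Null:
    """A labeled null: a distinct placeholder value, equal only to itself."""
    def __init__(self, label):
        self.label = label
    def __repr__(self):
        return f"null{self.label}"

n1 = Null(1)

# Incomplete instance: the second attribute of one R-tuple and the
# first attribute of one S-tuple are the same unknown value n1.
R = {("a", "b"), ("c", n1)}
S = {("b", "d"), (n1, "e")}

# Conjunctive query Q(x, z) :- R(x, y), S(y, z), evaluated naively,
# i.e. nulls participate in the join like ordinary constants.
naive = {(x, z) for (x, y) in R for (y2, z) in S if y == y2}

# Certain answers: naive answers that contain no nulls. Here ("c", "e")
# is certain because the two occurrences of n1 denote the same value
# under every interpretation, so the join holds in every completion.
certain = {t for t in naive if not any(isinstance(v, Null) for v in t)}
```

An answer still containing a null (e.g. projecting R directly would yield ("c", null1)) would be filtered out in the last step, since it is not certain as a concrete tuple.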
Database TheoryICDT 2007, 2006
We study containment of conjunctive queries that are evaluated over databases that may contain tuples with null values. We assume the semantics of SQL for single-block queries with a SELECT DISTINCT clause. This problem ("null containment" for short) is different from containment over databases without null values and sometimes more difficult. We show that null containment for boolean conjunctive queries is NP-complete, while it is Π₂ᵖ-complete for queries with distinguished variables. However, if no relation symbol is allowed to appear more than twice, then null containment is polynomial, as it is for databases without nulls. If we add a unary test predicate IS NULL, as it is available in SQL, then containment becomes Π₂ᵖ-hard for boolean queries, while it remains in Π₂ᵖ for arbitrary queries.
Proceedings of the VLDB Endowment, 2011
I am also thankful to Zeno Moriggl and Martin Prosch from the school IT department of the province of South Tyrol who initiated the research collaboration that led to my thesis and who invested their time to give me an understanding of their practical problems. Many thanks go to Franz Baader, who immediately agreed to take supervision from Dresden and made it possible for me to write my thesis in Bozen. Thanks to all people from the KRDB group in Bozen who welcomed me in their group and provided a friendly and productive atmosphere for working. Without my teacher and organizer of the Erasmus program, Uwe Petersohn, I might have never come to Bozen, thank you. Finally, thank you to my family for everything.
arXiv (Cornell University), 2023
To answer database queries over incomplete data, the gold standard is finding certain answers: those that are true regardless of how the incomplete data is interpreted. Such answers can be found efficiently for conjunctive queries and their unions, even in the presence of constraints. With negation added, however, the problem becomes intractable. We concentrate on the complexity of certain answers under constraints, and on efficiently answering queries outside the usual classes of (unions of) conjunctive queries by means of rewriting as Datalog and first-order queries. We first notice that there are three different ways in which query answering can be cast as a decision problem. We complete the existing picture and provide precise complexity bounds on all versions of the decision problem, for certain and best answers. We then study a well-behaved class of queries that extends unions of conjunctive queries with a mild form of negation. We show that for them, certain answers can be expressed in Datalog with negation, even in the presence of functional dependencies, thus making them tractable in data complexity. We show that in general Datalog cannot be replaced by first-order logic, but without constraints such a rewriting can be done in first-order. The paper is under consideration in Theory and Practice of Logic Programming (TPLP).
Proceedings of the 22nd International Database Engineering & Applications Symposium, 2018
Incomplete information arises in many database applications, such as data integration, data exchange, inconsistency management, data cleaning, ontological reasoning, and many others. A principled way of answering queries over incomplete databases is to compute certain answers, which are query answers that can be obtained from every complete database represented by an incomplete one. For databases containing (labeled) nulls, certain answers to positive queries can be easily computed in polynomial time, but for more general queries with negation the problem becomes coNP-hard. To make query answering feasible in practice, one might resort to SQL's evaluation, but unfortunately, the way SQL behaves in the presence of nulls may result in wrong answers. Thus, on the one hand, SQL's evaluation is efficient but flawed; on the other hand, certain answers are a principled semantics but with high complexity. To deal with this issue, recent research has focused on developing polynomial time ...
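The unsoundness of SQL's evaluation under certain-answer semantics can be seen in a small, standard example (an illustration of the phenomenon, not an example taken from this paper): a difference query returns a tuple that is not a certain answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE r (a INTEGER);
    CREATE TABLE s (b INTEGER);
    INSERT INTO r VALUES (1);
    INSERT INTO s VALUES (NULL);   -- some unknown value
""")

# The difference query R \ S. If the NULL in s happens to stand for 1,
# the true difference is empty, so 1 is NOT a certain answer. SQL's
# evaluation nevertheless returns it, because the compound EXCEPT
# treats NULL as distinct from 1:
rows = conn.execute("SELECT a FROM r EXCEPT SELECT b FROM s").fetchall()
# rows == [(1,)], a possible but not certain answer
```

This is precisely the sense in which SQL's evaluation is "efficient but flawed": it may return answers that fail to hold in some completions of the database.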
Proceedings of the VLDB Endowment, 2013
We present a system that computes, for a query that may be incomplete, complete approximations from above and from below. We assume a setting where queries are posed over a partially complete database, that is, a database that is generally incomplete but is known to contain complete information about specific aspects of its application domain. Which parts are complete is described by a set of so-called table completeness statements. Previous work led to a theoretical framework and an implementation that allowed one to determine whether, in such a scenario, a given conjunctive query is guaranteed to return a complete set of answers or not. With the present demonstrator we show how to reformulate the original query in such a way that answers are guaranteed to be complete. If there exists a more general complete query, there is a unique most specific one, which we find. If there exists a more specific complete query, there may even be infinitely many. In this case, we find the least specific specializations whose size is bounded by a threshold provided by the user. Generalizations are computed by a fixpoint iteration, employing an answer set programming engine. Specializations are found leveraging unification from logic programming.
2021
In this paper we address the problem of handling inconsistencies in tables with missing values (also called nulls) and functional dependencies. Although the traditional view is that table instances must respect all functional dependencies imposed on them, it is nevertheless relevant to develop theories about how to handle instances that violate some dependencies. Regarding missing values, we make no assumptions on their existence: a missing value exists only if it is inferred from the functional dependencies of the table. We propose a formal framework in which each tuple of a table is associated with a truth value among the following: true, false, inconsistent or unknown; and we show that our framework can be used to study important problems such as consistent query answering, table merging, and data quality measures, to mention just a few. In this paper, however, we focus mainly on consistent query answering, a problem that has received considerable attention during the last decade...
arXiv (Cornell University), 2023
In this paper, we study consistent query answering in tables with nulls and functional dependencies. Given such a table T , we consider the set T of all tuples that can be built up from constants appearing in T ; and we use set theoretic semantics for tuples and functional dependencies to characterize the tuples of T in two orthogonal ways: first as true or false tuples; and then as consistent or inconsistent tuples. Queries are issued against T and evaluated in T . In this setting, we consider a query Q: select X from T where Condition over T and define its consistent answer to be the set of tuples x in T such that: (a) x is a true and consistent tuple with schema X and (b) there exists a true super-tuple t of x in T satisfying the condition. We show that, depending on the 'status' that the super-tuple t has in T , there are different types of consistent answer to Q. The main contributions of the paper are: (a) a novel approach to consistent query answering not using table repairs; (b) polynomial algorithms for computing the sets of true/false tuples and the sets of consistent/inconsistent tuples of T ; (c) polynomial algorithms in the size of T for computing different types of consistent answer for both conjunctive and disjunctive queries; and (d) a detailed discussion of the differences between our approach and the approaches using table repairs.
Information Systems, 2019
Certain answers are a widely accepted semantics of query answering over incomplete databases. As their computation is a coNP-hard problem, recent research has focused on developing (polynomial time) evaluation algorithms with correctness guarantees, that is, techniques computing a sound but possibly incomplete set of certain answers. The aim is to make the computation of certain answers feasible in practice, settling for under-approximations. In this paper, we present novel evaluation algorithms with correctness guarantees, which provide better approximations than current techniques, while retaining polynomial time data complexity. The central tools of our approach are conditional tables and the conditional evaluation of queries. We propose different strategies to evaluate conditions, leading to different approximation algorithms-more accurate evaluation strategies have higher running times, but they pay off with more certain answers being returned. Thus, our approach offers a suite of approximation algorithms enabling users to choose the technique that best meets their needs in terms of balance between efficiency and quality of the results.
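The central tool mentioned here, conditional tables, can be sketched roughly as follows (illustrative data structures, not the paper's implementation): each tuple carries a condition over labeled nulls, and evaluating the conditions separates certain answers from merely possible ones.

```python
# A minimal sketch of a conditional table (c-table): each tuple is
# paired with a condition over a labeled null x; a tuple is certain if
# its condition holds under every valuation of x, and possible if it
# holds under at least one. Names and the finite domain are assumptions
# made for the sake of the example.
ctable = [
    (("p",), lambda x: True),        # unconditionally present
    (("q",), lambda x: x == 1),      # present only if the null x is 1
]
domain = [1, 2]                      # assumed possible values of x

# Certain tuples: the condition is true under every valuation.
certain = [t for (t, cond) in ctable if all(cond(x) for x in domain)]

# Possible tuples: the condition is true under some valuation.
possible = [t for (t, cond) in ctable if any(cond(x) for x in domain)]
```

Evaluation strategies then differ in how precisely they reason about the conditions: cheaper strategies may fail to recognize that a condition is always true, returning fewer certain answers, which is the efficiency/accuracy trade-off the abstract describes.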
Semistructured data occur in situations where information lacks a homogeneous structure and is incomplete. Yet, up to now the incompleteness of information has not been reflected by special features of query languages. Our goal is to investigate the principles of queries that allow for incomplete answers. We do not present, however, a concrete query language. Queries over classical structured data models contain a number of variables and constraints on these variables. An answer is a binding of the variables by elements of the database such that the constraints are satisfied. In the present paper, we loosen this concept in so far as we allow also answers that are partial, that is, not all variables in the query are bound by such an answer. Partial answers make it necessary to refine the model of query evaluation. The first modification relates to the satisfaction of constraints: under some circumstances we consider constraints involving unbound variables as satisfied. Second, in order to prevent a proliferation of answers, we only accept answers that are maximal in the sense that there are no assignments that bind more variables and satisfy the constraints of the query. Our model of query evaluation consists of two phases, a search phase and a filter phase. Semistructured databases are essentially labeled directed graphs. In the search phase, we use a query graph containing variables to match a maximal portion of the database graph. We investigate three different semantics for query graphs, which give rise to three variants of matching. For each variant, we provide algorithms and complexity results. In the filter phase, the maximal matchings resulting from the search phase are subjected to constraints, which may be weak or strong. Strong constraints require all their variables to be bound, while weak constraints do not.
We describe a polynomial algorithm for evaluating a special type of queries with filter constraints, and assess the complexity of evaluating other queries for several kinds of constraints. In the final part, we investigate the containment problem for queries consisting only of search constraints under the different semantics.
2007
MayBMS [4, 1, 3, 2] is a data management system for incomplete information developed at Saarland University. Its main features are a simple and compact representation system for incomplete information and a language called I-SQL with explicit operations for handling uncertainty. MayBMS is currently an extension of PostgreSQL; it manages both complete and incomplete data and evaluates I-SQL queries.
1999
Semistructured data occur in situations where information lacks a homogeneous structure and is incomplete. Yet, up to now the incompleteness of information has not been reflected by special features of query languages for semistructured data. Our goal is to investigate the principles of queries that allow for incomplete answers. We do not present, however, a concrete query language.
Data Mining VII: Data, Text and Web Mining and their Business Applications
The concept and semantics of null values in relational databases have been discussed widely since the introduction of the relational data model in the late 1960s. With the introduction of highly mobile, distributed databases, in order to preserve the accepted soundness and completeness criteria, the semantics of the null value needs to expand to reflect a localised lack of information that may not be apparent for the global database. This paper discusses an extension to the notion of nulls to include the semantics of 'local' nulls. The paper introduces local nulls in terms of amendments to the relational algebra and examines their impact on query languages.

1 Previous research and motivation

Much of the research on the semantics of null values in relational databases dates back to the 1970s and 1980s [1-7]. The two definitions of nulls as given by Codd are missing and applicable, and missing and inapplicable [1]; Zaniolo [4] later proposed a third definition as, essentially, a lack of knowledge about the attribute's applicability, or no information. To handle null values, various logical approaches have been developed. For example, the commonly used three-valued logic includes true, false (often by virtue of a value's absence, q.v. the closed-world assumption [8]), and a maybe value which indicates that the result may be true [9, 10]. A four-valued logic has also been proposed which includes an additional truth value representing the outcome of evaluating expressions that have inapplicable values [11, 12]. Approaches to accommodating null values in practical systems include the work of Motro [13], who uses the idea of conceptual closeness to fill the vacancies represented by a null value, and Roth et al., who aim to include nulls in NF² databases [5]. Null values have also been studied in relation to schema evolution and integration [14, 15].
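The three-valued logic mentioned above can be sketched with Kleene-style truth tables (an illustrative sketch; the cited papers differ in how they define and use the "maybe" value):

```python
# Kleene-style three-valued logic: TRUE, FALSE, and MAYBE (unknown).
# A comparison involving a null yields MAYBE, which then propagates
# through the connectives as defined below.
T, F, U = "true", "false", "maybe"

def and3(p, q):
    # FALSE dominates conjunction; both must be TRUE for TRUE.
    if p == F or q == F:
        return F
    if p == T and q == T:
        return T
    return U

def or3(p, q):
    # TRUE dominates disjunction; both must be FALSE for FALSE.
    if p == T or q == T:
        return T
    if p == F and q == F:
        return F
    return U

def not3(p):
    # Negation leaves MAYBE unchanged.
    return {T: F, F: T, U: U}[p]

# Unlike in two-valued logic, "p OR NOT p" is not a tautology:
# with p = MAYBE the whole expression is still MAYBE.
assert or3(U, not3(U)) == U
```

The four-valued logics cited in the text extend this picture with one further truth value for expressions over inapplicable nulls, but the propagation idea is the same.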