Data quality assessment and data cleaning tasks have traditionally been addressed through procedural solutions. Most of the time, those solutions have been applicable only to specific problems and domains. In the last few years we have seen the emergence of more generic solutions, as well as of declarative, rule-based specifications of the intended results of data cleaning processes. In this chapter we review some of those historical and recent developments.
2005
The exploration of data to extract information or knowledge to support decision making is a critical success factor for organizations in today's society. However, several problems can affect data quality. These problems have a negative effect on the results extracted from the data, affecting their usefulness and correctness. In this context, it is quite important to know and understand these data problems. This paper presents a taxonomy of data quality problems, organizing them by the granularity level at which they occur. A formal definition is presented for each problem included. The taxonomy provides rigorous definitions that carry more information than the textual definitions used in previous work. These definitions are useful for the development of a data quality tool that automatically detects the identified problems.
We classify data quality problems that are addressed by data cleaning and provide an overview of the main solution approaches. Data cleaning is especially required when integrating heterogeneous data sources and should be addressed together with schema-related data transformations. In data warehouses, data cleaning is a major part of the so-called ETL process. We also discuss current tool support for data cleaning.
Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, 2014
Data cleaning techniques usually rely on some quality rules to identify violating tuples, and then fix these violations using some repair algorithms. Oftentimes, the rules, which are related to the business logic, can only be defined on some target report generated by transformations over multiple data sources. This creates a situation where the violations detected in the report are decoupled in space and time from the actual source of errors. In addition, applying the repair on the report would need to be repeated whenever the data sources change. Finally, even if repairing the report is possible and affordable, this would be of little help towards identifying and analyzing the actual sources of errors for future prevention of violations at the target. In this paper, we propose a system to address this decoupling. The system takes quality rules defined over the output of a transformation and computes explanations of the errors seen on the output. This is performed both at the target level to describe these errors and at the source level to prescribe actions to solve them. We present scalable techniques to detect, propagate, and explain errors. We also study the effectiveness and efficiency of our techniques using the TPC-H Benchmark for different scenarios and classes of quality rules.
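The decoupling between target-level violations and source-level errors can be pictured with a minimal sketch (not the authors' system): a functional-dependency-style rule, zip determines city, is declared over a report produced by joining two hypothetical source tables, and each violation found on the report is traced back through a simple lineage record to the source rows that contributed it. All table and column names below are invented for illustration.

```python
# Minimal sketch: rule checking on a derived report plus lineage-based explanation.
from collections import defaultdict

orders = [  # hypothetical source table 1
    {"order_id": 1, "cust": "Ann", "zip": "10001"},
    {"order_id": 2, "cust": "Bob", "zip": "10001"},
]
zips = [    # hypothetical source table 2 (contains the actual error)
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Boston"},
]

# Transformation: join on zip, keeping lineage (indexes of contributing source rows).
report = []
for i, o in enumerate(orders):
    for j, z in enumerate(zips):
        if o["zip"] == z["zip"]:
            report.append({"zip": o["zip"], "city": z["city"],
                           "lineage": {"orders": i, "zips": j}})

# Quality rule on the report: zip -> city must hold.
cities_by_zip = defaultdict(set)
for row in report:
    cities_by_zip[row["zip"]].add(row["city"])

for zip_code, cities in cities_by_zip.items():
    if len(cities) > 1:  # violation detected at the target ...
        culprits = [r["lineage"] for r in report if r["zip"] == zip_code]
        # ... explained at the source level by the rows that produced it
        print(f"zip {zip_code} maps to {cities}; source rows involved: {culprits}")
```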
International Journal of Computer Applications, 2013
Data cleansing (or data scrubbing) is an activity involving a process of detecting and correcting errors and inconsistencies in a data warehouse. Poor quality data, i.e., dirty data present in a data mart, can thus be avoided using various data cleaning strategies, leading to more accurate and hence more reliable decision making. Quality data can only be produced by cleaning the data and pre-processing it prior to loading it into the data warehouse.
Journal of Data and Information Quality
In today's society the exploration of one or more databases to extract information or knowledge to support management is a critical success factor for an organization. However, it is well known that several problems can affect data quality. These problems have a negative effect on the results extracted from the data, influencing their correctness and validity. In this context, it is quite important to understand these data problems both theoretically and in practice. This paper presents a taxonomy of data quality problems derived from real-world databases. The taxonomy organizes the problems at different levels of abstraction. Methods to detect data quality problems, represented as binary trees, are also proposed for each abstraction level. The paper also compares this taxonomy with others already proposed in the literature.
In this paper we discuss the data quality problems that are addressed during the data cleaning phase. Data cleaning is one of the important processes during ETL and is especially required when integrating heterogeneous data sources. This problem should be addressed together with schema-related data transformations. At the end we also discuss current tools that support data cleaning.
Information Systems, 2004
This paper provides a survey of two classes of methods that can be used in determining and improving the quality of individual files or groups of files. The first are edit/imputation methods for maintaining business rules and for imputing for missing data. The second are methods of data cleaning for finding duplicates within files or across files.
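The first class of methods can be illustrated with a small sketch, which is not taken from the survey: a record either satisfies a set of edit rules or has the offending or missing fields imputed from the records that do satisfy them. The field names, the edits, and the median-based imputation are illustrative assumptions only.

```python
# Minimal sketch: edit rules plus naive imputation from the clean records.
from statistics import median

records = [
    {"age": 34, "income": 52000},
    {"age": 29, "income": 47000},
    {"age": -3, "income": None},      # fails the edits
]

edits = [
    ("age",    lambda r: r["age"] is not None and 0 <= r["age"] <= 120),
    ("income", lambda r: r["income"] is not None and r["income"] >= 0),
]

# Records that pass every edit serve as donors for imputation.
clean = [r for r in records if all(ok(r) for _, ok in edits)]

for r in records:
    for field, ok in edits:
        if not ok(r):
            # impute the median of the clean values for the failing field
            r[field] = median(c[field] for c in clean)

print(records)
```

Duplicate detection, the survey's second class of methods, is sketched further below alongside the abstract on duplicate records.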
Data quality (DQ) assessment and improvement in larger information systems would often not be feasible without using suitable "DQ methods", which are algorithms that can be automatically executed by computer systems to detect and/or correct problems in datasets. Currently, these methods are already essential, and they will be of even greater importance as the quantity of data in organisational systems grows. This paper provides a review of existing methods for both DQ assessment and improvement and classifies them according to the DQ problem and problem context. Six gaps have been identified in the classification, where no current DQ methods exist, and these show where new methods are required as a guide for future research and DQ tool development.
2001
The problem of data cleaning, which consists of removing inconsistencies and errors from original data sets, is well known in the area of decision support systems and data warehouses. However, for some applications, existing ETL (Extraction, Transformation, Loading) and data cleaning tools for writing data cleaning programs are insufficient. One important challenge is the design of a data flow graph that effectively generates clean data. A more general difficulty is the lack of explanation of cleaning results and of user interaction facilities for tuning a data cleaning program. This paper presents a solution to this problem by enabling users to express interactions declaratively and to tune data cleaning programs.
Data cleansing is an activity involving a process of detecting and correcting errors and inconsistencies in a data warehouse. It deals with the identification of corrupt and duplicate data inherent in the data sets of a data warehouse in order to enhance the quality of the data. This study investigated research works conducted in the area of data cleansing. A thorough review of these existing works was carried out to determine the goals achieved and the limitations that arose from the approaches taken by the researchers. The identification of errors by most of these researchers has led to the development of several frameworks and systems implemented in the area of data warehousing. Generally, these findings contribute to the emerging empirical evidence of the strategic role data cleansing plays in the growth of organizations, institutions and government agencies, both for data quality and reporting purposes and for gaining competitive advantage by overcoming the presence of dirty data.
Proceedings of the 16th International …, 2011
Data quality (DQ) assessment can be significantly enhanced with the use of the right DQ assessment methods, which provide automated solutions to assess DQ. The range of DQ assessment methods is very broad: from data profiling and semantic profiling to data matching and data validation. This paper gives an overview of current methods for DQ assessment and classifies the DQ assessment methods into an existing taxonomy of DQ problems. Specific examples of the placement of each DQ method in the taxonomy are provided and illustrate why the method is relevant to the particular taxonomy position. The gaps in the taxonomy, where no current DQ methods exist, show where new methods are required and can guide future research and DQ tool development.
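As a concrete illustration of the kinds of assessment methods the overview covers, the sketch below profiles a few columns for null rate, distinct count, and pattern conformance, which is one simple form of data profiling and validation. The column names, sample rows, and regular expressions are assumptions made for the example, not content from the paper.

```python
# Minimal sketch: column profiling with a pattern-conformance validation rule.
import re

rows = [
    {"id": "1", "email": "ann@example.com", "signup": "2011-04-02"},
    {"id": "2", "email": "bob(at)example",  "signup": "02/04/2011"},
    {"id": "3", "email": None,              "signup": "2011-04-05"},
]

patterns = {"email": r"[^@\s]+@[^@\s]+\.[^@\s]+", "signup": r"\d{4}-\d{2}-\d{2}"}

for column in ("id", "email", "signup"):
    values = [r[column] for r in rows]
    non_null = [v for v in values if v is not None]
    profile = {
        "null_rate": 1 - len(non_null) / len(values),
        "distinct": len(set(non_null)),
    }
    if column in patterns:  # pattern conformance as a simple validation check
        profile["conforming"] = sum(
            bool(re.fullmatch(patterns[column], v)) for v in non_null)
    print(column, profile)
```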
International Journal of Business Information Systems, 2016
Data quality has significance to companies, but is an issue that can be challenging to approach and operationalise. This study focuses on data quality from the perspective of operationalisation by analysing the practices of a company that is a world leader in its business. A model is proposed for managing data quality to enable evaluation and operationalisation. The results indicate that data quality is best ensured when organisation specific aspects are taken into account. The model acknowledges the needs of different data domains, particularly those that have master data characteristics. The proposed model can provide a starting point for operationalising data quality assessment and improvement. The consequent appreciation of data quality improves data maintenance processes, IT solutions, data quality and relevant expertise, all of which form the basis for handling the origins of products.
Data errors occur in various ways when data is transferred from one point to another. These errors do not necessarily arise when data is created or inserted; they develop and are transformed as data moves from one process to another along the information chain within the data warehouse infrastructure. The main focus of this study is to conceptualize the data cleansing process from data acquisition to data maintenance. Data cleansing is an activity involving a process of detecting and correcting errors and inconsistencies in a data warehouse. Poor data, or "dirty data", requires cleansing before it can be useful to organizations. Data cleansing therefore deals with the identification of corrupt and duplicate data inherent in the data sets of a data warehouse to enhance the quality of data. The research investigated some existing approaches and frameworks for data cleansing, attempted to address the gaps identified in some of them, and proposes a conceptual framework to overcome the weaknesses identified in those frameworks and approaches. This novel conceptual framework considers the data cleansing process from the point at which data is obtained to the point at which it is maintained, using a periodic automatic cleansing approach.
Proceedings of the 2017 Federated Conference on Computer Science and Information Systems, 2017
The research discusses how to describe data quality and what should be taken into account when developing a universal data quality management solution. The proposed approach is to create quality specifications for each kind of data object and to make them executable. A specification can be executed step by step according to business process descriptions, ensuring the gradual accumulation of data in the database and checking data quality according to the specific use case. The described approach can be applied to check the completeness, accuracy, timeliness and consistency of accumulated data.
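The general idea of an executable quality specification can be sketched as one check per quality dimension, run step by step against a data object. The object fields, thresholds, and concrete checks below are illustrative assumptions, not the specification language proposed in the paper.

```python
# Minimal sketch: an executable quality specification evaluated step by step.
from datetime import date

order = {"id": 17, "total": 120.0, "lines_total": 115.0,
         "customer": None, "last_updated": date(2017, 1, 3)}

spec = [
    ("completeness", lambda o: o["customer"] is not None),
    ("accuracy",     lambda o: o["total"] >= 0),
    ("consistency",  lambda o: abs(o["total"] - o["lines_total"]) < 0.01),
    ("timeliness",   lambda o: (date(2017, 6, 1) - o["last_updated"]).days < 90),
]

for dimension, check in spec:          # execute the specification step by step
    print(dimension, "ok" if check(order) else "violated")
```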
The use of data quality rules, which capture business rules and domain constraints, is central to most data quality processes. Poor data quality often arises when the data and these rules (which are meant to preserve data integrity) become inconsistent. To resolve inconsistencies, organizations often implement specific, sometimes manual, sometimes computer-aided, cleansing routines to fix the errors. This solution necessitates frequent repetition of the data cleaning to resolve inconsistencies continually as the data evolves or grows. It is important to recognize that modern organizations may be as dynamic as their data. The business rules, application domain constraints, and data semantics will evolve. As business policies change, as data is integrated with new sources, and as the underlying data evolves, it becomes necessary to manage, evolve, and repair both the data and the rules. In this work, we present a new data quality paradigm that uses automated support for both data and data quality rule repair and management. Previous techniques have focused mostly on updating or correcting the data. In contrast, our approach looks for clues in the data to understand if the data semantics or rules may have evolved. The approach is a holistic one that is designed to facilitate the continuous curation and maintenance of both data and data quality rules. A unique feature of our approach is that we use data mining to discover trends, contextual information, and data patterns that may yield meaningful insights into how a business rule has evolved. Our approach is designed to consider the very wide (many attribute) entity types or tables that are managed by many organizations. We recognize that due to acquisitions or business evolution, data quality rules may need to be evolved to account for many new features (attributes) of the data. We conduct two case studies using real business datasets that demonstrate the quality and usefulness of our methods in a continuous data quality process that manages and evolves both data and data quality rules. The evaluation provides promising results that show how a business analyst can use our tool to quickly identify errors, and identify if the errors are due to dirty data or to erroneous data quality rules that may need to be evolved. This understanding results in both improved overall data quality, and improved rule quality for better maintenance of new data.
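The intuition behind deciding whether to repair the data or the rule can be sketched very roughly, without claiming to reproduce the authors' algorithm: if only a few tuples violate a rule, the data is probably dirty, whereas a large and systematic violation rate hints that the rule itself may have drifted, for example after an acquisition. The table content, the rule, and the threshold are assumptions for illustration.

```python
# Rough sketch: violation rate as a clue for data repair vs. rule evolution.
employees = (
    [{"country": "US", "currency": "USD"} for _ in range(40)]
    + [{"country": "US", "currency": "CAD"} for _ in range(35)]  # newly acquired unit
)

# Rule under scrutiny: country = 'US' implies currency = 'USD'.
applicable = [e for e in employees if e["country"] == "US"]
violations = [e for e in applicable if e["currency"] != "USD"]
violation_rate = len(violations) / len(applicable)

if violation_rate > 0.2:   # systematic violations: suggest evolving the rule
    print(f"{violation_rate:.0%} of tuples violate the rule; "
          "review the rule (e.g., allow USD or CAD) rather than the data")
else:                      # isolated violations: treat them as dirty data
    print(f"{violation_rate:.0%} violations; repair the data")
```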
International Journal of Knowledge-Based Organizations, 2011
The quality of real-world data being fed into a data warehouse is a major concern today. As the data comes from a variety of sources, it must be checked for errors and anomalies before being loaded into the data warehouse. There may be exact or approximate duplicate records in the source data. The presence of incorrect or inconsistent data can significantly distort the results of analyses, often negating the potential benefits of information-driven approaches. This paper addresses issues related to the detection and correction of such duplicate records. It also analyzes data quality and the various factors that degrade it. A brief analysis of existing work is provided, pointing out its major limitations, and a new framework is proposed that improves on the existing technique.
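One common family of techniques for finding exact and approximate duplicates compares record pairs with a string-similarity measure; the sketch below uses Python's difflib for that purpose and is not necessarily the framework proposed in the paper. The records, compared fields, and the 0.85 threshold are illustrative assumptions.

```python
# Minimal sketch: exact and approximate duplicate detection via string similarity.
from difflib import SequenceMatcher
from itertools import combinations

records = [
    {"id": 1, "name": "John Smith", "city": "Boston"},
    {"id": 2, "name": "John Smith", "city": "Boston"},   # exact duplicate
    {"id": 3, "name": "Jon Smiht",  "city": "Boston"},   # approximate duplicate
    {"id": 4, "name": "Mary Jones", "city": "Denver"},
]

def similarity(a, b):
    """Average string similarity over the compared fields."""
    fields = ("name", "city")
    return sum(SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio()
               for f in fields) / len(fields)

for r1, r2 in combinations(records, 2):
    score = similarity(r1, r2)
    if score == 1.0:
        print(f"exact duplicate: records {r1['id']} and {r2['id']}")
    elif score > 0.85:
        print(f"possible duplicate ({score:.2f}): records {r1['id']} and {r2['id']}")
```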
Handbook of Data Quality, 2013
This handbook is motivated by the presence of diverse communities within the area of data quality management, which have individually contributed a wealth of knowledge on data quality research and practice. The chapter presents a snapshot of these contributions from both research and practice, and highlights the background and rationale for the handbook.
Integrity, Internal Control and Security in Information Systems, 2002
This paper first examines various issues in data quality and provides an overview of current research in the area. It then focuses on research at the MITRE Corporation on using annotations to manage data quality. Next, some emerging directions in data quality are discussed, including managing quality for the semantic web and the relationships between data quality and data mining. Finally, some future directions for data quality are outlined.
2009
Poor quality data may be detected and corrected by performing various quality assurance activities that rely on techniques with differing efficacy and cost. In this paper, we propose a quantitative approach for measuring and comparing the effectiveness of these data quality (DQ) techniques. Our definitions of effectiveness are inspired by measures proposed in Information Retrieval.
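In the spirit of the IR-inspired measures the abstract mentions, one can compare the cells flagged by a DQ technique against a known ground truth of erroneous cells and compute precision, recall, and an F-measure. The cell identifiers below are purely illustrative, and the exact definitions in the paper may differ.

```python
# Minimal sketch: IR-style effectiveness measures for a DQ detection technique.
truly_erroneous = {("row1", "zip"), ("row2", "email"), ("row5", "age")}
flagged_by_technique = {("row1", "zip"), ("row2", "email"), ("row3", "name")}

true_positives = truly_erroneous & flagged_by_technique
precision = len(true_positives) / len(flagged_by_technique)
recall = len(true_positives) / len(truly_erroneous)
f_measure = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} F={f_measure:.2f}")
```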