2010
Data cleaning and Extract-Transform-Load processes are usually modeled as graphs of data transformations. These graphs typically involve a large number of data transformations, and must handle large amounts of data. The involvement of the users responsible for executing the corresponding programs over real data is important to tune data transformations and to manually correct data items that cannot be treated automatically.
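The graph-of-transformations model described above can be sketched in a few lines. This is a minimal illustration, not an API from any of the papers listed here; the `CleaningGraph` class, step names, and sample records are all invented for the example.

```python
# Minimal sketch of a data cleaning graph: each node is a transformation
# applied to every record, and the list order stands in for the edges of
# a linear transformation graph. Names are illustrative only.

class CleaningGraph:
    def __init__(self):
        self.steps = []  # (name, transformation) pairs, applied in order

    def add_step(self, name, fn):
        self.steps.append((name, fn))
        return self

    def run(self, records):
        for name, fn in self.steps:
            records = [fn(r) for r in records]
        return records

graph = CleaningGraph()
graph.add_step("trim", lambda r: {k: v.strip() if isinstance(v, str) else v
                                  for k, v in r.items()})
graph.add_step("lower_email", lambda r: {**r, "email": r["email"].lower()})

cleaned = graph.run([{"name": "  Ada ", "email": "ADA@EXAMPLE.ORG"}])
```

Real cleaning graphs are DAGs with branching and merging, but the record-by-record application of tuned transformations is the core idea.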
Lecture Notes in Computer Science, 2011
Data cleaning and ETL processes are usually modeled as graphs of data transformations. The involvement of the users responsible for executing these graphs over real data is important to tune data transformations and to manually correct data items that cannot be treated automatically. In this paper, to better support user involvement in data cleaning processes, we equip a data cleaning graph with data quality constraints, which help users identify the points of the graph and the records that need their attention, and with manual data repairs, which represent the way users can provide the feedback required to manually clean some data items. We provide preliminary experimental results that show the significant gains obtained with the use of data cleaning graphs.
2001
The problem of data cleaning, which consists of removing inconsistencies and errors from original data sets, is well known in the area of decision support systems and data warehouses. However, for some applications, existing ETL (Extraction, Transformation, Loading) and data cleaning tools for writing data cleaning programs are insufficient. One important challenge is the design of a data flow graph that effectively generates clean data; another common difficulty is the lack of explanation of cleaning results and of user interaction facilities for tuning a data cleaning program. This paper presents a solution to this problem that enables users to express user interactions declaratively and to tune data cleaning programs.
ArXiv, 2017
Data cleaning refers to the process of detecting and fixing errors in the data. Human involvement is instrumental at several stages of this process, e.g., to identify and repair errors, to validate computed repairs, etc. There is currently a plethora of data cleaning algorithms addressing a wide range of data errors (e.g., detecting duplicates, violations of integrity constraints, missing values, etc.). Many of these algorithms involve a human in the loop; however, the human involvement is usually tightly coupled to the underlying cleaning algorithms. There is currently no end-to-end data cleaning framework that systematically involves humans in the cleaning pipeline regardless of the underlying cleaning algorithms. In this paper, we highlight key challenges that need to be addressed to realize such a framework. We present a design vision and discuss scenarios that motivate the need for such a framework to judiciously assist humans in the cleaning process. Finally, we present directions to implement ...
Proceedings of the Workshop on Human-In-the-Loop Data Analytics
The importance of data cleaning systems has grown continuously in recent years. Especially for real-time streaming applications, it is crucial to identify, and possibly remove, anomalies in the data on the fly before further processing. The main challenge, however, lies in the construction of an appropriate data cleaning pipeline, which is complicated by the dynamic nature of streaming applications. To simplify this process and help data scientists explore and understand the incoming data, we propose an interactive data cleaning system for streaming applications. In this paper, we list requirements for such a system and present our implementation to overcome the stated issues. Our demonstration shows how a data cleaning pipeline can be interactively created, executed, and monitored at runtime. We also present several tools, such as the automated advisor and the adaptive visualizer, that engage the user in the data cleaning process and help them understand the behavior of the pipeline.
Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, 2014
Data cleaning techniques usually rely on some quality rules to identify violating tuples, and then fix these violations using some repair algorithms. Oftentimes, the rules, which are related to the business logic, can only be defined on some target report generated by transformations over multiple data sources. This creates a situation where the violations detected in the report are decoupled in space and time from the actual source of errors. In addition, applying the repair on the report would need to be repeated whenever the data sources change. Finally, even if repairing the report is possible and affordable, this would be of little help towards identifying and analyzing the actual sources of errors for future prevention of violations at the target. In this paper, we propose a system to address this decoupling. The system takes quality rules defined over the output of a transformation and computes explanations of the errors seen on the output. This is performed both at the target level to describe these errors and at the source level to prescribe actions to solve them. We present scalable techniques to detect, propagate, and explain errors. We also study the effectiveness and efficiency of our techniques using the TPC-H Benchmark for different scenarios and classes of quality rules. * Work partially done while at QCRI.
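As a hedged illustration of the kind of quality rule such a system takes as input, the sketch below flags tuples in a report that violate a functional dependency zip → city. The column names and data are invented for the example; the error-propagation and explanation machinery of the paper is not modeled here.

```python
from collections import defaultdict

# Detect tuples violating a functional dependency lhs -> rhs: any lhs
# value mapped to more than one rhs value marks all its tuples as suspect.
def fd_violations(rows, lhs, rhs):
    groups = defaultdict(set)
    for row in rows:
        groups[row[lhs]].add(row[rhs])
    bad_keys = {k for k, vals in groups.items() if len(vals) > 1}
    return [row for row in rows if row[lhs] in bad_keys]

report = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},            # conflicting city for same zip
    {"zip": "94105", "city": "San Francisco"},
]
violating = fd_violations(report, "zip", "city")
```

Detecting such violations on the report is the easy half; the paper's contribution lies in tracing them back through the transformation to the offending source tuples.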
Proceedings of the Workshop on Human-In-the-Loop Data Analytics - HILDA'19
Data cleaning refers to the process of detecting and fixing errors in the data. Human involvement is instrumental at several stages of this process, such as providing rules or validating computed repairs. There is a plethora of data cleaning algorithms addressing a wide range of data errors (e.g., detecting duplicates, violations of integrity constraints, and missing values). Many of these algorithms involve a human in the loop; however, the human involvement is usually tightly coupled to the underlying cleaning algorithms. In a real data cleaning pipeline, several data cleaning operations are performed using different tools. High-level reasoning over these tools, when they are combined to repair the data, has the potential to unlock useful use cases for involving humans in the cleaning process. Additionally, we believe there is an opportunity to benefit from recent advances in active learning methods to minimize the effort humans have to spend verifying data items produced by tools or humans. There is currently no end-to-end data cleaning framework that systematically involves humans in the cleaning pipeline regardless of the underlying cleaning algorithms. In this paper, we present opportunities that this framework could offer and highlight key challenges that need to be addressed to realize this vision. We present a design vision and discuss scenarios that motivate the need for this framework to judiciously assist humans in the cleaning process.
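The active-learning idea above, spending scarce human attention on the repairs the system is least sure about, can be sketched as follows. The scoring scheme (distance of model confidence from 0.5) and all names are assumptions for illustration, not the paper's method.

```python
# Hedged sketch of uncertainty-based selection of repairs for human review:
# rank candidate repairs by how close their confidence is to 0.5 (maximum
# uncertainty) and hand the top `budget` of them to the human.
def select_for_review(candidates, budget):
    """candidates: list of (record_id, proposed_repair, confidence in [0, 1]).
    Returns the ids whose repairs the system is least sure about."""
    ranked = sorted(candidates, key=lambda c: abs(c[2] - 0.5))
    return [c[0] for c in ranked[:budget]]

picks = select_for_review(
    [("r1", "fix_a", 0.95), ("r2", "fix_b", 0.55), ("r3", "fix_c", 0.51)],
    budget=2,
)
```

With a review budget of 2, the confident repair `r1` is applied automatically while the uncertain `r3` and `r2` go to the human.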
We classify data quality problems that are addressed by data cleaning and provide an overview of the main solution approaches. Data cleaning is especially required when integrating heterogeneous data sources and should be addressed together with schema-related data transformations. In data warehouses, data cleaning is a major part of the so-called ETL process. We also discuss current tool support for data cleaning.
ITEGAM- Journal of Engineering and Technology for Industrial Applications (ITEGAM-JETIA), 2020
One of the great challenges in obtaining knowledge from data sources is ensuring consistency and non-duplication of stored information. Many techniques have been proposed to minimize the cost of this work and to allow data to be analyzed and properly corrected. However, other aspects essential to the success of the data cleaning process span several technological areas: performance, semantics, and autonomy of the process. Against this backdrop, we developed an automated, configurable data cleaning environment based on training and physical-semantic data similarity, aiming to provide a more efficient and extensible tool for correcting information that covers problems not yet fully explored, such as the semantics and autonomy of the cleaning process. Among its objectives, the developed work reduces user interaction in the process of analyzing and correcting data inconsistencies and duplications. With a properly calibrated environment, the efficiency is significant, covering approximately 90% of the inconsistencies in the database with no false positives. We also demonstrate approaches that, besides detecting and treating inconsistencies and duplications of positive cases, address detected false positives and the negative impact they may have on the data cleaning process, whether manual or automated, which is not yet widely discussed in the literature. The most significant contribution of this work is the developed tool, which, without user interaction, automatically analyzes and eliminates 90% of the inconsistencies and duplications of information contained in a database, with no occurrence of false positives. The test results proved the effectiveness of all the developed features, relevant to each module of the proposed architecture, and the experiments demonstrated the effectiveness of the tool in several scenarios.
Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073), 2000
Data integration solutions dealing with large amounts of data have been strongly required in the last few years. Besides the traditional data integration problems (e.g., schema integration, local-to-global schema mappings), three additional data problems have to be dealt with: (1) the absence of universal keys across different databases, known as the object identity problem; (2) the existence of keyboard errors in the data; and (3) the presence of inconsistencies in data coming from multiple sources. Dealing with these problems is globally called the data cleaning process. In this work, we propose a framework which offers the fundamental services required by this process: data transformation, duplicate elimination and multi-table matching. These services are implemented using a set of purposely designed macro-operators. Moreover, we propose an SQL extension for specifying each of the macro-operators. One important feature of the framework is the ability to explicitly include human interaction in the process. The main novelty of the work is that the framework permits the following performance optimizations, which are tailored for data cleaning applications: mixed evaluation, neighborhood hash join, decision push-down and short-circuited computation. We measure the benefits of each.
Data assessment and data cleaning tasks have traditionally been addressed through procedural solutions. Most of the time, those solutions have been applicable to specific problems and domains. In the last few years we have seen the emergence of more generic solutions; and also of declarative and rule-based specifications of the intended solutions of data cleaning processes. In this chapter we review some of those recent developments.
International Journal of Knowledge-Based Organizations, 2011
The quality of the real-world data being fed into a data warehouse is a major concern today. As the data comes from a variety of sources, it must be checked for errors and anomalies before being loaded into the data warehouse. There may be exact or approximate duplicate records in the source data. The presence of incorrect or inconsistent data can significantly distort the results of analyses, often negating the potential benefits of information-driven approaches. This paper addresses issues related to the detection and correction of such duplicate records. It also analyzes data quality and the various factors that degrade it. A brief analysis of existing work is discussed, pointing out its major limitations, and a new framework is proposed that improves on the existing technique.
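Approximate duplicate detection of the kind discussed above can be sketched with Python's standard `difflib`; the 0.85 threshold is an arbitrary illustrative choice, and real systems use domain-tuned similarity functions and blocking to avoid comparing every pair.

```python
from difflib import SequenceMatcher

# Two field values are treated as approximate duplicates when their
# similarity ratio exceeds a threshold; exact duplicates score 1.0.
def is_approx_duplicate(a, b, threshold=0.85):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold
```

For example, "Jon Smith" would be flagged as an approximate duplicate of "John Smith", while "Mary Jones" would not.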
International Journal of Computer Applications, 2013
Data cleansing (or data scrubbing) is the activity of detecting and correcting errors and inconsistencies in a data warehouse. Poor-quality data, i.e., dirty data present in a data mart, can thus be avoided using various data cleaning strategies, leading to more accurate and hence more reliable decision making. Quality data can only be produced by cleaning and pre-processing the data prior to loading it into the data warehouse.
Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, 2015
The organizations' demand to integrate several heterogeneous data sources and an ever-increasing volume of data is revealing the presence of quality problems in data. Currently, most of the data cleaning approaches (for detection and correction of data quality problems) are tailored for data sources with the same schema and sharing the same data model (e.g., relational model). On the other hand, these approaches are highly dependent on a domain expert to specify the data cleaning operations. This paper extends a previously proposed data cleaning methodology that reuses cleaning knowledge specified for other data sources. The methodology is further detailed/refined by specifying the requirements that a data cleaning operations vocabulary must satisfy. Ontologies in RDF/OWL are proposed as the data model for an abstract representation of the data schemas, no matter which data model is used (e.g., relational; graph). Existing approaches, methods and techniques that support the implementation of the proposed methodology, in general, and specifically of the data cleaning operations vocabulary are also presented and discussed in this paper.
2021
In this paper, we study the explainability of automated data cleaning pipelines and propose CLeanEX, a solution that can generate explanations for the pipelines automatically selected by an automated cleaning system, given it can provide its corresponding cleaning pipeline search space. We propose meaningful explanatory features that are used to describe the pipelines and generate predicate-based explanation rules. We compute quality indicators for these explanations and propose a multi-objective optimization algorithm to select the optimal set of explanations for user-defined objectives. Preliminary experiments show the need for multi-objective optimization for the generation of high-quality explanations that can be either intrinsic to the single selected cleaning pipeline or relative to the other pipelines that were not selected by the automated cleaning system. We also show that CLeanEX is a promising step towards generating automatically insightful explanations, while catering t...
Ijca Proceedings on National Conference on Role of Engineers in National Building, 2014
A data warehouse contains large volumes of data, and data quality is an important issue in data warehousing projects. Many business decision processes are based on the data entered in the data warehouse, so improving data quality is necessary for accurate data. Data may include text errors, quantitative errors, or even duplication. There are several ways to remove such errors and inconsistencies from the data. Data cleaning is a process of detecting and correcting inaccurate data. Different types of algorithms, such as the Improved PNRS algorithm, the Quantitative algorithm, and the Transitive algorithm, are used for the data cleaning process. In this paper, an attempt has been made to clean the data in the data warehouse by combining different approaches to data cleaning. Text data is cleaned by the Improved PNRS algorithm, quantitative data is cleaned by special rules (an enhanced technique), and, lastly, duplicated data is removed by the transitive closure algorithm. By applying these algorithms one after another on the data sets, the accuracy of the dataset is increased.
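The transitive-closure step mentioned above (if record A matches B and B matches C, all three belong to one duplicate group) is commonly implemented with union-find. The sketch below is a generic illustration of that step under that assumption, not the paper's exact algorithm.

```python
# Group records into duplicate clusters given pairwise match decisions,
# taking the transitive closure with a union-find structure.
def duplicate_clusters(n, pairs):
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

clusters = duplicate_clusters(5, [(0, 1), (1, 2), (3, 4)])
```

Here records 0, 1, and 2 collapse into one cluster even though 0 and 2 were never directly compared.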
2006
There is no magic solution for data cleaning: the user always has to specify the cleaning operations to perform, and a huge number of operations may have to be specified. Yet this is the condition for detecting and correcting data quality problems successfully. Most cleaning operations are generic enough to be applied to different databases; these operations may be limited to databases of the same domain or so general that they are domain independent. The traditional approach to data cleaning is to specify the operations at the database schema level, so several changes are required to reuse a cleaning operation in another database. This paper presents an approach that supports the interoperability of operations among different databases. This is achieved through an ontological level that supports the conceptual specification of cleaning operations. This abstraction level isolates them from the schemas of the databases and allows them to be reused easily.
Dirty data is a serious problem for businesses, leading to incorrect decision making, inefficient daily operations, and ultimately wasting both time and money. A variety of integrity constraints, such as Conditional Functional Dependencies (CFDs), have been studied for data cleaning. Data repairing methods based on these constraints are good at detecting inconsistencies but are limited in how they correct data; worse, they can even introduce new errors. Based on Master Data Management principles, a new class of data quality rules known as Editing Rules (eRs) tells how to fix errors, pointing out which attributes are wrong and what values they should take. However, finding data quality rules is an expensive process that involves intensive manual effort. In this paper, we develop pattern mining techniques for discovering eRs from existing source relations (possibly dirty) with respect to master relations (supposed to be clean and accurate). In this setting, we propose a new semantics of eRs taking...
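As a hedged illustration of how an editing rule repairs data from master data, the sketch below copies a target attribute from a master tuple that agrees with the dirty tuple on the rule's key attributes. The attribute names, data, and exact-match scheme are invented for the example and simplify the formal eR semantics.

```python
# Apply a single editing rule: when a (possibly dirty) source tuple agrees
# with a (clean) master tuple on key_attrs, overwrite fix_attrs with the
# master's values. Names are illustrative only.
def apply_editing_rule(source, master, key_attrs, fix_attrs):
    fixed = []
    for t in source:
        for m in master:
            if all(t[a] == m[a] for a in key_attrs):
                t = {**t, **{a: m[a] for a in fix_attrs}}
                break
        fixed.append(t)
    return fixed

fixed = apply_editing_rule(
    source=[{"zip": "02139", "city": "Bstn"}],          # dirty city value
    master=[{"zip": "02139", "city": "Cambridge"}],     # trusted master data
    key_attrs=["zip"],
    fix_attrs=["city"],
)
```

Unlike a constraint-based repair, which only knows the tuple is inconsistent, the rule prescribes both which attribute is wrong and the value it should take.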