Proceedings of the Workshop on Human-In-the-Loop Data Analytics
The importance of data cleaning systems has grown continuously in recent years. Especially for real-time streaming applications, it is crucial to identify and possibly remove anomalies in the data on the fly, before further processing. The main challenge, however, lies in the construction of an appropriate data cleaning pipeline, which is complicated by the dynamic nature of streaming applications. To simplify this process and help data scientists explore and understand the incoming data, we propose an interactive data cleaning system for streaming applications. In this paper, we list requirements for such a system and present our implementation to overcome the stated issues. Our demonstration shows how a data cleaning pipeline can be interactively created, executed, and monitored at runtime. We also present several different tools, such as the automated advisor and the adaptive visualizer, that engage the user in the data cleaning process and help them understand the behavior of the pipeline. CCS Concepts: • Information systems → Data cleaning.
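As a minimal illustration of the kind of on-the-fly anomaly handling this abstract describes (not the authors' system; the window size and threshold below are assumptions), a streaming cleaning step might flag outliers with a rolling z-score before passing data on:

```python
# Minimal sketch, not the demonstrated system: flag outliers on the fly with a
# rolling z-score so they can be dropped or routed for review before processing.
from collections import deque
import math

class StreamingOutlierFilter:
    def __init__(self, window=100, z_threshold=3.0):  # assumed parameters
        self.window = deque(maxlen=window)
        self.z_threshold = z_threshold

    def process(self, value):
        """Return (value, is_anomaly) for one incoming reading."""
        if len(self.window) >= 10:  # wait for some history before judging
            mean = sum(self.window) / len(self.window)
            var = sum((x - mean) ** 2 for x in self.window) / len(self.window)
            std = math.sqrt(var)
            is_anomaly = std > 0 and abs(value - mean) / std > self.z_threshold
        else:
            is_anomaly = False
        self.window.append(value)
        return value, is_anomaly

# Usage: feed readings one at a time, as a stream would deliver them.
f = StreamingOutlierFilter()
for reading in [10, 11, 9, 10, 12, 11, 10, 9, 11, 10, 10, 95, 10]:
    value, anomaly = f.process(reading)
    if anomaly:
        print("flagged for review:", value)
```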
2010
Data cleaning and Extract-Transform-Load processes are usually modeled as graphs of data transformations. These graphs typically involve a large number of data transformations, and must handle large amounts of data. The involvement of the users responsible for executing the corresponding programs over real data is important to tune data transformations and to manually correct data items that cannot be treated automatically.
Lecture Notes in Computer Science, 2011
Data cleaning and ETL processes are usually modeled as graphs of data transformations. The involvement of the users responsible for executing these graphs over real data is important to tune data transformations and to manually correct data items that cannot be treated automatically. In this paper, in order to better support user involvement in data cleaning processes, we equip a data cleaning graph with data quality constraints, which help users identify the points of the graph and the records that need their attention, and with manual data repairs, which represent the way users can provide the feedback required to manually clean some data items. We provide preliminary experimental results that show the significant gains obtained with the use of data cleaning graphs.
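A rough sketch of the constraint idea described above (a hypothetical API of my own, not the paper's data cleaning graphs): quality constraints are attached to points of the graph, and records that violate them are set aside for manual repair.

```python
# Illustrative only: attach quality constraints to nodes of a cleaning graph and
# collect the records that violate them so users can repair those items by hand.
def non_empty_name(record):
    return bool(record.get("name", "").strip())

def valid_year(record):
    return isinstance(record.get("year"), int) and 1900 <= record["year"] <= 2030

cleaning_graph = {
    "normalize": {"constraints": [non_empty_name]},
    "enrich": {"constraints": [valid_year]},
}

def run_node(node_name, records):
    """Apply the node's constraints; return (clean, needs_manual_repair)."""
    constraints = cleaning_graph[node_name]["constraints"]
    clean, for_repair = [], []
    for r in records:
        (clean if all(c(r) for c in constraints) else for_repair).append(r)
    return clean, for_repair

records = [{"name": "Ada", "year": 1998}, {"name": "", "year": 1998}]
clean, for_repair = run_node("normalize", records)
print("needs manual repair:", for_repair)
```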
2001
The problem of data cleaning, which consists of removing inconsistencies and errors from original data sets, is well known in the area of decision support systems and data warehouses. However, for some applications, existing ETL (Extraction Transformation Loading) and data cleaning tools for writing data cleaning programs are insufficient. One important challenge with them is the design of a data flow graph that effectively generates clean data. A more general difficulty is the lack of explanations of cleaning results and of user interaction facilities for tuning a data cleaning program. This paper presents a solution to this problem that enables users to express user interactions declaratively and to tune data cleaning programs.
ArXiv, 2017
Data cleaning refers to the process of detecting and fixing errors in the data. Human involvement is instrumental at several stages of this process, e.g., to identify and repair errors or to validate computed repairs. There is currently a plethora of data cleaning algorithms addressing a wide range of data errors (e.g., detecting duplicates, violations of integrity constraints, and missing values). Many of these algorithms involve a human in the loop; however, the human is usually tightly coupled to the underlying cleaning algorithm. There is currently no end-to-end data cleaning framework that systematically involves humans in the cleaning pipeline regardless of the underlying cleaning algorithms. In this paper, we highlight key challenges that need to be addressed to realize such a framework. We present a design vision and discuss scenarios that motivate the need for such a framework to judiciously assist humans in the cleaning process. Finally, we present directions to implement ...
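One minimal way to picture such an algorithm-agnostic framework (my own sketch with assumed names, not the design proposed in the paper) is a shared queue into which any cleaning tool submits proposed repairs for a human to review:

```python
# Hypothetical sketch: human involvement decoupled from specific cleaning tools.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ProposedRepair:
    source_tool: str   # e.g. a deduplicator or a constraint repairer
    description: str
    approved: bool = False

@dataclass
class HumanReviewQueue:
    items: List[ProposedRepair] = field(default_factory=list)

    def submit(self, repair: ProposedRepair):
        self.items.append(repair)

    def review(self, decide):
        """`decide` stands in for a human judging each proposed repair."""
        for repair in self.items:
            repair.approved = decide(repair)

queue = HumanReviewQueue()
queue.submit(ProposedRepair("dedup-tool", "merge customer 17 and 42"))
queue.submit(ProposedRepair("fd-repair", "set city='Doha' where zip='00974'"))
queue.review(lambda r: r.source_tool == "dedup-tool")  # stand-in for real feedback
print([(r.description, r.approved) for r in queue.items])
```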
ITEGAM- Journal of Engineering and Technology for Industrial Applications (ITEGAM-JETIA), 2020
One of the great challenges in obtaining knowledge from data sources is to ensure consistency and non-duplication of the stored information. Many techniques have been proposed to minimize the cost of this work and to allow data to be analyzed and properly corrected. However, there are still other aspects essential to the success of the data cleaning process that span several technological areas: performance, semantics, and autonomy of the process. Against this backdrop, we developed an automated, configurable data cleaning environment based on training and physical-semantic data similarity, aiming to provide a more efficient and extensible tool for correcting information that covers problems not yet explored, such as the semantics and autonomy of the cleaning process. Among its objectives, the developed work aims to reduce user interaction in the process of analyzing and correcting data inconsistencies and duplications. With a properly calibrated environment, the efficiency is significant, covering approximately 90% of the inconsistencies in the database with no false-positive cases. We also demonstrate approaches that, besides detecting and treating information inconsistencies and duplication of positive cases, address detected false positives and the negative impacts they may have on the data cleaning process, whether manual or automated, which is not yet widely discussed in the literature. The most significant contribution of this work is the developed tool that, without user interaction, is automatically able to analyze and eliminate 90% of the inconsistencies and duplications of information contained in a database, with no occurrence of false positives. The results of the tests proved the effectiveness of all the developed features, relevant to each module of the proposed architecture. In several scenarios the experiments demonstrated the effectiveness of the tool.
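For context, a toy version of similarity-based duplicate detection (my illustration, not the paper's environment; the threshold is an assumed value, and tuning it is exactly where false positives and misses trade off):

```python
# Rough sketch: pairs whose string similarity exceeds a threshold are treated as
# possible duplicates; the threshold governs the false-positive rate.
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

records = ["Acme Corp.", "ACME Corporation", "Globex Ltd", "Acme Corp"]
THRESHOLD = 0.8  # assumed value; calibration is what an automated tool must do

for a, b in combinations(records, 2):
    score = similarity(a, b)
    if score >= THRESHOLD:
        print(f"possible duplicate: {a!r} ~ {b!r} ({score:.2f})")
```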
2021
In this paper, we study the explainability of automated data cleaning pipelines and propose CLeanEX, a solution that can generate explanations for the pipelines automatically selected by an automated cleaning system, given that it can provide its corresponding cleaning pipeline search space. We propose meaningful explanatory features that are used to describe the pipelines and to generate predicate-based explanation rules. We compute quality indicators for these explanations and propose a multi-objective optimization algorithm to select the optimal set of explanations for user-defined objectives. Preliminary experiments show the need for multi-objective optimization for the generation of high-quality explanations that can be either intrinsic to the single selected cleaning pipeline or relative to the other pipelines that were not selected by the automated cleaning system. We also show that CLeanEX is a promising step towards automatically generating insightful explanations, while catering t...
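To make the multi-objective selection step concrete (a toy sketch of the general idea, not CLeanEX itself; the quality indicators and values are invented), candidate explanations can be filtered down to the Pareto-optimal set rather than ranked by a single score:

```python
# Toy multi-objective selection: keep explanations not dominated on any indicator.
candidates = [
    {"rule": "drop rows with null zip",       "coverage": 0.9, "conciseness": 0.4},
    {"rule": "dedup runs before value repair","coverage": 0.6, "conciseness": 0.9},
    {"rule": "outliers removed at step 3",    "coverage": 0.5, "conciseness": 0.5},
]

def dominates(a, b, objectives=("coverage", "conciseness")):
    return (all(a[o] >= b[o] for o in objectives)
            and any(a[o] > b[o] for o in objectives))

pareto = [c for c in candidates
          if not any(dominates(other, c) for other in candidates if other is not c)]
print([c["rule"] for c in pareto])
```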
2010
It is very difficult to over-emphasize the benefits of accurate data. Errors in data are generally the most expensive aspect of data entry, costing users far more than the original data entry itself. Unfortunately, these costs are intangible or difficult to measure. If errors are detected at an early stage, removing them requires little cost. Incorrect and misleading data lead to all sorts of unpleasant and unnecessary expenses. Unfortunately, it is very expensive to correct errors after the data has been processed, particularly when the processed data has been converted into knowledge for decision making. No doubt a stitch in time saves nine, i.e., a timely effort prevents more work at a later stage. Moreover, the time spent processing errors can also have a significant cost. One of the major problems with automated data entry systems is errors. In this paper we discuss many well-known techniques to minimize errors, different cleansing approaches ...
Proceedings of the Workshop on Human-In-the-Loop Data Analytics - HILDA'19
Data cleaning refers to the process of detecting and fixing errors in the data. Human involvement is instrumental at several stages of this process, such as providing rules or validating computed repairs. There is a plethora of data cleaning algorithms addressing a wide range of data errors (e.g., detecting duplicates, violations of integrity constraints, and missing values). Many of these algorithms involve a human in the loop; however, the human is usually tightly coupled to the underlying cleaning algorithm. In a real data cleaning pipeline, several data cleaning operations are performed using different tools. High-level reasoning about these tools, when they are combined to repair the data, has the potential to unlock useful use cases for involving humans in the cleaning process. Additionally, we believe there is an opportunity to benefit from recent advances in active learning methods to minimize the effort humans have to spend verifying data items produced by tools or humans. There is currently no end-to-end data cleaning framework that systematically involves humans in the cleaning pipeline regardless of the underlying cleaning algorithms. In this paper, we present opportunities that such a framework could offer and highlight key challenges that need to be addressed to realize this vision. We present a design vision and discuss scenarios that motivate the need for this framework to judiciously assist humans in the cleaning process.
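The active-learning idea mentioned above can be pictured roughly as follows (an illustrative sketch with assumed confidence scores, not the paper's method): repairs proposed by cleaning tools are ranked by confidence, and humans are asked to verify only the least certain ones:

```python
# Hedged sketch: spend the limited human budget on the least-confident repairs.
proposed_repairs = [
    {"row": 12, "fix": "zip 0213 -> 02139", "confidence": 0.55},
    {"row": 47, "fix": "dedup with row 46", "confidence": 0.97},
    {"row": 88, "fix": "country UK -> GB",  "confidence": 0.72},
]

BUDGET = 2  # how many items a human will verify in this round (assumed)
to_verify = sorted(proposed_repairs, key=lambda r: r["confidence"])[:BUDGET]
auto_applied = [r for r in proposed_repairs if r not in to_verify]

print("ask human about:", [r["row"] for r in to_verify])
print("apply automatically:", [r["row"] for r in auto_applied])
```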
Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, 2014
Data cleaning techniques usually rely on some quality rules to identify violating tuples, and then fix these violations using some repair algorithms. Oftentimes, the rules, which are related to the business logic, can only be defined on some target report generated by transformations over multiple data sources. This creates a situation where the violations detected in the report are decoupled in space and time from the actual source of errors. In addition, applying the repair on the report would need to be repeated whenever the data sources change. Finally, even if repairing the report is possible and affordable, this would be of little help towards identifying and analyzing the actual sources of errors for future prevention of violations at the target. In this paper, we propose a system to address this decoupling. The system takes quality rules defined over the output of a transformation and computes explanations of the errors seen on the output. This is performed both at the target level to describe these errors and at the source level to prescribe actions to solve them. We present scalable techniques to detect, propagate, and explain errors. We also study the effectiveness and efficiency of our techniques using the TPC-H Benchmark for different scenarios and classes of quality rules.
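A much-simplified sketch of this target-to-source decoupling (my illustration, not the paper's system): a rule is checked on a report produced by a transformation, and violations are traced back through recorded lineage to the source rows that caused them:

```python
# Illustrative only: check a quality rule on the target report, then use lineage
# to point at the source rows that contributed to the violation.
orders = [{"id": 1, "cust": "A", "amount": 120},
          {"id": 2, "cust": "A", "amount": -300}]

# Transformation: total per customer, recording which source rows fed each row.
report = {}
for o in orders:
    entry = report.setdefault(o["cust"], {"total": 0, "lineage": []})
    entry["total"] += o["amount"]
    entry["lineage"].append(o["id"])

# Quality rule defined on the target report: per-customer totals must be >= 0.
for cust, entry in report.items():
    if entry["total"] < 0:
        suspects = [o for o in orders
                    if o["id"] in entry["lineage"] and o["amount"] < 0]
        print(f"violation for {cust}: total={entry['total']}, source suspects={suspects}")
```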
International Journal of Computer Applications, 2013
Data cleansing (or data scrubbing) is an activity involving a process of detecting and correcting the errors and inconsistencies in a data warehouse. Poor-quality data, i.e., dirty data present in a data mart, can thus be avoided using various data cleaning strategies, leading to more accurate and hence more reliable decision making. Quality data can only be produced by cleaning the data and pre-processing it prior to loading it into the data warehouse.
International Journal of Knowledge-Based Organizations, 2011
The quality of real-world data being fed into a data warehouse is a major concern today. As the data comes from a variety of sources, it must be checked for errors and anomalies before being loaded into the data warehouse. There may be exact duplicate records or approximate duplicate records in the source data. The presence of incorrect or inconsistent data can significantly distort the results of analyses, often negating the potential benefits of information-driven approaches. This paper addresses issues related to the detection and correction of such duplicate records. It also analyzes data quality and the various factors that degrade it. A brief analysis of existing work is given, pointing out its major limitations, and a new framework is proposed that improves on the existing technique.
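As a small illustration of the exact vs. approximate distinction drawn above (my example, not the proposed framework): exact duplicates collapse to the same normalized key, while approximate duplicates need a fuzzier comparison such as the similarity measure sketched earlier:

```python
# Exact duplicates are caught by a normalization key; approximate ones are not.
def key(record):
    return (record["name"].strip().lower(), record["city"].strip().lower())

rows = [
    {"name": "John Smith ", "city": "Pune"},
    {"name": "john smith",  "city": "pune"},   # exact duplicate after normalization
    {"name": "Jon Smith",   "city": "Pune"},   # approximate duplicate: needs similarity
]

seen, exact_dupes = {}, []
for r in rows:
    k = key(r)
    if k in seen:
        exact_dupes.append((seen[k], r))
    else:
        seen[k] = r
print("exact duplicates:", exact_dupes)
```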
2005
This chapter analyzes the problem of data cleansing and the identification of potential errors in data sets. The differing views of data cleansing are surveyed and reviewed and a brief overview of existing data cleansing tools is given. A general framework of the data cleansing process is presented as well as a set of general methods that can be used to address the problem. The applicable methods include statistical outlier detection, pattern matching, clustering, and Data Mining techniques.
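Of the general methods listed above, pattern matching is perhaps the simplest to picture (illustrative patterns only; real ones depend on the data set): records whose fields do not match an expected format are flagged as potential errors:

```python
# Pattern-matching sketch: flag fields that fail an expected-format check.
import re

PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "year":  re.compile(r"^(19|20)\d{2}$"),
}

def check(record):
    """Return the names of fields whose values look malformed."""
    return [f for f, pat in PATTERNS.items()
            if f in record and not pat.match(str(record[f]))]

print(check({"email": "alice@example.org", "year": "2005"}))  # []
print(check({"email": "not-an-email", "year": "20005"}))      # ['email', 'year']
```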
Data assessment and data cleaning tasks have traditionally been addressed through procedural solutions. Most of the time, those solutions have been applicable to specific problems and domains. In the last few years we have seen the emergence of more generic solutions, and also of declarative and rule-based specifications of the intended solutions of data cleaning processes. In this chapter we review some of those recent developments.
AI and Data Science, 2024
Data cleansing is an important prerequisite for reliable data analysis and machine learning applications, directly impacting the quality of insights and model performance. This paper provides a comprehensive examination of modern data cleaning methodologies, focusing on their practical applications and effectiveness across diverse datasets. We analyze six primary categories of data cleaning techniques: missing data handling, outlier detection, data standardization, duplicate removal, consistency validation, and data type transformations. Our systematic assessment shows that automated data cleaning pipelines, while efficient, require careful configuration based on domain context. Key findings indicate that hybrid approaches, combining statistical techniques with domain-specific rules, achieve superior results compared to standalone strategies, showing a 23% improvement in data quality metrics. We also find that early-stage data validation significantly reduces downstream processing errors by 45%. The implications of this research suggest that organizations should implement iterative data cleaning workflows, incorporating continuous validation and domain expert feedback. Furthermore, our findings emphasize the importance of documenting cleaning decisions to ensure reproducibility and maintain data lineage. This work offers a structured framework for practitioners to select and implement suitable data cleaning techniques based on their particular use cases and data characteristics.
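One way to picture the hybrid approach described above (a hedged sketch with assumed bounds and data, not the paper's pipeline): a statistical default such as median imputation, combined with a domain-specific rule that overrides implausible values:

```python
# Hybrid sketch: statistical imputation plus a domain rule for plausibility.
import statistics

rows = [{"age": 34}, {"age": None}, {"age": 29}, {"age": 431}]  # 431 is implausible

ages = [r["age"] for r in rows if r["age"] is not None]
median_age = statistics.median(ages)

for r in rows:
    if r["age"] is None:
        r["age"] = median_age          # statistical technique: median imputation
    if not (0 <= r["age"] <= 120):     # domain-specific rule (assumed bounds)
        r["age"] = None                # route back for review instead of trusting it
print(rows)
```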
In this information era there is a huge availability of data, but raw data alone is not enough to meet users' requirements. This creates an urgent need for data cleaning, and data cleaning solutions become highly important for data mining users. Normally, data cleaning deals with detecting and eliminating errors and inconsistencies in large data sets. For any real-world data set, doing this task manually is very cumbersome, as it involves a huge amount of human resources and time. As a result, many organizations spend millions of dollars per year to detect data errors. Due to the wide range of possible data inconsistencies and the sheer data volume, data cleaning is considered one of the biggest problems in data warehousing. Data cleaning is typically required when multiple data sources need to be integrated. In this research work, an Enhanced Common Data Cleaning (ECDC) framework has been developed and proposed.
We classify data quality problems that are addressed by data cleaning and provide an overview of the main solution approaches. Data cleaning is especially required when integrating heterogeneous data sources and should be addressed together with schema-related data transformations. In data warehouses, data cleaning is a major part of the so-called ETL process. We also discuss current tool support for data cleaning.
IJCA Proceedings on National Conference on Role of Engineers in National Building, 2014
A data warehouse contains a large volume of data. Data quality is an important issue in data warehousing projects, since many business decision processes are based on the data entered into the data warehouse. Hence, improving data quality is necessary for accurate data. Data may include text errors, quantitative errors, or even duplication. There are several ways to remove such errors and inconsistencies from the data. Data cleaning is a process of detecting and correcting inaccurate data. Different types of algorithms, such as the Improved PNRS algorithm, a Quantitative algorithm, and a Transitive algorithm, are used for the data cleaning process. In this paper an attempt has been made to clean the data in the data warehouse by combining different approaches to data cleaning: text data is cleaned with the Improved PNRS algorithm, quantitative data is cleaned with special rules (the enhanced technique), and, lastly, duplication in the data is removed with a transitive closure algorithm. By applying these algorithms one after another to the data sets, the accuracy level of the dataset is increased.
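To illustrate the transitive-closure step for duplicate removal (a generic union-find sketch, not the paper's Transitive algorithm; the pairwise matches are assumed to come from some matcher): if record A matches B and B matches C, all three are grouped as one entity:

```python
# Transitive closure of pairwise matches via union-find groups duplicate records.
def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(parent, a, b):
    parent[find(parent, a)] = find(parent, b)

records = ["IBM", "I.B.M.", "International Business Machines", "Oracle"]
matches = [(0, 1), (1, 2)]  # assumed output of some pairwise matcher

parent = list(range(len(records)))
for a, b in matches:
    union(parent, a, b)

groups = {}
for i, r in enumerate(records):
    groups.setdefault(find(parent, i), []).append(r)
print(list(groups.values()))
```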
Vicious Print Media, Nairobi, Kenya, 2020
Over time and with technological development, the concept of data science, its processes, aims, and implementation use cases change. According to an article by 365 Data Science, about two decades ago data science mainly involved gathering and cleaning data sets and then applying computational methods to those results. Over time, however, data science has developed into an area that encompasses data processing, statistical modeling, data mining, artificial intelligence, deep learning, and more (365 Data Science, 2018). I.B.M. describes data science as the "technique and understanding of information mining, planning, visualization, interpretation, and maintenance. It is the art of extracting information and observations from organized and unstructured data utilizing algorithms, techniques, and structures" (I.B.M., 2019, n.p.). It then utilizes automation and machine learning to support consumers in forecasting, optimizing performance, and enhancing processes and decision-making. Data is a business requirement for every enterprise, and hence data science has a wide range of applications. Within this paper, we address some of the essential uses of data science and examine how it influences today's industries. Finally, we investigate the various methods of data processing, cleaning, and modeling used to make data meaningful.