data inconsistencies and the sheer data volume, data cleaning is considered to be one of the biggest problems
in data warehousing. During the so-called ETL process (extraction, transformation, loading), illustrated in
Fig. 1, further data transformations deal with schema/data translation and integration, and with filtering and
aggregating data to be stored in the warehouse. As indicated in Fig. 1, all data cleaning is typically
performed in a separate data staging area before loading the transformed data into the warehouse. A large
number of tools of varying functionality are available to support these tasks, but often a significant portion of
the cleaning and transformation work has to be done manually or by low-level programs that are difficult to
write and maintain.
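To make the staging step concrete, the following minimal Python sketch (hypothetical field names and file layout, not taken from any particular tool) normalizes a source-specific date format and filters out records with illegal values before they would be loaded into the warehouse:

    import csv
    from datetime import datetime

    def clean_record(raw):
        """Normalize one extracted record; return None to filter it out."""
        try:
            # Normalize a source-specific date format (dd.mm.yy) to ISO 8601.
            bdate = datetime.strptime(raw["bdate"], "%d.%m.%y").date().isoformat()
        except (KeyError, ValueError):
            return None  # illegal or missing date: set aside for manual inspection
        return {"name": raw.get("name", "").strip(), "bdate": bdate}

    # Staging area: read the raw extract, write only cleaned records.
    with open("extract.csv", newline="") as src, \
         open("staged.csv", "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=["name", "bdate"])
        writer.writeheader()
        for row in csv.DictReader(src):
            cleaned = clean_record(row)
            if cleaned is not None:
                writer.writerow(cleaned)

Real ETL tools wrap this kind of logic in managed, restartable workflows; the point here is only that cleaning happens on staged copies, not on the warehouse itself.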
Federated database systems and web-based information systems face data transformation steps similar to
those of data warehouses. In particular, there is typically a wrapper per data source for extraction and a
mediator for integration [32][31]. So far, these systems provide only limited support for data cleaning,
focusing instead on data transformations for schema translation and schema integration. Data is not
preintegrated as for data warehouses but needs to be extracted from multiple sources, transformed and
combined during query runtime. The corresponding communication and processing delays can be significant,
making it difficult to achieve acceptable response times. The effort needed for data cleaning during
extraction and integration will further increase response times but is mandatory to achieve useful query
results.
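As a toy illustration of this architecture (invented sources and target schema), the Python sketch below uses one wrapper per source for extraction and schema translation, and a mediator that combines results and crudely eliminates duplicates at query runtime:

    # One wrapper per source: extract matching records and translate them
    # into a common target schema {"name", "phone"}.
    def wrapper_a(query_name):
        source_a = [{"Name": "John Smith", "Phone": "123-4567"}]
        for r in source_a:
            if query_name.lower() in r["Name"].lower():
                yield {"name": r["Name"], "phone": r["Phone"]}

    def wrapper_b(query_name):
        source_b = [("J. Smith", "123 4567")]  # tuples, different phone format
        for name, phone in source_b:
            if query_name.split()[-1].lower() in name.lower():
                yield {"name": name, "phone": phone.replace(" ", "-")}

    # Mediator: combine per-source results during query processing and
    # drop records that appear to describe the same entity.
    def mediator(query_name):
        seen = set()
        for wrapper in (wrapper_a, wrapper_b):
            for rec in wrapper(query_name):
                if rec["phone"] not in seen:  # crude duplicate elimination
                    seen.add(rec["phone"])
                    yield rec

    print(list(mediator("Smith")))

Every cleaning step here (format normalization, duplicate elimination) runs inside the query path, which is exactly why it adds to response time.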
A data cleaning approach should satisfy several requirements. First of all, it should detect and remove all
major errors and inconsistencies both in individual data sources and when integrating multiple sources. The
approach should be supported by tools to limit manual inspection and programming effort and be extensible
to easily cover additional sources. Furthermore, data cleaning should not be performed in isolation but
together with schema-related data transformations based on comprehensive metadata. Mapping functions for
data cleaning and other data transformations should be specified in a declarative way and be reusable for
other data sources as well as for query processing. Especially for data warehouses, a workflow infrastructure
should be supported to execute all data transformation steps for multiple sources and large data sets in a
reliable and efficient way.
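One way to meet the declarativity and reusability requirements is to specify mapping functions as data rather than code, so the same rule set can be applied to other sources or during query processing. The following Python sketch shows the idea; the rule format and field names are purely illustrative:

    import re

    # Each rule: (field, predicate, transformation). Rules are plain data,
    # so the same specification is reusable across sources.
    RULES = [
        # dummy phone values (e.g. 9999-999999) are mapped to null
        ("phone", lambda v: re.fullmatch(r"9+-9+", v) is not None, lambda v: None),
        # collapse runs of whitespace in names
        ("name", lambda v: True, lambda v: " ".join(v.split())),
    ]

    def apply_rules(record, rules=RULES):
        out = dict(record)
        for field, applies, transform in rules:
            if field in out and out[field] is not None and applies(out[field]):
                out[field] = transform(out[field])
        return out

    print(apply_rules({"name": "  John   Smith ", "phone": "9999-999999"}))
    # -> {'name': 'John Smith', 'phone': None}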
While a huge body of research deals with schema translation and schema integration, data cleaning has
received only little attention in the research community. A number of authors have focused on the problem of
duplicate identification and elimination, e.g., [11][12][15][19][22][23]. Some research groups concentrate on
general problems that are not limited to data cleaning but are relevant to it, such as special data mining approaches [30][29],
and data transformations based on schema matching [1][21]. More recently, several research efforts propose
and investigate a more comprehensive and uniform treatment of data cleaning covering several
transformation phases, specific operators and their implementation [11][19][25].
In this paper we provide an overview of the problems to be addressed by data cleaning and their solution. In
the next section we present a classification of the problems. Section 3 discusses the main cleaning
approaches used in available tools and the research literature. Section 4 gives an overview of commercial
tools for data cleaning, including ETL tools. Section 5 is the conclusion.
2 Data cleaning problems
This section classifies the major data quality problems to be solved by data cleaning and data transformation.
As we will see, these problems are closely related and should thus be treated in a uniform way. Data
transformations [26] are needed to support any changes in the structure, representation or content of data.
These transformations become necessary in many situations, e.g., to deal with schema evolution, to migrate a
legacy system to a new information system, or to integrate multiple data sources.
As shown in Fig. 2 we roughly distinguish between single-source and multi-source problems and between
schema- and instance-related problems. Schema-level problems of course are also reflected in the instances;
they can be addressed at the schema level by an improved schema design (schema evolution), schema
translation and schema integration. Instance-level problems, on the other hand, refer to errors and
inconsistencies in the actual data contents which are not visible at the schema level. They are the primary
focus of data cleaning. Fig. 2 also indicates some typical problems for the various cases. While not shown in
Fig. 2, the single-source problems also occur (with increased likelihood) in the multi-source case, in addition
to the problems specific to multiple sources.
Data Quality Problems
- Single-Source Problems
  - Schema Level (lack of integrity constraints, poor schema design):
    uniqueness, referential integrity, ...
  - Instance Level (data entry errors):
    misspellings, redundancy/duplicates, contradictory values, ...
- Multi-Source Problems
  - Schema Level (heterogeneous data models and schema designs):
    naming conflicts, structural conflicts, ...
  - Instance Level (overlapping, contradicting and inconsistent data):
    inconsistent aggregating, inconsistent timing, ...

Figure 2. Classification of data quality problems in data sources
2.1 Single-source problems
The data quality of a source largely depends on the degree to which it is governed by schema and integrity
constraints controlling permissible data values. For sources without schema, such as files, there are few
restrictions on what data can be entered and stored, giving rise to a high probability of errors and
inconsistencies. Database systems, on the other hand, enforce restrictions of a specific data model (e.g., the
relational approach requires simple attribute values, referential integrity, etc.) as well as application-specific
integrity constraints. Schema-related data quality problems thus occur because of the lack of appropriate
model-specific or application-specific integrity constraints, e.g., due to data model limitations or poor
schema design, or because only a few integrity constraints were defined to limit the overhead for integrity
control. Instance-specific problems relate to errors and inconsistencies that cannot be prevented at the
schema level (e.g., misspellings).
Scope       | Problem                         | Dirty Data                               | Reasons/Remarks
------------|---------------------------------|------------------------------------------|------------------------------------------------
Attribute   | Illegal values                  | bdate=30.13.70                           | values outside of domain range
Record      | Violated attribute dependencies | age=22, bdate=12.02.70                   | age = (current date - birth date) should hold
Record type | Uniqueness violation            | emp1=(name="John Smith", SSN="123456"),  | uniqueness for SSN (social security number)
            |                                 | emp2=(name="Peter Miller", SSN="123456") | violated
Source      | Referential integrity violation | emp=(name="John Smith", deptno=127)      | referenced department (127) not defined

Table 1. Examples for single-source problems at schema level (violated integrity constraints)
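The Python sketch below (illustrative field names; a DBMS would enforce most of these declaratively where the schema allows) shows how the four violated constraints of Table 1 could be checked programmatically:

    from datetime import date, datetime

    def violations(employees, departments):
        seen_ssn = set()
        for e in employees:
            # Attribute scope: illegal value (bdate outside domain range)
            try:
                bdate = datetime.strptime(e["bdate"], "%d.%m.%y").date()
            except ValueError:
                yield ("illegal value", e)
                continue
            # Record scope: dependency age = (current date - birth date)
            if abs((date.today() - bdate).days // 365 - e["age"]) > 1:
                yield ("violated attribute dependency", e)
            # Record type scope: uniqueness of SSN
            if e["ssn"] in seen_ssn:
                yield ("uniqueness violation", e)
            seen_ssn.add(e["ssn"])
            # Source scope: referential integrity of deptno
            if e["deptno"] not in departments:
                yield ("referential integrity violation", e)

    emps = [
        {"name": "John Smith", "bdate": "30.13.70", "age": 22,
         "ssn": "123456", "deptno": 127},
        {"name": "Peter Miller", "bdate": "12.02.70", "age": 22,
         "ssn": "123456", "deptno": 127},
    ]
    for kind, rec in violations(emps, departments={1, 2, 3}):
        print(kind, rec["name"])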
For both schema- and instance-level problems we can distinguish four different problem scopes: attribute (field),
record, record type and source; examples for the various cases are shown in Tables 1 and 2. Note that
uniqueness constraints specified at the schema level do not prevent duplicated instances, e.g., if information
on the same real world entity is entered twice with different attribute values (see example in Table 2).
Scope       | Problem                         | Dirty Data                                | Reasons/Remarks
------------|---------------------------------|-------------------------------------------|------------------------------------------------
Attribute   | Missing values                  | phone=9999-999999                         | unavailable values during data entry
            |                                 |                                           | (dummy values or null)
            | Misspellings                    | city="Liipzig"                            | usually typos, phonetic errors
            | Cryptic values, abbreviations   | experience="B"; occupation="DB Prog."     |
            | Embedded values                 | name="J. Smith 12.02.70 New York"         | multiple values entered in one attribute
            |                                 |                                           | (e.g., in a free-form field)
            | Misfielded values               | city="Germany"                            |
Record      | Violated attribute dependencies | city="Redmond", zip=77777                 | city and zip code should correspond
Record type | Word transpositions             | name1="J. Smith", name2="Miller P."       | usually in a free-form field
            | Duplicated records              | emp1=(name="John Smith", ...);            | same employee represented twice due to
            |                                 | emp2=(name="J. Smith", ...)               | data entry errors
            | Contradicting records           | emp1=(name="John Smith", bdate=12.02.70); | the same real-world entity is described by
            |                                 | emp2=(name="John Smith", bdate=12.12.70)  | different values
Source      | Wrong references                | emp=(name="John Smith", deptno=17)        | referenced department (17) is defined but wrong

Table 2. Examples for single-source problems at instance level
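As a small illustration of instance-level cleaning, the Python sketch below flags possible duplicates like the emp1/emp2 pair in Table 2; difflib serves here as a simple stand-in for the approximate string matching that dedicated cleaning tools provide:

    from difflib import SequenceMatcher

    def similar(a, b, threshold=0.6):
        """Approximate string match on lowercased values."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

    emps = [
        {"id": 1, "name": "John Smith", "bdate": "12.02.70"},
        {"id": 2, "name": "J. Smith",   "bdate": "12.02.70"},
    ]

    # Pairwise comparison; real approaches block or sort the data first
    # to avoid comparing all O(n^2) record pairs.
    for i, e1 in enumerate(emps):
        for e2 in emps[i + 1:]:
            if e1["bdate"] == e2["bdate"] and similar(e1["name"], e2["name"]):
                print("possible duplicate:", e1["id"], e2["id"])

Note that both records would satisfy a schema-level uniqueness constraint on an id attribute; only an instance-level comparison reveals that they likely describe the same employee.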