data inconsistencies and the sheer data volume, data cleaning is considered to be one of the biggest problems
in data warehousing. During the so-called ETL process (extraction, transformation, loading), illustrated in
Fig. 1, further data transformations deal with schema/data translation and integration, and with filtering and
aggregating data to be stored in the warehouse. As indicated in Fig. 1, all data cleaning is typically
performed in a separate data staging area before loading the transformed data into the warehouse. A large
number of tools of varying functionality are available to support these tasks, but often a significant portion of
the cleaning and transformation work has to be done manually or by low-level programs that are difficult to
write and maintain.
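To make the staging step concrete, the following minimal Python sketch (hypothetical field names and file layout, not taken from any particular tool) normalizes a source-specific date format and filters out records with illegal values before they would be loaded into the warehouse:

    import csv
    from datetime import datetime

    def clean_record(raw):
        """Normalize one extracted record; return None to filter it out."""
        try:
            # Normalize a source-specific date format (dd.mm.yy) to ISO 8601.
            bdate = datetime.strptime(raw["bdate"], "%d.%m.%y").date().isoformat()
        except (KeyError, ValueError):
            return None  # illegal or missing date: set aside for manual inspection
        return {"name": raw.get("name", "").strip(), "bdate": bdate}

    # Staging area: read the raw extract, write only cleaned records.
    with open("extract.csv", newline="") as src, \
         open("staged.csv", "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=["name", "bdate"])
        writer.writeheader()
        for row in csv.DictReader(src):
            cleaned = clean_record(row)
            if cleaned is not None:
                writer.writerow(cleaned)

Real ETL tools wrap this kind of logic in managed, restartable workflows; the point here is only that cleaning happens on staged copies, not on the warehouse itself.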
Federated database systems and web-based information systems face data transformation steps similar to
those of data warehouses. In particular, there is typically a wrapper per data source for extraction and a
mediator for integration [32][31]. So far, these systems provide only limited support for data cleaning,
focusing instead on data transformations for schema translation and schema integration. Data is not
preintegrated as for data warehouses but needs to be extracted from multiple sources, transformed and
combined during query runtime. The corresponding communication and processing delays can be significant,
making it difficult to achieve acceptable response times. The effort needed for data cleaning during
extraction and integration will further increase response times but is mandatory to achieve useful query
results.
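As a toy illustration of this architecture (invented sources and target schema), the Python sketch below uses one wrapper per source for extraction and schema translation, and a mediator that combines results and crudely eliminates duplicates at query runtime:

    # One wrapper per source: extract matching records and translate them
    # into a common target schema {"name", "phone"}.
    def wrapper_a(query_name):
        source_a = [{"Name": "John Smith", "Phone": "123-4567"}]
        for r in source_a:
            if query_name.lower() in r["Name"].lower():
                yield {"name": r["Name"], "phone": r["Phone"]}

    def wrapper_b(query_name):
        source_b = [("J. Smith", "123 4567")]  # tuples, different phone format
        for name, phone in source_b:
            if query_name.split()[-1].lower() in name.lower():
                yield {"name": name, "phone": phone.replace(" ", "-")}

    # Mediator: combine per-source results during query processing and
    # drop records that appear to describe the same entity.
    def mediator(query_name):
        seen = set()
        for wrapper in (wrapper_a, wrapper_b):
            for rec in wrapper(query_name):
                if rec["phone"] not in seen:  # crude duplicate elimination
                    seen.add(rec["phone"])
                    yield rec

    print(list(mediator("Smith")))

Every cleaning step here (format normalization, duplicate elimination) runs inside the query path, which is exactly why it adds to response time.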
A data cleaning approach should satisfy several requirements. First of all, it should detect and remove all
major errors and inconsistencies both in individual data sources and when integrating multiple sources. The
approach should be supported by tools to limit manual inspection and programming effort and be extensible
to easily cover additional sources. Furthermore, data cleaning should not be performed in isolation but
together with schema-related data transformations based on comprehensive metadata. Mapping functions for
data cleaning and other data transformations should be specified in a declarative way and be reusable for
other data sources as well as for query processing. Especially for data warehouses, a workflow infrastructure
should be supported to execute all data transformation steps for multiple sources and large data sets in a
reliable and efficient way.
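One way to meet the declarativity and reusability requirements is to specify mapping functions as data rather than code, so the same rule set can be applied to other sources or during query processing. The following Python sketch shows the idea; the rule format and field names are purely illustrative:

    import re

    # Each rule: (field, predicate, transformation). Rules are plain data,
    # so the same specification is reusable across sources.
    RULES = [
        # dummy phone values (e.g. 9999-999999) are mapped to null
        ("phone", lambda v: re.fullmatch(r"9+-9+", v) is not None, lambda v: None),
        # collapse runs of whitespace in names
        ("name", lambda v: True, lambda v: " ".join(v.split())),
    ]

    def apply_rules(record, rules=RULES):
        out = dict(record)
        for field, applies, transform in rules:
            if field in out and out[field] is not None and applies(out[field]):
                out[field] = transform(out[field])
        return out

    print(apply_rules({"name": "  John   Smith ", "phone": "9999-999999"}))
    # -> {'name': 'John Smith', 'phone': None}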
While a huge body of research deals with schema translation and schema integration, data cleaning has
received only little attention in the research community. A number of authors have focused on the problem of
duplicate identification and elimination, e.g., [11][12][15][19][22][23]. Some research groups concentrate on
general problems that are not limited to data cleaning but are relevant to it, such as special data mining approaches [30][29],
and data transformations based on schema matching [1][21]. More recently, several research efforts propose
and investigate a more comprehensive and uniform treatment of data cleaning covering several
transformation phases, specific operators and their implementation [11][19][25].
In this paper we provide an overview of the problems to be addressed by data cleaning and their solution. In
the next section we present a classification of the problems. Section 3 discusses the main cleaning
approaches used in available tools and the research literature. Section 4 gives an overview of commercial
tools for data cleaning, including ETL tools. Section 5 is the conclusion.
2 Data cleaning problems
This section classifies the major data quality problems to be solved by data cleaning and data transformation.
As we will see, these problems are closely related and should thus be treated in a uniform way. Data
transformations [26] are needed to support any changes in the structure, representation or content of data.
These transformations become necessary in many situations, e.g., to deal with schema evolution, to migrate a
legacy system to a new information system, or to integrate multiple data sources.
As shown in Fig. 2 we roughly distinguish between single-source and multi-source problems and between
schema- and instance-related problems. Schema-level problems of course are also reflected in the instances;
they can be addressed at the schema level by an improved schema design (schema evolution), schema
translation and schema integration. Instance-level problems, on the other hand, refer to errors and
inconsistencies in the actual data contents which are not visible at the schema level. They are the primary
focus of data cleaning. Fig. 2 also indicates some typical problems for the various cases. While not shown in
Fig. 2, the single-source problems also occur (with increased likelihood) in the multi-source case, in addition
to the problems specific to multiple sources.
Data Quality Problems
- Single-Source Problems
  - Schema Level (lack of integrity constraints, poor schema design):
    uniqueness, referential integrity, ...
  - Instance Level (data entry errors):
    misspellings, redundancy/duplicates, contradictory values, ...
- Multi-Source Problems
  - Schema Level (heterogeneous data models and schema designs):
    naming conflicts, structural conflicts, ...
  - Instance Level (overlapping, contradicting and inconsistent data):
    inconsistent aggregating, inconsistent timing, ...

Figure 2. Classification of data quality problems in data sources
2.1 Single-source problems
The data quality of a source largely depends on the degree to which it is governed by schema and integrity
constraints controlling permissible data values. For sources without schema, such as files, there are few
restrictions on what data can be entered and stored, giving rise to a high probability of errors and
inconsistencies. Database systems, on the other hand, enforce restrictions of a specific data model (e.g., the
relational approach requires simple attribute values, referential integrity, etc.) as well as application-specific
integrity constraints. Schema-related data quality problems thus occur because of the lack of appropriate
model-specific or application-specific integrity constraints, e.g., due to data model limitations or poor
schema design, or because only a few integrity constraints were defined to limit the overhead for integrity
control. Instance-specific problems relate to errors and inconsistencies that cannot be prevented at the
schema level (e.g., misspellings).
Scope       | Problem                         | Dirty Data                               | Reasons/Remarks
------------|---------------------------------|------------------------------------------|------------------------------------------------
Attribute   | Illegal values                  | bdate=30.13.70                           | values outside of domain range
Record      | Violated attribute dependencies | age=22, bdate=12.02.70                   | age = (current date - birth date) should hold
Record type | Uniqueness violation            | emp1=(name="John Smith", SSN="123456"),  | uniqueness for SSN (social security number)
            |                                 | emp2=(name="Peter Miller", SSN="123456") | violated
Source      | Referential integrity violation | emp=(name="John Smith", deptno=127)      | referenced department (127) not defined

Table 1. Examples for single-source problems at schema level (violated integrity constraints)
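The Python sketch below (illustrative field names; a DBMS would enforce most of these declaratively where the schema allows) shows how the four violated constraints of Table 1 could be checked programmatically:

    from datetime import date, datetime

    def violations(employees, departments):
        seen_ssn = set()
        for e in employees:
            # Attribute scope: illegal value (bdate outside domain range)
            try:
                bdate = datetime.strptime(e["bdate"], "%d.%m.%y").date()
            except ValueError:
                yield ("illegal value", e)
                continue
            # Record scope: dependency age = (current date - birth date)
            if abs((date.today() - bdate).days // 365 - e["age"]) > 1:
                yield ("violated attribute dependency", e)
            # Record type scope: uniqueness of SSN
            if e["ssn"] in seen_ssn:
                yield ("uniqueness violation", e)
            seen_ssn.add(e["ssn"])
            # Source scope: referential integrity of deptno
            if e["deptno"] not in departments:
                yield ("referential integrity violation", e)

    emps = [
        {"name": "John Smith", "bdate": "30.13.70", "age": 22,
         "ssn": "123456", "deptno": 127},
        {"name": "Peter Miller", "bdate": "12.02.70", "age": 22,
         "ssn": "123456", "deptno": 127},
    ]
    for kind, rec in violations(emps, departments={1, 2, 3}):
        print(kind, rec["name"])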
For both schema- and instance-level problems we can distinguish four different problem scopes: attribute (field),
record, record type and source; examples for the various cases are shown in Tables 1 and 2. Note that
uniqueness constraints specified at the schema level do not prevent duplicated instances, e.g., if information
on the same real world entity is entered twice with different attribute values (see example in Table 2).
Scope       | Problem                         | Dirty Data                                | Reasons/Remarks
------------|---------------------------------|-------------------------------------------|------------------------------------------------
Attribute   | Missing values                  | phone=9999-999999                         | unavailable values during data entry
            |                                 |                                           | (dummy values or null)
            | Misspellings                    | city="Liipzig"                            | usually typos, phonetic errors
            | Cryptic values, abbreviations   | experience="B"; occupation="DB Prog."     |
            | Embedded values                 | name="J. Smith 12.02.70 New York"         | multiple values entered in one attribute
            |                                 |                                           | (e.g., in a free-form field)
            | Misfielded values               | city="Germany"                            |
Record      | Violated attribute dependencies | city="Redmond", zip=77777                 | city and zip code should correspond
Record type | Word transpositions             | name1="J. Smith", name2="Miller P."       | usually in a free-form field
            | Duplicated records              | emp1=(name="John Smith", ...);            | same employee represented twice due to
            |                                 | emp2=(name="J. Smith", ...)               | data entry errors
            | Contradicting records           | emp1=(name="John Smith", bdate=12.02.70); | the same real-world entity is described by
            |                                 | emp2=(name="John Smith", bdate=12.12.70)  | different values
Source      | Wrong references                | emp=(name="John Smith", deptno=17)        | referenced department (17) is defined but wrong

Table 2. Examples for single-source problems at instance level
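As a small illustration of instance-level cleaning, the Python sketch below flags possible duplicates like the emp1/emp2 pair in Table 2; difflib serves here as a simple stand-in for the approximate string matching that dedicated cleaning tools provide:

    from difflib import SequenceMatcher

    def similar(a, b, threshold=0.6):
        """Approximate string match on lowercased values."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

    emps = [
        {"id": 1, "name": "John Smith", "bdate": "12.02.70"},
        {"id": 2, "name": "J. Smith",   "bdate": "12.02.70"},
    ]

    # Pairwise comparison; real approaches block or sort the data first
    # to avoid comparing all O(n^2) record pairs.
    for i, e1 in enumerate(emps):
        for e2 in emps[i + 1:]:
            if e1["bdate"] == e2["bdate"] and similar(e1["name"], e2["name"]):
                print("possible duplicate:", e1["id"], e2["id"])

Note that both records would satisfy a schema-level uniqueness constraint on an id attribute; only an instance-level comparison reveals that they likely describe the same employee.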