Chapter 3: Data Preprocessing
◼ Data Preprocessing: An Overview
◼ Data Quality
◼ Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Data Transformation and Data Discretization
◼ Summary
Data Quality: Why Preprocess the Data?
◼ Measures for data quality: A multidimensional view
◼ Accuracy: correct or wrong, accurate or not
◼ Completeness: not recorded, unavailable, …
◼ Consistency: some modified but some not, dangling, …
◼ Timeliness: timely update?
◼ Believability: how much are the data to be trusted?
◼ Interpretability: how easily can the data be understood?
Major Tasks in Data Preprocessing
◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
◼ Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation
Data Cleaning
◼ Data in the Real World Is Dirty: lots of potentially incorrect data,
e.g., faulty instruments, human or computer error, transmission errors
◼ incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
◼ e.g., Occupation=“ ” (missing data)
◼ noisy: containing noise, errors, or outliers
◼ e.g., Salary=“−10” (an error)
◼ inconsistent: containing discrepancies in codes or names, e.g.,
◼ Age=“42”, Birthday=“03/07/2010”
◼ Was rating “1, 2, 3”, now rating “A, B, C”
◼ discrepancy between duplicate records
◼ Intentional (e.g., disguised missing data)
◼ Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
◼ Data is not always available
◼ E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
◼ Missing data may be due to
◼ equipment malfunction
◼ inconsistent with other recorded data and thus deleted
◼ data not entered due to misunderstanding
◼ certain data may not be considered important at the
time of entry
◼ history or changes of the data were not registered
◼ Missing data may need to be inferred
How to Handle Missing Data?
◼ Ignore the tuple: usually done when class label is missing
(when doing classification)—not effective when the % of
missing values per attribute varies considerably
◼ Fill in the missing value manually: tedious + infeasible?
◼ Fill it in automatically with
◼ a global constant: e.g., “unknown” (a new class?!)
◼ the attribute mean
◼ the attribute mean for all samples belonging to the
same class: smarter
◼ the most probable value: inference-based methods such as a
Bayesian formula or a decision tree
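A minimal pandas sketch of the automatic fill-in strategies above; the DataFrame, its column names, and the sentinel value are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical sales data: 'income' has missing values, 'class' is the label.
df = pd.DataFrame({
    "income": [52000, np.nan, 48000, np.nan, 61000, 39000],
    "class":  ["A", "A", "B", "B", "A", "B"],
})

# Global constant: mark missing values with a sentinel (a new "unknown" class).
filled_const = df["income"].fillna(-1)

# Attribute mean over all tuples.
filled_mean = df["income"].fillna(df["income"].mean())

# Smarter: attribute mean within each class.
filled_class_mean = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)
```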
Noisy Data
◼ Noise: random error or variance in a measured variable
◼ Incorrect attribute values may be due to
◼ faulty data collection instruments
◼ data entry problems
◼ data transmission problems
◼ technology limitation
◼ inconsistent naming conventions
◼ Other data problems which require data cleaning
◼ duplicate records
◼ incomplete data
◼ inconsistent data
How to Handle Noisy Data?
◼ Binning
◼ first sort the data and partition them into (equal-frequency) bins
◼ then smooth by bin means, bin medians, or bin boundaries, etc.
(see the sketch after this list)
◼ Regression
◼ smooth by fitting the data into regression functions
◼ Clustering
◼ detect and remove outliers
◼ Combined computer and human inspection
◼ detect suspicious values and check by human (e.g.,
deal with possible outliers)
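A minimal NumPy sketch of the binning approach above, using a small made-up price list and three equal-frequency bins:

```python
import numpy as np

# Sorted data, partitioned into 3 equal-frequency bins.
prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = np.array_split(prices, 3)

# Smoothing by bin means: every value becomes its bin's mean.
by_means = np.concatenate([np.full(len(b), b.mean()) for b in bins])

# Smoothing by bin boundaries: every value snaps to the closer of
# its bin's minimum and maximum.
by_bounds = np.concatenate([
    np.where(b - b.min() < b.max() - b, b.min(), b.max()) for b in bins
])

print(by_means)   # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
print(by_bounds)  # [ 4  4 15 21 21 24 25 25 34]
```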
Data Cleaning as a Process
◼ Data discrepancy detection
◼ Use metadata (e.g., domain, range, dependency, distribution)
◼ Check field overloading
◼ Check uniqueness rule, consecutive rule and null rule
◼ Use commercial tools
◼ Data scrubbing: use simple domain knowledge (e.g., postal
code, spell-check) to detect errors and make corrections
◼ Data auditing: analyze the data to discover rules and
relationships and to detect violators (e.g., use correlation and
clustering to find outliers)
◼ Data migration and integration
◼ Data migration tools: allow transformations to be specified
◼ ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface
◼ Integration of the two processes
◼ Iterative and interactive (e.g., Potter’s Wheel)
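A minimal pandas sketch of rule-based discrepancy detection; the customer table and the specific rules (uniqueness of cust_id, a non-null postal_code, age restricted to [0, 120]) are hypothetical:

```python
import pandas as pd

# Hypothetical customer records with injected discrepancies.
df = pd.DataFrame({
    "cust_id":     [1, 2, 2, 4],                      # duplicate id
    "age":         [34, -5, 52, 130],                 # out-of-range values
    "postal_code": ["10115", None, "99999", "10115"], # a missing value
})

# Uniqueness rule: cust_id must identify a single tuple.
dup_ids = df[df["cust_id"].duplicated(keep=False)]

# Null rule: postal_code may not be missing.
missing_postal = df[df["postal_code"].isna()]

# Domain metadata: age must lie in [0, 120].
bad_age = df[~df["age"].between(0, 120)]
```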
Data Integration
◼ Data integration:
◼ Combines data from multiple sources into a coherent store
◼ Schema integration: e.g., A.cust-id ≡ B.cust-#
◼ Integrate metadata from different sources
◼ Entity identification problem:
◼ Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
◼ Detecting and resolving data value conflicts
◼ For the same real world entity, attribute values from different
sources are different
◼ Possible reasons: different representations, different scales, e.g.,
metric vs. British units
Handling Redundancy in Data Integration
◼ Redundant data often occur when integrating multiple
databases
◼ Object identification: The same attribute or object
may have different names in different databases
◼ Derivable data: One attribute may be a “derived”
attribute in another table, e.g., annual revenue
◼ Redundant attributes can often be detected by
correlation analysis and covariance analysis
◼ Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality
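As a small sketch, near-perfect correlations in an integrated table can flag candidate redundant attributes; the synthetic data below, in which revenue_k_usd is derivable from revenue_usd, is made up for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
revenue = rng.uniform(1e5, 1e6, size=50)

# Integrated table: the second column is the first in different units.
df = pd.DataFrame({
    "revenue_usd":   revenue,
    "revenue_k_usd": revenue / 1000,   # derived, hence redundant
    "n_employees":   rng.integers(10, 500, size=50),
})

# Pairwise Pearson correlations; values near +/-1 suggest redundancy.
print(df.corr().round(2))
```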
Correlation Analysis (Numeric Data)
◼ Correlation coefficient (also called Pearson’s product
moment coefficient)
$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A\,\sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A\,\sigma_B}$$

where $n$ is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective
means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations
of A and B, and $\sum a_i b_i$ is the sum of the AB cross-products.
◼ If $r_{A,B} > 0$, A and B are positively correlated (A’s values
increase as B’s do); the larger the value, the stronger the correlation.
◼ $r_{A,B} = 0$: uncorrelated (no linear relationship); $r_{A,B} < 0$: negatively correlated
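A quick NumPy check that the two forms of the formula agree, reusing the small stock series from the covariance example later in this section:

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])
n = len(a)
A_bar, B_bar = a.mean(), b.mean()
sA, sB = a.std(ddof=1), b.std(ddof=1)   # sample standard deviations

# Definition form and cross-product form of Pearson's r.
r1 = ((a - A_bar) * (b - B_bar)).sum() / ((n - 1) * sA * sB)
r2 = ((a * b).sum() - n * A_bar * B_bar) / ((n - 1) * sA * sB)
assert np.isclose(r1, r2)
assert np.isclose(r1, np.corrcoef(a, b)[0, 1])  # matches NumPy's Pearson r
```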
Visually Evaluating Correlation
[Figure: scatter plots illustrating correlation values ranging from –1 to 1.]
Correlation (viewed as linear relationship)
◼ Correlation measures the linear relationship
between objects
◼ To compute correlation, we standardize the data objects A and B
and then take their dot product (divided by $n-1$, since the sample
standard deviation is used):

$$a'_k = \frac{a_k - \mathrm{mean}(A)}{\mathrm{std}(A)}, \qquad b'_k = \frac{b_k - \mathrm{mean}(B)}{\mathrm{std}(B)}$$

$$\mathrm{correlation}(A, B) = \frac{A' \cdot B'}{n-1}$$
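The same computation via standardization, as a brief NumPy sketch (the division by n − 1 matches the sample-standard-deviation convention assumed above):

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

# Standardize: subtract the mean, divide by the sample standard deviation.
a_std = (a - a.mean()) / a.std(ddof=1)
b_std = (b - b.mean()) / b.std(ddof=1)

# Dot product of the standardized vectors, scaled by n - 1.
r = a_std @ b_std / (len(a) - 1)
assert np.isclose(r, np.corrcoef(a, b)[0, 1])
```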
Covariance (Numeric Data)
◼ Covariance is similar to correlation:

$$\mathrm{Cov}(A, B) = E\big[(A - \bar{A})(B - \bar{B})\big] = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n}$$

Correlation coefficient:

$$r_{A,B} = \frac{\mathrm{Cov}(A, B)}{\sigma_A\,\sigma_B}$$

where $n$ is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means or
expected values of A and B, and $\sigma_A$ and $\sigma_B$ are the respective standard
deviations of A and B.
◼ Positive covariance: if $\mathrm{Cov}_{A,B} > 0$, then A and B both tend to be larger
than their expected values.
◼ Negative covariance: if $\mathrm{Cov}_{A,B} < 0$, then when A is larger than its expected
value, B is likely to be smaller than its expected value.
◼ Independence: if A and B are independent, then $\mathrm{Cov}_{A,B} = 0$; but the converse is not true:
◼ Some pairs of random variables may have a covariance of 0 without being
independent. Only under additional assumptions (e.g., the data follow
multivariate normal distributions) does a covariance of 0 imply independence.
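A minimal NumPy sketch relating covariance to the correlation coefficient, using population forms (dividing by n) and the same example series:

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

# Population covariance: mean of the cross-deviations.
cov = ((a - a.mean()) * (b - b.mean())).mean()

# r = Cov(A, B) / (sigma_A * sigma_B), with population standard deviations.
r = cov / (a.std() * b.std())
assert np.isclose(r, np.corrcoef(a, b)[0, 1])
```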
Co-Variance: An Example
◼ Computation can be simplified as

$$\mathrm{Cov}(A, B) = E(A \cdot B) - \bar{A}\,\bar{B}$$
◼ Suppose two stocks A and B have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
◼ Question: If the stocks are affected by the same industry trends, will
their prices rise or fall together?
◼ E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
◼ E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
◼ Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
◼ Thus, A and B rise together since Cov(A, B) > 0.
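A short NumPy check of this worked example using the simplified form:

```python
import numpy as np

A = np.array([2, 3, 5, 4, 6], dtype=float)
B = np.array([5, 8, 10, 11, 14], dtype=float)

# Cov(A, B) = E(A*B) - E(A)*E(B)
cov = (A * B).mean() - A.mean() * B.mean()
print(A.mean(), B.mean(), cov)  # 4.0 9.6 4.0 -> positive: they rise together
```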