Data Mining and Business Intelligence
Overview
Data
Data Pre-processing Cleaning
Integration
By
Dr. Nora Shoaip
Lecture 3
Damanhour University
Faculty of Computers & Information Sciences
Department of Information Systems
2023 - 2024
Quiz
Draw the Box-Plot for the following dataset
4.3, 5.1, 3.9, 4.5, 4.4, 4.9, 5.0, 4.7, 4.1, 4.6, 4.4, 4.3, 4.8,
4.4, 4.2, 4.5, 4.4
2
Quiz
21 3
Overview
Databases are highly susceptible to noisy, missing, and
inconsistent data
Low-quality data will lead to low-quality mining results
“How can the data be preprocessed in order to help improve the
quality of the data and, consequently, of the mining results?
How can the data be preprocessed so as to improve the efficiency
and ease of the mining process?”
4
Why Preprocess Data?
To satisfy the requirements of the intended use
Factors of data quality:
◦Accuracy lack of due to faulty instruments, errors caused by
human/computer/transmission, deliberate errors …
◦Completeness lack of due to different design phases, optional attributes
◦Consistency lack of due to semantics, data types, field formats …
◦Timeliness
◦Believability how much the data are trusted by users
◦Interpretability how easy the data are understood
5
Major Preprocessing Tasks
That Improve Quality of Data
Data cleaning filling in missing values, smoothing noisy data,
identifying or removing outliers, and resolving inconsistencies
Data integration include data from multiple sources in your analysis,
map semantic concepts, infer attributes …
Data reduction obtain a reduced representation of the data set that
is much smaller in volume, while producing almost the same analytical
results
Discretization raw data values for attributes are replaced by ranges
or higher conceptual levels
Data transformation normalization
6
Data Cleaning
Data in the Real World Is Dirty!
◦incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
e.g., Occupation=“ ” (missing data)
◦noisy: containing noise, errors, or outliers
e.g., Salary=“−10” (an error)
◦inconsistent: containing discrepancies in codes or names, e.g.,
Age=“42”, Birthday=“03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records
◦Intentional Jan. 1 as everyone’s birthday?
7
Data Cleaning
… fill in missing values, smooth out noise while identifying outliers,
and correct inconsistencies in the data
A missing value may not imply an error in the data!
◦e.g. driver’s license number
8
Data Cleaning
Missing Values
Ignore the tuple not very effective, unless the tuple contains
several attributes with missing values
Fill in the missing value manually time consuming, not
feasible for large data sets
Use a global constant replace all missing attribute values by
same value (e.g. unknown)
may mistakenly think that “unknown” is an interesting concept
9
Data Cleaning
Missing Values
Use mean or median For normal (symmetric) data
distributions, the mean is used, while skewed data distribution
should employ the median
Use mean or median for all samples belonging to the same
class as the given tuple e.g. mean or median of customers in
a certain age group
Use the most probable value using regression, inference-
based tools such as Bayesian formula or decision tree
Most popular
10
Data Cleaning
Noisy Data
Noise is a random error or variance in a measured
variable
Data smoothing techniques:
1. Binning
2. Regression
3. Outlier Analysis
11
Data Cleaning
Noisy Data
1. Binning smooth a sorted data value by consulting its
“neighborhood”
◦sorted values are partitioned into a # of “buckets,” or bins local
smoothing
◦equal-frequency bins each bin has same # of values
◦equal-width bins interval range of values per bin is constant
Smoothing by bin means each bin value is replaced by the bin mean
Smoothing by bin medians each bin value is replaced by the bin median
Smoothing by bin boundaries each bin value is replaced by the closest
boundary value (min & max in a bin are bin boundaries)
12
Data Cleaning
Partition into (equal-
Noisy Data frequency) bins
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Example: Sorted data for price (in dollars): Bin 3: 25, 28, 34
4, 8, 15, 21, 21, 24, 25, 28, 34 Smoothing by bin means
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
13
Data Cleaning
Noisy Data Partition into (equal-width)
bins
Bin 1: 4, 8, 15
Example: Sorted data for price (in dollars): Bin 2: 21, 21, 24, 25, 28
4, 8, 15, 21, 21, 24, 25, 28, 34 Bin 3: 34
Smoothing by bin means
Bin 1: 9, 9, 9
Bin 2: 24, 24, 24,24,24
Bin 3: 34
Smoothing by bin boundaries
Bin 1: 4, 4, 15
Bin 2: 21, 21, 21, 28, 28
Bin 3: 34
14
Data Cleaning
Noisy Data
2. Regression Conform data values to a function
◦Linear regression find “best” line to fit two attributes so that one
attribute can be used to predict the other
3. Outlier Analysis
Potter’s Wheel Automated interactive data
cleaning tool
15
Data Integration
Entity Identification Problem
Redundancy and correlation analysis
Tuple duplication
Tuple duplication
Data value conflict detection
16
Data Integration
Merging data from multiple data stores
Helps reduce and avoid redundancies and inconsistencies in the resulting data set
Challenges:
Semantic heterogeneity entity identification problem
Structure of data functional dependencies and referential constraints
Redundancy
17
Data Integration
Entity Identification Problem
Schema integration and object matching
Metadata name, meaning, data type, and range of values
permitted, null rules for handling blank, zero, or null values
can help avoid errors in schema integration and data
transformation
18
Data Integration
Redundancy and Correlation Analysis
19
Data Integration
Redundancy and Correlation Analysis
gender
male female Total
Fiction 250 200 450
Preferred Non-fiction 50 1000 1050
reading
Total 300 1200 1500
20
Data Integration
Redundancy and Correlation Analysis
gender
male female Total
Fiction 250 (90) 200 (360) 450
Preferred Non-fiction 50(210) 1000 (840) 1050
reading
Total 300 1200 1500
21
Data Integration
Redundancy and Correlation Analysis
gender
male female Total
Fiction 250 (90) 200 (360) 450
Preferred Non-fiction 50(210) 1000 (840) 1050
reading
Total 300 1200 1500
22
Data Integration
Redundancy and Correlation Analysis
23
Data Integration
Redundancy and Correlation Analysis
24
Data Integration
Redundancy and Correlation Analysis
25
Data Integration
Redundancy and Correlation Analysis
Time AllElectronics HighTech
point
T1 6 20
T2 5 10
T3 4 14
T4 3 5
T5 2 5
26
Data Integration
More Issues
Tuple duplication
The use of denormalized tables (often done to improve performance by
avoiding joins) is another source of data redundancy.
e.g. purchaser name and address, and purchases
Data value conflict
e.g. grading system in two different institutes A, B, … versus 90%,
80% …
27
Quiz
Age 23 23 27 27 39 41 47 49 50 52 54 54 56 57 58 58 60 61
%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7
• Calculate the correlation coefficient. Are these two attributes positively or negatively
correlated? Compute their covariance.
o (Hint: n = 18
o SD for Age and fat are 12.85 and 8.99 respectively
o Mean for Age and fat are 46.44 and 28.78 respectively
o E(age* fat) = 1431.29)
• Partition the data into three bins by each of equal-frequency and equal-width partitioning
• Use smoothing by bin boundaries to smooth these data
28
Quiz.. Sol.
Age 23 23 27 27 39 41 47 49 50 52 54 54 56 57 58 58 60 61
%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7
29
Quiz.. Sol.
Age 23 23 27 27 39 41 47 49 50 52 54 54 56 57 58 58 60 61
%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7
• Equal frequency bin for age Equal width bin for age
o Bin 1= 23,23,27,27,39,41 o Bin 1= 23,23,27,27
o Bin 2= 47,49,50,52,54,54 o Bin 2= 39,41,47,49
o Bin 3= 50,52,54,54, 56, 57,58,58,60,61
o Bin 3= 56,57,58,58,60,61
Smoothing by boundary
• Smoothing by boundary o Bin 1= 23,23,27,27
o Bin 1= 23,23,23,23,41,41 o Bin 2= 39,41,47,49
o Bin 2= 47,47,47,54,54,54 o Bin 3= 50,50,50,50, 50, 61,61,61,61,61
o Bin 3= 56,56,56,56,61,61
30