Data Mining and Business Intelligence
Lecture 3: Data Pre-processing (Integration, Reduction, Transformation)
By Dr. Nora Shoaip
Damanhour University
Faculty of Computers & Information Sciences
Department of Information Systems
2024 - 2025
Data Integration
• Entity Identification Problem
• Redundancy and correlation analysis
• Tuple duplication
Data Integration
Merging data from multiple data stores
Helps reduce and avoid redundancies and inconsistencies in the resulting data set
Challenges:
Semantic heterogeneity: the entity identification problem
Structure of data: functional dependencies and referential constraints
Redundancy
Data Integration
Entity Identification Problem
Schema integration and object matching: e.g. how can the analyst be sure that customer_id in one database and cust_number in another refer to the same attribute?
Metadata (name, meaning, data type, range of permitted values, and null rules for handling blank, zero, or null values) can help avoid errors in schema integration and data transformation
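As a minimal sketch, such metadata can be compared programmatically to flag candidate matches; the attribute names and metadata fields below are illustrative, not from the lecture:

```python
# Hypothetical sketch: compare metadata of attributes from two sources
# to flag candidates for the same real-world entity.
meta_a = {"name": "customer_id", "type": "int", "range": (1, 99999), "nulls": "not allowed"}
meta_b = {"name": "cust_number", "type": "int", "range": (1, 99999), "nulls": "not allowed"}

def compatible(m1, m2):
    # Same data type, permitted range, and null rule -> candidate match
    return all(m1[k] == m2[k] for k in ("type", "range", "nulls"))

print(compatible(meta_a, meta_b))  # True: these may be the same attribute
```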
Data Integration
Redundancy and Correlation Analysis
An attribute may be redundant if it can be derived from another attribute or set of attributes. Some redundancies can be detected by correlation analysis.
For nominal data: the χ² (chi-square) test of independence
χ² = Σ_i Σ_j (o_ij − e_ij)² / e_ij, where o_ij is the observed frequency and e_ij = count(A = a_i) × count(B = b_j) / n is the expected frequency
Data Integration
Redundancy and Correlation Analysis
Preferred reading vs. gender (observed frequencies):

              male    female    Total
Fiction        250       200      450
Non-fiction     50      1000     1050
Total          300      1200     1500
Data Integration
Redundancy and Correlation Analysis
Observed frequencies, with expected frequencies in parentheses (e.g. e(male, fiction) = 300 × 450 / 1500 = 90):

              male         female        Total
Fiction        250 (90)     200 (360)      450
Non-fiction     50 (210)   1000 (840)     1050
Total          300         1200           1500
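Plugging the observed and expected frequencies into the χ² formula:

```latex
\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210}
       + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840}
       = 284.44 + 121.90 + 71.11 + 30.48 = 507.93
```

With (2 − 1)(2 − 1) = 1 degree of freedom, 507.93 far exceeds 10.828, the value needed to reject the independence hypothesis at the 0.001 significance level, so gender and preferred reading are strongly correlated in this group.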
Data Integration
Redundancy and Correlation Analysis
For numeric data: the correlation coefficient (Pearson's product-moment coefficient)
r_A,B = Σ_i (a_i − Ā)(b_i − B̄) / (n σ_A σ_B)
−1 ≤ r_A,B ≤ +1: values greater than 0 indicate positive correlation, 0 indicates no linear correlation, and values less than 0 indicate negative correlation
Covariance also assesses how two attributes change together:
Cov(A, B) = E[(A − Ā)(B − B̄)] = Σ_i (a_i − Ā)(b_i − B̄) / n = E(A·B) − Ā·B̄
r_A,B = Cov(A, B) / (σ_A σ_B)
Data Integration
Redundancy and Correlation Analysis
e.g. stock prices observed at five time points:

Time point   AllElectronics   HighTech
T1                 6              20
T2                 5              10
T3                 4              14
T4                 3               5
T5                 2               5
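Working through the covariance for this table:

```latex
\bar{A} = \frac{6+5+4+3+2}{5} = 4, \qquad \bar{B} = \frac{20+10+14+5+5}{5} = 10.8
```
```latex
\operatorname{Cov}(A,B) = \frac{6\cdot 20 + 5\cdot 10 + 4\cdot 14 + 3\cdot 5 + 2\cdot 5}{5}
                        - 4 \times 10.8 = 50.2 - 43.2 = 7
```

The covariance is positive, so the AllElectronics and HighTech prices tend to rise together rather than being independent.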
Data Integration
More Issues
Tuple duplication
The use of denormalized tables (often done to improve performance by avoiding joins) is another source of data redundancy.
e.g. repeating a purchaser's name and address with every purchase record
Data value conflict
e.g. grading systems in two different institutes: A, B, … versus 90%, 80%, …
Data Reduction
• Wavelet transforms
• PCA
• Attribute subset selection
• Regression
• Histograms
• Clustering
• Sampling
Data Reduction
Strategies
Dimensionality reduction: reduce the number of attributes
◦Wavelet transforms, PCA, attribute subset selection
Numerosity reduction: replace the original data volume by a smaller data representation
◦Parametric: a model is used to estimate the data, so only the model parameters are stored
Regression
◦Nonparametric: store reduced representations of the data
Histograms, clustering, sampling
Compression: transformations applied to obtain a "compressed" representation of the original data
◦Lossless, lossy
Data Reduction
Attribute Subset Selection
Find a minimum set of attributes such that the resulting probability distribution of the data is as close as possible to the original distribution obtained using all attributes
An exhaustive search can be prohibitively expensive
Heuristic (greedy) search:
◦Stepwise forward selection: start with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the reduced set. At each subsequent iteration, the best of the remaining attributes is added to the set (see the sketch after this list)
◦Stepwise backward elimination: start with the full set of attributes. At each step, remove the worst attribute remaining in the set
◦Combination of forward selection and backward elimination
◦Decision tree induction: attributes that do not appear in the tree are considered irrelevant
Attribute construction: e.g. an area attribute derived from height and width attributes
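A minimal sketch of stepwise forward selection, assuming a pandas DataFrame X of attributes, labels y, and a scikit-learn model; the helper name and cross-validated scoring are illustrative choices:

```python
# Sketch of stepwise forward selection: greedily add the attribute that
# most improves a cross-validated score, stopping when nothing helps.
from sklearn.model_selection import cross_val_score

def forward_select(model, X, y, attributes):
    selected, best_score = [], float("-inf")
    while True:
        # Score each candidate attribute added to the current reduced set
        candidates = {}
        for a in attributes:
            if a not in selected:
                candidates[a] = cross_val_score(model, X[selected + [a]], y, cv=5).mean()
        if not candidates:
            break                      # every attribute already selected
        a, score = max(candidates.items(), key=lambda kv: kv[1])
        if score <= best_score:
            break                      # no remaining attribute improves the score
        selected.append(a)
        best_score = score
    return selected
```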
Data Reduction
Attribute Subset Selection
(Figure: greedy approaches to attribute subset selection.)
Data Reduction - Numerosity Reduction
Regression
Data is modeled to fit a straight line
A random variable y (response variable) can be modeled as a linear function of another random variable x (predictor variable)
Regression line equation: y = wx + b
w and b are regression coefficients; they specify the slope of the line and the y-intercept
They are solved for by the method of least squares, which minimizes the error between the actual data and the estimate of the line (the best-fitting line)
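Minimizing the squared error gives the standard closed-form solution for the coefficients:

```latex
w = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad b = \bar{y} - w\,\bar{x}
```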
Data Reduction
Regression

   X       Y
 1.00    1.00
 2.00    2.00
 3.00    1.30
 4.00    3.75
 5.00    2.25
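As a quick check on this example, a least-squares fit with NumPy (any equivalent routine works):

```python
# Fit y = wx + b to the table above by least squares.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 2.0, 1.3, 3.75, 2.25])

w, b = np.polyfit(x, y, deg=1)  # degree-1 fit = straight line
print(w, b)                     # w = 0.425, b = 0.785
```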
Data Reduction
Histograms
A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, referred to as buckets or bins.
A bucket holding a single attribute-value/frequency pair is a singleton bucket.
Often, buckets represent continuous ranges for the given attribute:
Equal-width: the width of each bucket range is uniform (e.g. a width of $10 per bucket).
Equal-frequency (or equal-depth): the frequency of each bucket is roughly constant (i.e. each bucket contains roughly the same number of contiguous data samples).
Data Reduction
Histograms
The following data are a list of AllElectronics prices for commonly sold items (rounded to the nearest dollar), sorted:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30
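A small sketch of an equal-width histogram (bucket width $10) over this price list, using NumPy:

```python
# Equal-width histogram: three $10-wide buckets over the sorted prices.
import numpy as np

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14,
          14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18,
          18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21,
          21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

counts, edges = np.histogram(prices, bins=[1, 11, 21, 31])
print(counts)  # [13 25 14] items priced $1-10, $11-20, $21-30
```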
Data Reduction
Sampling
A large data set is represented by a smaller random data sample (see the sketch after this list)
Simple random sample without replacement (SRSWOR) of size s: draw s of the N tuples (s < N)
◦all tuples are equally likely to be sampled
Simple random sample with replacement (SRSWR) of size s: similar to SRSWOR, but each time a tuple is drawn, it is recorded and then placed back, so it may be drawn again
Cluster sample: if the tuples are grouped into M "clusters," an SRS of s clusters can be obtained
Stratified sample: if the tuples are divided into strata, a stratified sample is generated by obtaining an SRS at each stratum
◦e.g. a stratum is created for each customer age group
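Minimal sketches of the four schemes using NumPy's random generator; the sizes, cluster counts, and strata below are illustrative:

```python
# Sampling sketches: SRSWOR, SRSWR, cluster, and stratified sampling
# over tuple indices 0..N-1.
import numpy as np

rng = np.random.default_rng(seed=42)
N, s = 1000, 100
srswor = rng.choice(N, size=s, replace=False)   # without replacement
srswr  = rng.choice(N, size=s, replace=True)    # with replacement: repeats possible

# Cluster sample: pick whole clusters out of M, then keep all their tuples.
M, s_clusters = 20, 4
chosen_clusters = rng.choice(M, size=s_clusters, replace=False)

# Stratified sample: SRS within each stratum (e.g. one stratum per age group).
strata = {"youth": np.arange(0, 300),
          "adult": np.arange(300, 800),
          "senior": np.arange(800, 1000)}
stratified = {name: rng.choice(idx, size=10, replace=False)
              for name, idx in strata.items()}
```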
Transformation and Discretization
Transformation Strategies
Smoothing: binning, regression
Attribute construction
Aggregation
Normalization: attribute data scaled to fall within a smaller range, such as -1.0 to 1.0
Discretization: raw values of a numeric attribute (e.g. age) replaced by interval labels (e.g. 0–10, 11–20) or conceptual labels (e.g. youth, adult, senior)
Concept hierarchy generation: e.g. street generalized to higher-level concepts (city or country)
Transformation and Discretization
Transformation by Normalization
Helps avoid dependence on the choice of measurement units
Gives all attributes equal weight
Methods:
min-max normalization
z-score normalization
Transformation and Discretization
Transformation by Normalization
Min-max normalization: maps a value v of an attribute A from the original range [min_A, max_A] to a new range [new_min_A, new_max_A]:
v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
z-score normalization: values of A are normalized based on the mean Ā and standard deviation σ_A of A:
v' = (v − Ā) / σ_A
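A small sketch of both methods; the income figures are illustrative:

```python
# Min-max and z-score normalization of a single value.
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    return (v - mean) / std

# e.g. an income of 73,600 with min 12,000 and max 98,000:
print(min_max(73_600, 12_000, 98_000))   # ~0.716, mapped into [0, 1]
# with mean 54,000 and standard deviation 16,000:
print(z_score(73_600, 54_000, 16_000))   # 1.225 standard deviations above the mean
```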
Transformation and Discretization
Concept Hierarchy
A concept hierarchy organizes concepts (i.e. attribute values) hierarchically
Concept hierarchies facilitate drilling and rolling to view data at multiple granularities
Concept hierarchy formation: recursively reduce the data by collecting and replacing low-level concepts (e.g. age values) with higher-level concepts (e.g. age groups: youth, adult, or senior)
Concept hierarchies can be explicitly specified by domain experts
They can also be formed automatically, via discretization for numeric data and via the techniques below for nominal data
Transformation and Discretization
Concept Hierarchy
For nominal data:
Specification of a partial ordering of attributes explicitly at the schema level by users or experts
◦e.g. street < city < province_or_state < country
Specification of a set of attributes, but not of their partial ordering: the ordering is then generated automatically by the system
◦e.g. for a location hierarchy, country contains far fewer distinct values than street, so a concept hierarchy can be generated automatically based on the number of distinct values per attribute in the given attribute set (see the sketch after this list)
◦This heuristic does not hold for all concepts: for time, year may have ~20 distinct values, month 12, and day of week only 7, yet year belongs at the top of the hierarchy
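A sketch of the distinct-value heuristic with pandas; the toy location data is illustrative:

```python
# Order nominal attributes by number of distinct values: the attribute
# with the fewest distinct values goes to the top of the hierarchy.
import pandas as pd

df = pd.DataFrame({
    "country": ["Egypt", "Egypt", "Canada", "Canada"],
    "city":    ["Damanhour", "Cairo", "Toronto", "Toronto"],
    "street":  ["St A", "St B", "St C", "St D"],
})
levels = sorted(df.columns, key=lambda c: df[c].nunique())  # fewest first
print(" < ".join(reversed(levels)))  # street < city < country
```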
Summary
Cleaning: binning, regression, outlier analysis
Integration: correlation analysis
Reduction: regression, histograms, clustering, attribute construction, wavelet transforms, PCA, attribute subset selection, sampling
Transformation/Discretization: binning, regression, correlation analysis, histogram analysis, clustering, attribute construction, aggregation, normalization, concept hierarchy