Data Mining and Business Intelligence
Lecture 3: Data Pre-processing (Integration, Reduction, Transformation)
By Dr. Nora Shoaip
Damanhour University
Faculty of Computers & Information Sciences
Department of Information Systems
2024 - 2025
Data Integration
• Entity Identification Problem
• Redundancy and correlation analysis
• Tuple duplication
Data Integration
Merging data from multiple data stores
Helps reduce and avoid redundancies and inconsistencies in the resulting data set
Challenges:
Semantic heterogeneity: the entity identification problem
Structure of data: functional dependencies and referential constraints
Redundancy
Data Integration
Entity Identification Problem
Schema integration and object matching: e.g. how can the analyst be sure that customer_id in one database and cust_number in another refer to the same attribute?
Metadata (name, meaning, data type, range of permitted values, and null rules for handling blank, zero, or null values) can help avoid errors in schema integration and data transformation
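As a minimal sketch, such metadata can be compared programmatically to flag candidate matches; the attribute names and metadata fields below are illustrative, not from the lecture:

```python
# Hypothetical sketch: compare metadata of attributes from two sources
# to flag candidates for the same real-world entity.
meta_a = {"name": "customer_id", "type": "int", "range": (1, 99999), "nulls": "not allowed"}
meta_b = {"name": "cust_number", "type": "int", "range": (1, 99999), "nulls": "not allowed"}

def compatible(m1, m2):
    # Same data type, permitted range, and null rule -> candidate match
    return all(m1[k] == m2[k] for k in ("type", "range", "nulls"))

print(compatible(meta_a, meta_b))  # True: these may be the same attribute
```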
Data Integration
Redundancy and Correlation Analysis
An attribute may be redundant if it can be derived from another attribute or set of attributes. Some redundancies can be detected by correlation analysis.
For nominal data: the χ² (chi-square) test of independence
χ² = Σ_i Σ_j (o_ij − e_ij)² / e_ij, where o_ij is the observed frequency and e_ij = count(A = a_i) × count(B = b_j) / n is the expected frequency
Data Integration
Redundancy and Correlation Analysis
Preferred reading vs. gender (observed frequencies):

              male    female    Total
Fiction        250       200      450
Non-fiction     50      1000     1050
Total          300      1200     1500
Data Integration
Redundancy and Correlation Analysis
Observed frequencies, with expected frequencies in parentheses (e.g. e(male, fiction) = 300 × 450 / 1500 = 90):

              male         female        Total
Fiction        250 (90)     200 (360)      450
Non-fiction     50 (210)   1000 (840)     1050
Total          300         1200           1500
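Plugging the observed and expected frequencies into the χ² formula:

```latex
\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210}
       + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840}
       = 284.44 + 121.90 + 71.11 + 30.48 = 507.93
```

With (2 − 1)(2 − 1) = 1 degree of freedom, 507.93 far exceeds 10.828, the value needed to reject the independence hypothesis at the 0.001 significance level, so gender and preferred reading are strongly correlated in this group.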
Data Integration
Redundancy and Correlation Analysis
For numeric data: the correlation coefficient (Pearson's product-moment coefficient)
r_A,B = Σ_i (a_i − Ā)(b_i − B̄) / (n σ_A σ_B)
−1 ≤ r_A,B ≤ +1: values greater than 0 indicate positive correlation, 0 indicates no linear correlation, and values less than 0 indicate negative correlation
Covariance also assesses how two attributes change together:
Cov(A, B) = E[(A − Ā)(B − B̄)] = Σ_i (a_i − Ā)(b_i − B̄) / n = E(A·B) − Ā·B̄
r_A,B = Cov(A, B) / (σ_A σ_B)
Data Integration
Redundancy and Correlation Analysis
e.g. stock prices observed at five time points:

Time point   AllElectronics   HighTech
T1                 6              20
T2                 5              10
T3                 4              14
T4                 3               5
T5                 2               5
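Working through the covariance for this table:

```latex
\bar{A} = \frac{6+5+4+3+2}{5} = 4, \qquad \bar{B} = \frac{20+10+14+5+5}{5} = 10.8
```
```latex
\operatorname{Cov}(A,B) = \frac{6\cdot 20 + 5\cdot 10 + 4\cdot 14 + 3\cdot 5 + 2\cdot 5}{5}
                        - 4 \times 10.8 = 50.2 - 43.2 = 7
```

The covariance is positive, so the AllElectronics and HighTech prices tend to rise together rather than being independent.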
Data Integration
More Issues
Tuple duplication
The use of denormalized tables (often done to improve performance by avoiding joins) is another source of data redundancy.
e.g. repeating a purchaser's name and address with every purchase record
Data value conflict
e.g. grading systems in two different institutes: A, B, … versus 90%, 80%, …
Data Reduction
• Wavelet transforms
• PCA
• Attribute subset selection
• Regression
• Histograms
• Clustering
• Sampling
Data Reduction
Strategies
Dimensionality reduction: reduce the number of attributes
◦Wavelet transforms, PCA, attribute subset selection
Numerosity reduction: replace the original data volume by a smaller data representation
◦Parametric: a model is used to estimate the data, so only the model parameters are stored
Regression
◦Nonparametric: store reduced representations of the data
Histograms, clustering, sampling
Compression: transformations applied to obtain a "compressed" representation of the original data
◦Lossless, lossy
Data Reduction
Attribute Subset Selection
Find a minimum set of attributes such that the resulting probability distribution of the data is as close as possible to the original distribution obtained using all attributes
An exhaustive search can be prohibitively expensive
Heuristic (greedy) search:
◦Stepwise forward selection: start with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the reduced set. At each subsequent iteration, the best of the remaining attributes is added to the set (see the sketch after this list)
◦Stepwise backward elimination: start with the full set of attributes. At each step, remove the worst attribute remaining in the set
◦Combination of forward selection and backward elimination
◦Decision tree induction: attributes that do not appear in the tree are considered irrelevant
Attribute construction: e.g. an area attribute derived from height and width attributes
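A minimal sketch of stepwise forward selection, assuming a pandas DataFrame X of attributes, labels y, and a scikit-learn model; the helper name and cross-validated scoring are illustrative choices:

```python
# Sketch of stepwise forward selection: greedily add the attribute that
# most improves a cross-validated score, stopping when nothing helps.
from sklearn.model_selection import cross_val_score

def forward_select(model, X, y, attributes):
    selected, best_score = [], float("-inf")
    while True:
        # Score each candidate attribute added to the current reduced set
        candidates = {}
        for a in attributes:
            if a not in selected:
                candidates[a] = cross_val_score(model, X[selected + [a]], y, cv=5).mean()
        if not candidates:
            break                      # every attribute already selected
        a, score = max(candidates.items(), key=lambda kv: kv[1])
        if score <= best_score:
            break                      # no remaining attribute improves the score
        selected.append(a)
        best_score = score
    return selected
```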
Data Reduction
Attribute Subset Selection
(Figure: greedy approaches to attribute subset selection.)
Data Reduction - Numerosity Reduction
Regression
Data is modeled to fit a straight line
A random variable y (response variable) can be modeled as a linear function of another random variable x (predictor variable)
Regression line equation: y = wx + b
w and b are regression coefficients; they specify the slope of the line and the y-intercept
They are solved for by the method of least squares, which minimizes the error between the actual data and the estimate of the line (the best-fitting line)
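Minimizing the squared error gives the standard closed-form solution for the coefficients:

```latex
w = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad b = \bar{y} - w\,\bar{x}
```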
Data Reduction
Regression

   X       Y
 1.00    1.00
 2.00    2.00
 3.00    1.30
 4.00    3.75
 5.00    2.25
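As a quick check on this example, a least-squares fit with NumPy (any equivalent routine works):

```python
# Fit y = wx + b to the table above by least squares.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 2.0, 1.3, 3.75, 2.25])

w, b = np.polyfit(x, y, deg=1)  # degree-1 fit = straight line
print(w, b)                     # w = 0.425, b = 0.785
```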
Data Reduction
Histograms
A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, referred to as buckets or bins.
A bucket holding a single attribute-value/frequency pair is a singleton bucket.
Often, buckets represent continuous ranges for the given attribute:
Equal-width: the width of each bucket range is uniform (e.g. a width of $10 per bucket).
Equal-frequency (or equal-depth): the frequency of each bucket is roughly constant (i.e. each bucket contains roughly the same number of contiguous data samples).
Data Reduction
Histograms
The following data are a list of AllElectronics prices for commonly sold items (rounded to the nearest dollar), sorted:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30
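A small sketch of an equal-width histogram (bucket width $10) over this price list, using NumPy:

```python
# Equal-width histogram: three $10-wide buckets over the sorted prices.
import numpy as np

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14,
          14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18,
          18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21,
          21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

counts, edges = np.histogram(prices, bins=[1, 11, 21, 31])
print(counts)  # [13 25 14] items priced $1-10, $11-20, $21-30
```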
Data Reduction
Sampling
A large data set is represented by a smaller random data sample (see the sketch after this list)
Simple random sample without replacement (SRSWOR) of size s: draw s of the N tuples (s < N)
◦all tuples are equally likely to be sampled
Simple random sample with replacement (SRSWR) of size s: similar to SRSWOR, but each time a tuple is drawn, it is recorded and then placed back, so it may be drawn again
Cluster sample: if the tuples are grouped into M "clusters," an SRS of s clusters can be obtained
Stratified sample: if the tuples are divided into strata, a stratified sample is generated by obtaining an SRS at each stratum
◦e.g. a stratum is created for each customer age group
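Minimal sketches of the four schemes using NumPy's random generator; the sizes, cluster counts, and strata below are illustrative:

```python
# Sampling sketches: SRSWOR, SRSWR, cluster, and stratified sampling
# over tuple indices 0..N-1.
import numpy as np

rng = np.random.default_rng(seed=42)
N, s = 1000, 100
srswor = rng.choice(N, size=s, replace=False)   # without replacement
srswr  = rng.choice(N, size=s, replace=True)    # with replacement: repeats possible

# Cluster sample: pick whole clusters out of M, then keep all their tuples.
M, s_clusters = 20, 4
chosen_clusters = rng.choice(M, size=s_clusters, replace=False)

# Stratified sample: SRS within each stratum (e.g. one stratum per age group).
strata = {"youth": np.arange(0, 300),
          "adult": np.arange(300, 800),
          "senior": np.arange(800, 1000)}
stratified = {name: rng.choice(idx, size=10, replace=False)
              for name, idx in strata.items()}
```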
Transformation and Discretization
Transformation Strategies
Smoothing: binning, regression
Attribute construction
Aggregation
Normalization: attribute data scaled to fall within a smaller range, such as -1.0 to 1.0
Discretization: raw values of a numeric attribute (e.g. age) replaced by interval labels (e.g. 0–10, 11–20) or conceptual labels (e.g. youth, adult, senior)
Concept hierarchy generation: e.g. street generalized to higher-level concepts (city or country)
Transformation and Discretization
Transformation by Normalization
Helps avoid dependence on the choice of measurement units
Gives all attributes equal weight
Methods:
min-max normalization
z-score normalization
Transformation and Discretization
Transformation by Normalization
Min-max normalization: maps a value v of an attribute A from the original range [min_A, max_A] to a new range [new_min_A, new_max_A]:
v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
z-score normalization: values of A are normalized based on the mean Ā and standard deviation σ_A of A:
v' = (v − Ā) / σ_A
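A small sketch of both methods; the income figures are illustrative:

```python
# Min-max and z-score normalization of a single value.
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    return (v - mean) / std

# e.g. an income of 73,600 with min 12,000 and max 98,000:
print(min_max(73_600, 12_000, 98_000))   # ~0.716, mapped into [0, 1]
# with mean 54,000 and standard deviation 16,000:
print(z_score(73_600, 54_000, 16_000))   # 1.225 standard deviations above the mean
```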
Transformation and Discretization
Concept Hierarchy
A concept hierarchy organizes concepts (i.e. attribute values) hierarchically
Concept hierarchies facilitate drilling and rolling to view data at multiple granularities
Concept hierarchy formation: recursively reduce the data by collecting and replacing low-level concepts (e.g. age values) with higher-level concepts (e.g. age groups: youth, adult, or senior)
Concept hierarchies can be explicitly specified by domain experts
They can also be formed automatically, via discretization for numeric data and via the techniques below for nominal data
Transformation and Discretization
Concept Hierarchy
For nominal data:
Specification of a partial ordering of attributes explicitly at the schema level by users or experts
◦e.g. street < city < province_or_state < country
Specification of a set of attributes, but not of their partial ordering: the ordering is then generated automatically by the system
◦e.g. for a location hierarchy, country contains far fewer distinct values than street, so a concept hierarchy can be generated automatically based on the number of distinct values per attribute in the given attribute set (see the sketch after this list)
◦This heuristic does not hold for all concepts: for time, year may have ~20 distinct values, month 12, and day of week only 7, yet year belongs at the top of the hierarchy
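A sketch of the distinct-value heuristic with pandas; the toy location data is illustrative:

```python
# Order nominal attributes by number of distinct values: the attribute
# with the fewest distinct values goes to the top of the hierarchy.
import pandas as pd

df = pd.DataFrame({
    "country": ["Egypt", "Egypt", "Canada", "Canada"],
    "city":    ["Damanhour", "Cairo", "Toronto", "Toronto"],
    "street":  ["St A", "St B", "St C", "St D"],
})
levels = sorted(df.columns, key=lambda c: df[c].nunique())  # fewest first
print(" < ".join(reversed(levels)))  # street < city < country
```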
Summary
Cleaning: binning, regression, outlier analysis
Integration: correlation analysis
Reduction: regression, histograms, clustering, attribute construction, wavelet transforms, PCA, attribute subset selection, sampling
Transformation/Discretization: binning, regression, correlation analysis, histogram analysis, clustering, attribute construction, aggregation, normalization, concept hierarchy